VMware vSphere® Metro Storage Cluster (vMSC)
2.40

Problems it solves

Aging IT infrastructure

Values

Reduce Costs

Enhance Staff Productivity

Ensure Security and Business Continuity

VMware vSphere® Metro Storage Cluster (vMSC)

A VMware vSphere Metro Storage Cluster configuration is a vSphere-certified solution that combines replication with array-based clustering. These solutions are typically deployed in environments where the distance between data centers is limited, often metropolitan or campus environments.

Description

VMware vSphere® Metro Storage Cluster (vMSC) is a specific configuration within the VMware Hardware Compatibility List (HCL). These configurations are commonly referred to as stretched storage clusters or metro storage clusters and are implemented in environments where disaster and downtime avoidance is a key requirement. This best practices document was developed to provide additional insight and information for operation of a vMSC infrastructure in conjunction with VMware vSphere. It explains how vSphere handles specific failure scenarios and discusses various design considerations and operational procedures.

vMSC infrastructures are implemented with the goal of reaping the same benefits that high-availability clusters provide to a local site, but in a geographically dispersed model with two data centers in different locations. A vMSC infrastructure is essentially a stretched cluster. The architecture is built on the premise of extending what is defined as “local” in terms of network and storage so that these subsystems span geographies, presenting a single, common base set of infrastructure resources to the vSphere cluster at both sites. In essence, it stretches storage and the network between sites.

The primary benefit of a stretched cluster model is that it enables fully active and workload-balanced data centers to be used to their full potential, while gaining the capability to migrate virtual machines (VMs) between sites with VMware vSphere vMotion® and VMware vSphere Storage vMotion® for on-demand and nonintrusive mobility of workloads. The capability of a stretched cluster to provide this active balancing of resources should always be the primary design and implementation goal. Although often associated with disaster recovery, vMSC infrastructures are not recommended as primary solutions for pure disaster recovery.

Stretched cluster solutions offer the following benefits:

• Workload mobility
• Cross-site automated load balancing
• Enhanced downtime avoidance
• Disaster avoidance

Technical Requirements and Constraints

• Storage connectivity using Fibre Channel, iSCSI, NFS, and FCoE is supported.
• The maximum supported network latency between sites for the VMware ESXi™ management networks is 10ms round-trip time (RTT).
• vSphere vMotion and vSphere Storage vMotion support a maximum latency of 150ms as of vSphere 6.0, but this is not intended for stretched clustering usage.
• The maximum supported latency for synchronous storage replication links is 10ms RTT. Refer to documentation from the storage vendor because the maximum tolerated latency is lower in most cases. The most commonly supported maximum RTT is 5ms.
• The ESXi vSphere vMotion network requires a redundant network link with a minimum bandwidth of 250Mbps.

The storage requirements are slightly more complex. A vSphere Metro Storage Cluster requires what is in effect a single storage subsystem that spans both sites. In this design, a given datastore must be accessible (that is, readable and writable) simultaneously from both sites. Further, when problems occur, the ESXi hosts must be able to continue to access datastores from either array transparently and with no impact to ongoing storage operations.

This precludes traditional synchronous replication solutions because they create a primary-secondary relationship between the active (primary) LUN where data is being accessed and the secondary LUN that is receiving replication. To access the secondary LUN, replication is stopped or reversed, and the LUN is made visible to hosts. This “promoted” secondary LUN has a completely different LUN ID and is essentially a newly available copy of a former primary LUN. This type of solution works for traditional disaster recovery configurations because it is expected that VMs must be started up on the secondary site.

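The latency and bandwidth limits listed under Technical Requirements and Constraints can be turned into a simple pre-deployment check. The following Python sketch is illustrative only: the `InterSiteLink` structure, the `check_vmsc_requirements` function, and all parameter names are assumptions made for this example and are not part of any VMware tool or API. The numeric limits are the ones stated in the text; the storage RTT limit can be tightened to the value published by the storage vendor.

```python
from dataclasses import dataclass

# Limits taken from the requirements listed in the text.
MAX_MGMT_RTT_MS = 10.0              # ESXi management networks between sites
MAX_SYNC_REPLICATION_RTT_MS = 10.0  # synchronous storage replication links
MIN_VMOTION_BANDWIDTH_MBPS = 250.0  # vMotion network minimum bandwidth


@dataclass
class InterSiteLink:
    """Measured inter-site link characteristics (hypothetical structure)."""
    mgmt_rtt_ms: float
    storage_replication_rtt_ms: float
    vmotion_bandwidth_mbps: float
    vmotion_link_redundant: bool


def check_vmsc_requirements(link, vendor_max_storage_rtt_ms=MAX_SYNC_REPLICATION_RTT_MS):
    """Return a list of violations; an empty list means every check passed."""
    # The vendor limit can only tighten the general 10ms limit, never relax it.
    storage_limit = min(vendor_max_storage_rtt_ms, MAX_SYNC_REPLICATION_RTT_MS)
    violations = []
    if link.mgmt_rtt_ms > MAX_MGMT_RTT_MS:
        violations.append(
            f"management RTT {link.mgmt_rtt_ms}ms exceeds {MAX_MGMT_RTT_MS}ms")
    if link.storage_replication_rtt_ms > storage_limit:
        violations.append(
            f"storage replication RTT {link.storage_replication_rtt_ms}ms "
            f"exceeds {storage_limit}ms")
    if link.vmotion_bandwidth_mbps < MIN_VMOTION_BANDWIDTH_MBPS:
        violations.append(
            f"vMotion bandwidth {link.vmotion_bandwidth_mbps}Mbps is below "
            f"{MIN_VMOTION_BANDWIDTH_MBPS}Mbps")
    if not link.vmotion_link_redundant:
        violations.append("vMotion network link is not redundant")
    return violations


if __name__ == "__main__":
    # Example measurements (made-up values) checked against a 5ms vendor limit.
    measured = InterSiteLink(4.0, 7.0, 1000.0, True)
    for problem in check_vmsc_requirements(measured, vendor_max_storage_rtt_ms=5.0):
        print(problem)
```

In this example the link would pass the general 10ms replication limit but fail the stricter 5ms vendor limit, which is why the vendor documentation should be consulted before sizing the inter-site links.
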
The vMSC configuration requires simultaneous, uninterrupted access to enable live migration of running VMs between sites. The storage subsystem for a vMSC must be readable and writable from both locations simultaneously. All disk writes are committed synchronously at both locations to ensure that data is always consistent regardless of the location from which it is being read. This storage architecture requires significant bandwidth and very low latency between the sites in the cluster. Increased distances or latencies cause delays in writing to disk and a dramatic decline in performance. They also preclude successful vMotion migration between cluster nodes that reside in different locations.
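
To illustrate why inter-site latency affects write performance so directly, the following Python sketch uses a deliberately simplified model, which is an assumption of this example and not a description of any particular storage array. It assumes that the local and remote commits proceed in parallel and that a write is acknowledged only once both sites have committed it, so the remote path costs one round trip plus the remote commit time. The function name and the sample timings are hypothetical.

```python
def sync_write_latency_ms(local_commit_ms, remote_commit_ms, inter_site_rtt_ms):
    """Estimate acknowledged write latency under synchronous replication.

    Simplified model (assumption): the local commit and the remote commit run
    in parallel, and the write is acknowledged when the slower path completes.
    The remote path costs one inter-site round trip plus the remote commit.
    """
    remote_path_ms = inter_site_rtt_ms + remote_commit_ms
    return max(local_commit_ms, remote_path_ms)


if __name__ == "__main__":
    # Made-up commit times of 1ms at each site, with increasing inter-site RTT.
    for rtt_ms in (0.5, 5.0, 10.0):
        latency = sync_write_latency_ms(1.0, 1.0, rtt_ms)
        print(f"RTT {rtt_ms}ms -> estimated write latency {latency}ms")
```

Under this model, every millisecond of RTT is added directly to each acknowledged write, which is consistent with the observation above that longer distances degrade write performance.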