The Promise
By the end of this module you will:
- Design a multi-site DR architecture in 30 minutes. Pick the replication mode (Async, NearSync, Metro), define RPO and RTO targets, choose between Protection Domains and Protection Policies, and produce a sized bandwidth requirement. One of the highest-leverage SA conversations.
- Pass roughly 22% of NCP-MCI and 25% of NCM-MCI. Data protection is one of the heaviest single-topic weights on these exams. NCM-MCI labs frequently include configure-a-policy / set-up-replication / test-failover / troubleshoot-a-stalled-job DR scenarios.
- Defend Recovery Plans against an SRM-loyal customer. SRM has 15+ years of polish. Recovery Plans is younger. The honest comparison is that for AHV-native deployments Recovery Plans is integrated and capable; for established SRM shops, the migration is real work and coexistence is often the right answer.
- Make the cloud-DR case using NC2 (Nutanix Cloud Clusters on AWS or Azure) for customers who want a DR site without a second datacenter. By 2026, this is one of the more compelling Nutanix-specific stories.
- Walk the snapshot, replication, and recovery sequence end-to-end. When DR breaks, the customer's first call is to you. Know where to look: snapshot schedule status, replication queue depth, Recovery Plan validation, network path between sites.
- Make RPO/RTO/cost tradeoffs explicit. The customer who wants 1-minute RPO and 30-minute RTO and zero cost increase is asking for something that does not exist. Translate business requirements into the right product mix at a real budget.
DR is the dimension that decides enterprise deals. Customers who survived a real outage have opinions; customers who haven't are about to. Either way, when DR comes up, you need to know this material cold.
Foundation: What You Already Know
You have built or maintained DR for VMware. The pieces:
- VMware snapshots for point-in-time recovery (with the consolidation pain Module 3 already covered).
- vSphere Replication for VM-level replication to a secondary site.
- Storage array replication (NetApp SnapMirror, EMC RecoverPoint, Pure ActiveCluster) for storage-level replication, often with better RPO than VR.
- Site Recovery Manager (SRM) for orchestrating failover: VM boot order, IP remapping, runbook automation, test recovery.
- Backup tooling (Veeam, Commvault, Rubrik, Cohesity) layered on top for longer-retention backup and granular recovery.
You know the operational realities. RPO depends on the replication mechanism. RTO depends on how fast you can boot VMs at the DR site, with what network reconfiguration, with what application checks. SRM runbook tests are the only thing that proves DR works, and most customers do not run them as often as they should.
Hold that experience. Nutanix's data protection stack reorganizes those pieces. Replication is built into the platform (no separate appliance). Orchestration is integrated (no separate SRM purchase). Snapshots are DSF-native (no consolidation pain). Cloud DR is NC2 (no need to build a second datacenter). The pieces are different but the operational concepts transfer.
Core Content
Snapshots Revisited: Crash-Consistent vs Application-Consistent
Module 3 covered DSF snapshots as native, instant, no-I/O-penalty operations. Layer the consistency dimension on top.
Crash-consistent snapshots. Capture the state of the vDisk at a moment in time. From a guest's perspective this is equivalent to pulling the power cord and the system coming back up: file systems may need recovery on first boot, in-memory data is lost, transactional databases may roll back recent transactions. For most workloads, crash-consistent recovery is sufficient.
Application-consistent snapshots. The snapshot is taken after the application has flushed its in-memory state to disk. On Windows, this requires VSS (Volume Shadow Copy Service) coordination, which Nutanix Guest Tools (NGT) provides. On Linux, application-consistent snapshots typically rely on application-level mechanisms (a database's quiesce command) coordinated through NGT.
When does the distinction matter?
- Database VMs (SQL Server, Oracle, PostgreSQL). Application-consistent snapshots are strongly preferred for direct restoration. Crash-consistent recovery requires the database's startup recovery to succeed, which usually works but is slower and occasionally messy.
- File servers. Crash-consistent is fine for general SMB/NFS workloads.
- Application servers (web, app tier). Crash-consistent is fine if the application is stateless or has its state in a backed-up database.
- Active Directory domain controllers. AD has its own consistency protocols; crash-consistent is acceptable but Microsoft's preferred approach involves database-aware snapshots.
Protection Domains vs Protection Policies
This is the architectural shift that matters operationally and exam-wise. Read carefully.
Protection Domains (PDs). The legacy construct, configured in Prism Element. A PD is a group of VMs (and/or vDisks) that share a single replication and snapshot schedule. You create a PD, add VMs to it, attach a remote site, configure the schedule (e.g., snapshot every hour, replicate every 4 hours), and the PD enforces it. PDs have been the workhorse of Nutanix DR for years.
Protection Policies. The modern construct, configured in Prism Central. Policies are category-driven (Module 4). A Protection Policy says: "All VMs with Environment: Production get hourly snapshots retained for 7 days, with NearSync replication to the DR site." VMs are assigned to the policy by their categories, not by manual addition. Policies are tied to Recovery Plans (Nutanix Disaster Recovery, formerly Leap) for orchestrated failover.
| Dimension | Protection Domain | Protection Policy |
|---|---|---|
| Where configured | Prism Element | Prism Central |
| Membership model | Manual VM addition | Category-driven (auto) |
| Failover orchestration | Native PD failover (basic) | Recovery Plans (rich orchestration) |
| Multi-site management | Per-PD, per-cluster | Centralized in PC |
| Recommended for new deployments | No | Yes |
Existing customers: PDs continue to work. There is no forced migration. Many production environments still run PDs because the migration to Policies is a project that requires planning. New deployments should start with Policies.
Async Replication: The Default Workhorse
Async replication is the bread-and-butter Nutanix DR mechanism. The pattern:
- Take a snapshot at the source cluster on schedule (e.g., hourly).
- Compute the delta between this snapshot and the previous replicated one.
- Send the delta over the network to the destination cluster.
- The destination cluster materializes the new snapshot.
- Retain snapshots per the configured retention policy.
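The cycle above can be sketched in miniature. This is an illustrative model of delta-based replication, not Nutanix source code; "extents" here are just dictionary entries standing in for changed disk regions:

```python
# Illustrative sketch of the async replication cycle (not Nutanix source code).
# Each cycle snapshots the source, diffs against the last replicated snapshot,
# and ships only the changed extents to the destination.

def changed_extents(prev_snapshot: dict, curr_snapshot: dict) -> dict:
    """Delta = extents that are new or modified since the last replicated snapshot."""
    return {
        extent: data
        for extent, data in curr_snapshot.items()
        if prev_snapshot.get(extent) != data
    }

def replicate_cycle(source: dict, last_replicated: dict, destination: dict) -> dict:
    """Ship only the delta; the destination materializes the new snapshot."""
    delta = changed_extents(last_replicated, source)
    destination.update(delta)   # send + apply the delta at the DR site
    return dict(source)         # this snapshot becomes the new baseline

# Example: 1000 extents with a 2% change rate -> only 20 extents cross the wire.
baseline = {i: f"v1-{i}" for i in range(1000)}
current = dict(baseline)
current.update({i: f"v2-{i}" for i in range(20)})   # 20 changed extents

dest = dict(baseline)
new_baseline = replicate_cycle(current, baseline, dest)
print(len(changed_extents(baseline, current)))   # -> 20
print(dest == current)                            # -> True: destination converged
```

The point of the sketch is the bandwidth property in the next list: what crosses the wire scales with the change rate, not with the total dataset.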
Characteristics:
- Minimum RPO: typically 1 hour, configurable down to 15 minutes in some scenarios. RPO is governed by the snapshot/replication interval, not by inherent technology limits.
- Bandwidth efficiency: delta-based, so steady-state bandwidth scales with change rate, not with total data size.
- Network requirements: any IP-routable connection between sites. Latency is forgiving (works fine over WAN with hundreds of milliseconds RTT).
- Cost: lowest of the three replication modes. Suitable for most workloads.
Topology options:
- One-to-one. Single primary, single DR site.
- One-to-many. Single primary replicating to multiple DR sites (different RPOs to each).
- Many-to-one. Multiple production clusters replicating to a consolidated DR site (common for ROBO consolidation).
- Bi-directional. Two clusters protecting each other.
Most customers start here. Most stay here for the bulk of their workloads.
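The bandwidth characteristic above lends itself to a back-of-envelope sizing check. This sketch assumes steady-state replication traffic tracks the daily change rate; the change-rate percentage and headroom factor are illustrative inputs, and this is not an official Nutanix sizing tool:

```python
# Back-of-envelope Async bandwidth sizing (illustrative, not an official tool).
# Steady-state bandwidth scales with the daily change rate, not total capacity;
# the headroom factor covers bursts, retries, and protocol overhead.

def async_bandwidth_mbps(protected_tb: float, daily_change_pct: float,
                         headroom: float = 1.3) -> float:
    changed_gb_per_day = protected_tb * 1024 * daily_change_pct / 100
    mbps = changed_gb_per_day * 8 * 1024 / 86400   # GB/day -> Mb/s sustained
    return round(mbps * headroom, 1)

# 50 TB protected at a 3% daily change rate, with 30% headroom:
print(async_bandwidth_mbps(50, 3))   # -> 189.3 Mb/s sustained
```

Run the same arithmetic with the customer's measured change rate before quoting a link size; measured change rates routinely surprise people.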
NearSync Replication: When 1-Hour RPO Isn't Enough
NearSync is Nutanix's "almost-synchronous" replication mode. It uses LWS (Light-Weight Snapshots) under the hood, taking very frequent (sub-minute) metadata-level micro-snapshots and continuously replicating them. The LWS store is allocated on the cluster's SSD tier; that is where every NearSync-protected change lands first, before propagating to the destination cluster.
Characteristics:
- RPO: as low as 20 seconds in optimal configurations; commonly designed for 1-15 minute RPO.
- Bandwidth requirements: higher than Async. The replication is more continuous and less coalesced. Plan for sustained bandwidth roughly proportional to the workload's write rate.
- Network requirements: typically <5 ms RTT to the destination, though specific platform versions vary. Some configurations relax this.
- Cluster overhead: higher than Async. NearSync places more load on Stargate and Curator at the source cluster.
- Cost: higher than Async, lower than Metro.
When to use NearSync:
- Tier-1 production databases where 1-hour RPO is too long.
- Compliance-driven workloads with sub-15-minute RPO mandates.
- Customers with adequate bandwidth and acceptable latency between sites.
The honest gotcha: NearSync's resource cost on the cluster is real. Customers who try to NearSync-protect their entire estate often find they need bigger CVMs or face cluster headroom issues. Use NearSync for the workloads that genuinely need it; leave the rest on Async.
Metro Availability: Synchronous, Zero RPO
Metro Availability is true synchronous replication: every write at the source is replicated to the destination before it is acknowledged. RPO is zero (no data loss on any single-site failure).
Characteristics:
- RPO: zero. Synchronous replication.
- Latency requirement: ≤5 ms RTT is the documented hard ceiling. Production designs typically target ≤3.5 ms RTT under load, reserving the 5 ms figure as headroom. P99.9 latency under concurrent I/O matters more than the average; sustained micro-bursts can push a link that averages 3 ms over the 5 ms threshold during peak load. Real-world deployments are metro-area distances (campus, dual-datacenter within a city).
- Topology: active-standby (typical) or active-active (advanced configurations).
- Witness VM: required for split-brain protection. Witness runs on a third site (a small cluster, a separate Nutanix instance, or sometimes a public-cloud VM) and provides quorum during partition events.
- Cost: highest of the three modes. Bandwidth + tightly-coupled networking + witness infrastructure.
When to use Metro:
- Mission-critical workloads with zero-data-loss requirements.
- Active-active datacenter architectures where workloads run in both sites simultaneously.
- Compliance frameworks that mandate synchronous replication (less common, but exists).
Important constraint: Metro is only useful within metro-area latency (typically <100 km). It is not a long-distance DR solution. For wide-area DR, you combine Metro (between two close sites) with Async or NearSync to a third remote site for full geographic resilience.
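The tail-latency point above can be made concrete with a link-qualification sketch. The RTT samples below are invented for illustration, and the percentile handling is simplified (nearest-rank on sorted samples):

```python
# Hedged sketch: qualifying a link for Metro Availability against the 5 ms RTT
# ceiling and a ~3.5 ms design target. Tail latency (p99.9) under load matters
# more than the average, so the check evaluates the distribution, not the mean.

def percentile(samples_ms: list, pct: float) -> float:
    """Simplified nearest-rank percentile over sorted samples."""
    ordered = sorted(samples_ms)
    idx = min(len(ordered) - 1, round(pct / 100 * (len(ordered) - 1)))
    return ordered[idx]

def metro_link_ok(samples_ms: list, ceiling_ms: float = 5.0,
                  design_target_ms: float = 3.5) -> dict:
    p999 = percentile(samples_ms, 99.9)
    avg = sum(samples_ms) / len(samples_ms)
    return {
        "avg_ms": round(avg, 2),
        "p99.9_ms": round(p999, 2),
        "within_ceiling": p999 <= ceiling_ms,
        "within_design_target": p999 <= design_target_ms,
    }

# A link that averages ~2.5 ms but micro-bursts past 5 ms fails the check:
samples = [2.5] * 995 + [6.2] * 5
result = metro_link_ok(samples)
print(result["avg_ms"], result["within_ceiling"])   # -> 2.52 False
```

This is why "our link averages 3 ms" is not a sufficient answer in a Metro design review: the average passes while the tail fails.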
Diagram: Replication Topologies (Async / NearSync / Metro)
The Cycle, Frame Two: DR as RPO/RTO Mapped to Products
For an operations leader, the durable DR frame is mapping business requirements to technology choices.
| Application Tier | Business RPO | Business RTO | Recommended Approach |
|---|---|---|---|
| Mission-critical, zero-data-loss | 0 | <30 min | Metro Availability (campus) + Async to third site |
| Tier-1 production DB | 1-15 min | 1-2 hours | NearSync to DR cluster + Recovery Plan |
| Production general-purpose | 1-4 hours | 2-4 hours | Async + Recovery Plan |
| Test / Dev | 8-24 hours | 4-8 hours | Async with longer interval, or no DR |
| Ephemeral / stateless | n/a | redeploy | No replication; redeploy from source-of-truth |
This is the design conversation in 15 minutes. Walk the customer through their tiers, agree on RPO/RTO targets, map to the technology, and the architecture writes itself.
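The table's tier-to-mode mapping can be sketched as a small decision function. The thresholds reflect the typical ranges discussed in this module, not hard product limits:

```python
# Sketch of the RPO-to-replication-mode mapping from the table above.
# Thresholds are the typical ranges this module discusses, not product limits.

def recommend_mode(rpo_seconds):
    """Map an RPO target in seconds (None = ephemeral/stateless) to an approach."""
    if rpo_seconds is None:
        return "No replication; redeploy from source-of-truth"
    if rpo_seconds == 0:
        return "Metro Availability (campus) + Async to third site"
    if rpo_seconds <= 15 * 60:
        return "NearSync"
    return "Async"

print(recommend_mode(5 * 60))    # 5-minute RPO -> NearSync
print(recommend_mode(4 * 3600))  # 4-hour RPO -> Async
```

Agree on the RPO number first; the mode falls out of the number, not the other way around.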
The Cycle, Frame Three: Recovery Plans (NDR) as the SRM Replacement
Recovery Plans are the runbook construct inside Nutanix Disaster Recovery (the current product name; formerly branded Leap). The product was renamed from Leap to Nutanix Disaster Recovery a few years back; you will still see "Leap" in older docs, in customer vocabulary, and in some current-day Nutanix product surfaces. Recovery Plans live in Prism Central and define:
- What VMs are protected (via category membership).
- The startup order (which VMs come up first, second, third).
- Network mapping (production VLAN 100 maps to DR VLAN 200).
- IP address remapping (or DHCP-based reassignment at the DR site).
- Pre-checks and post-checks (run a script before failover, run a script after).
- Test failover capability (run a failover into an isolated network, validate, tear down).
- Manual or automated failover triggers.
This is the SRM equivalent. The functional comparison:
| Feature | Site Recovery Manager (SRM) | Recovery Plans (NDR / Leap) |
|---|---|---|
| Hypervisor support | ESXi only | AHV native; ESXi-on-Nutanix supported via integration |
| Replication source | vSphere Replication or array-based | Native Nutanix (Async, NearSync, Metro) |
| Licensing | Separate VMware product | Bundled with Nutanix Cloud Manager (varies by tier) |
| Test failover | Yes, mature | Yes |
| Runbook orchestration | Mature, deeply customizable | Capable, less customizable in advanced edge cases |
| IP remapping / DHCP | Yes | Yes |
| Pre/post-failover scripts | Yes | Yes |
| Multi-site planning | Yes | Yes |
| Maturity | 15+ years | 5+ years, rapidly improving |
| Cross-site management | vCenter-driven | Prism Central |
The honest comparison: SRM is more mature and has more advanced runbook customization for complex scenarios. Recovery Plans is integrated, free or bundled, and simpler to operate. For most enterprise DR requirements, Recovery Plans is sufficient. For customers with established SRM deployments, the migration is real work and the coexistence pattern (SRM on ESXi-on-Nutanix, Recovery Plans for AHV workloads) is often the right answer.
Diagram: Protection Policies and Categories
Tag a VM with Environment: Production and it is protected within minutes. No manual addition. No forgotten VMs. Recovery Plans orchestrate the failover of category-protected VMs.
NC2: DR to the Cloud Without a Second Datacenter
NC2 (Nutanix Cloud Clusters) runs the Nutanix platform on AWS or Azure bare-metal hosts. From the platform's perspective, an NC2 cluster looks like any other Nutanix cluster: AOS, AHV, CVMs, DSF, Prism. From the customer's perspective, it is Nutanix infrastructure they pay for as a cloud-consumption model rather than as on-premises hardware.
For DR specifically, NC2 enables:
- Replicate from on-prem Nutanix to NC2 in cloud. Use the same Async, NearSync (where supported), or Metro mechanisms.
- Failover to cloud without a second datacenter. When primary fails, VMs come up on NC2 in the cloud region.
- Variable cost. Pay for cloud capacity continuously (active replication target) or with hibernation patterns (cluster spun down most of the time, spun up on failover or DR test).
- Fast scaling. Add NC2 capacity in cloud at the rate of cloud provisioning, not at the rate of physical hardware procurement.
The economics: NC2 is meaningfully cheaper than building and maintaining a second physical datacenter for many mid-market customers. For large customers, the math depends on workload size, retention requirements, and whether they have existing colo space they would otherwise consolidate.
The honest constraints:
- Cloud bare-metal pricing varies. Some workloads are cheap in cloud; some are expensive.
- Replication bandwidth from on-prem to cloud is real cost (egress fees apply on failback in some models).
- Failover RTO depends on whether the NC2 cluster is hot, warm, or cold.
- Some workloads have data-sovereignty constraints that prevent cloud DR.
Diagram: Recovery Plan Failover Flow
Test Failover: The Feature That Customers Actually Use
The most important DR feature is the one customers neglect: testing.
Recovery Plans support test failover: run the entire failover sequence into an isolated network at the DR site, validate that everything comes up correctly, and tear down without affecting production. The test creates an isolated VLAN at the DR site, brings VMs up there, runs the configured checks, and reports.
Why this matters: DR runbooks that have not been tested in 12+ months frequently do not work when a real failover comes. The state of the world changes: VMs are added, network configurations drift, IP allocations change, application dependencies shift. The runbook decays.
Recommended customer cadence: quarterly test failovers minimum, monthly for critical workloads. The test takes 1-2 hours typically. The customer's DR program is real if and only if they actually run these.
What Nutanix DR Genuinely Lacks vs Mature SRM Deployments
Honest gap list. Read it twice.
- Some advanced runbook customization. SRM has 15+ years of accumulated capabilities for very complex orchestration patterns (cross-vendor app integration, complex pre-checks, vendor-specific scripted callouts). Recovery Plans handles the typical cases well; some advanced edge cases require scripted extensions.
- Cross-vendor replication source flexibility. SRM can use array-based replication from a wide range of arrays. Recovery Plans is tied to Nutanix's native replication. For customers who want to keep their array's replication and orchestrate failover via SRM, that's a reason to keep SRM-on-ESXi-on-Nutanix.
- Reporting depth. SRM's reporting around DR readiness, test history, and compliance posture has had more time to mature. Prism's reporting is increasingly capable but younger.
- Cross-environment scope. SRM deployments often span heterogeneous infrastructure that includes non-Nutanix elements; Recovery Plans is Nutanix-centric.
For typical mid-market and enterprise general-purpose DR requirements, none of these are deal-breakers. For customers with established SRM and complex multi-vendor DR, the coexistence pattern is the durable answer.
What Nutanix DR Has That SRM Does Not
- Integrated platform. DR is part of the platform, not a separate purchased product.
- DSF-native snapshots underneath. No I/O penalty, no consolidation, instant.
- Category-driven protection policies. New VMs auto-enroll based on tagging.
- NC2 cloud DR option. SRM has cloud-DR via VMware Cloud, but the integration is more recent and the licensing is separate.
- Single management plane. Replication, recovery plans, and DR test in the same UI as compute and storage.
- Bundled licensing for basic capabilities. Recovery Plans and basic replication included; advanced features in NCM tiers.
Lab Exercise: Build a Protection Policy and Recovery Plan
- Take a manual VM snapshot. From Prism Central, select a VM, choose "Take Snapshot." Note the type options: crash-consistent (default) or application-consistent (requires NGT in the guest, which lab VMs may not have).
- Install NGT on a Linux VM. SSH in, then mount and install:
sudo mount /dev/cdrom /mnt/cdrom
sudo /mnt/cdrom/installer/linux/install_ngt.py
- Take an application-consistent snapshot. With NGT installed, the snapshot UI offers application-consistent. Take one. Verify it succeeds.
- Create categories if you haven't already (Module 4 lab):
  - Key: Environment, Values: Production, Development, Test
  - Key: BackupTier, Values: Gold, Silver, Bronze
- Tag VMs with categories. Apply Environment: Production and BackupTier: Gold to your lab VM.
- Create a Protection Policy in Prism Central: Name Lab-Production-Policy, match VMs with category Environment: Production, snapshot every 1 hour retain 7 days, replication disabled (single-cluster lab) or to a second cluster if available.
- Verify policy enrollment. Confirm the VM is automatically included based on category.
- Tag a second VM with Environment: Production. Confirm it auto-enrolls without manual addition.
- (Multi-cluster, if available) Pair two clusters. Configure replication. Validate snapshots transfer.
- Create a Recovery Plan. Recovery Plans > Create. Define name, source/target, VMs (via category), startup order (DB > App > Web), network mapping, optional scripts.
- Run a test failover (multi-cluster, optional). The platform spins up VMs at the DR site in an isolated network, runs your defined checks, and reports.
- Inspect Curator's role in protection. From a CVM:
curator_cli get_curator_state
Note the protection-related background tasks: snapshot reclamation, replication queue management, retention enforcement.
What this teaches you:
- The snapshot-consistency distinction in practice.
- Category-driven Protection Policy enrollment.
- Recovery Plan structure and configuration.
- The CLI surface for protection diagnostics.
Customer-demo angle: Steps 4-7 are the customer-demo flow for category-driven protection. Show a customer how tagging a VM with Environment: Production automatically enrolls it. The "no manual addition" insight lands viscerally.
Practice Questions
Twelve questions. Six knowledge MCQ, four scenario MCQ, two NCX-style design questions. Read each, answer in your head, then click to reveal.
What is required for an application-consistent snapshot of a Windows VM running on AHV?
Why this answer
Application-consistent snapshots on Windows require VSS coordination, which NGT provides. The application must be VSS-aware (most major applications including SQL Server, Exchange, AD are).
Why not the others
- A) Default snapshots are crash-consistent; application-consistent requires NGT.
- C) Power-off would be cold backup, not application-consistent in the live sense.
- D) NGT and the platform handle this natively; external products may use the same NGT integration but are not required.
The trap
A is the default-mental-model trap. Snapshot consistency is a configuration, not a default behavior.
Which of the following correctly describes the relationship between Protection Domains and Protection Policies?
Why this answer
PDs (PE-based, manual) and Policies (PC-based, category-driven) are both supported. Policies are recommended for new deployments; PDs continue to work for existing deployments.
Why not the others
- A) PDs continue to be supported.
- C) They are distinct constructs with different membership models and management surfaces.
- D) Both can do snapshots and replication.
The trap
A is tempting if you assume "newer must have replaced older." Nutanix maintains both for compatibility with existing deployments.
What is the minimum typical RPO for Async replication?
Why this answer
Async replication's typical minimum RPO is 1 hour, configurable down to 15 minutes in some scenarios. Async is the default for general-purpose DR.
Why not the others
- A) That is Metro Availability.
- B) That is NearSync's territory.
- D) Async can certainly do 24 hours, but the minimum is much shorter.
The trap
B is the seductive answer for someone who confuses Async and NearSync. Memorize: Async = 1 hour typical (15 min minimum); NearSync = 20 seconds to 15 minutes; Metro = 0.
Metro Availability requires which of the following?
Why this answer
Metro is synchronous, so it requires very low latency (typically <5 ms RTT, metro-area distance). Witness VM at a third site provides quorum during partition events.
Why not the others
- A) WAN distance is incompatible with synchronous replication; Metro is metro-area only.
- C) Hardware compatibility is general but not specifically a Metro requirement.
- D) NC2 is unrelated to the Metro requirement.
The trap
A reflects a misunderstanding of Metro's purpose. Metro is for short-distance, zero-RPO requirements. Long-distance DR uses Async or NearSync.
A customer needs DR with 1-minute RPO for a critical SQL Server. Which replication mode should you recommend?
Why this answer
NearSync's RPO range (20 seconds to 15 minutes) fits the 1-minute target. It does not require Metro's strict latency budget but provides much tighter RPO than Async.
Why not the others
- A) Async typically tops out at 15-minute minimum RPO; 1-minute is below its operational range.
- C) Metro provides zero RPO but requires <5 ms latency, which is not specified or required for 1-minute RPO. Metro is overkill and constrains the deployment.
- D) Not a real option for any production deployment.
The trap
C is the temptation to "use the strongest option." Metro's constraints (latency, witness, cost) are real and unwarranted for a 1-minute RPO that NearSync can meet.
A customer running SRM on ESXi for DR is moving to Nutanix. What is the recommended approach?
Why this answer
This is the operationally correct approach. SRM continues to work on ESXi-on-Nutanix. Recovery Plans is recommended for AHV deployments and new workloads. Migration is a project, not a single event.
Why not the others
- A) Forced migration ignores the customer's installed base and operational continuity.
- C) Scripts are not a replacement for an orchestration product.
- D) Recovery Plans does not manage SRM.
The trap
A reflects the impulse to consolidate immediately. The right answer respects existing investment and migrates at the customer's pace.
What does NC2 provide for DR scenarios?
Why this answer
NC2 (Nutanix Cloud Clusters) runs the Nutanix platform on cloud bare-metal hardware (AWS or Azure). It enables DR-to-cloud without a second physical datacenter, with the same operational platform on both sides.
Why not the others
- A) NC2 has real costs (cloud bare-metal pricing, bandwidth); not free.
- C) NC2 is not a backup product; it is infrastructure.
- D) NC2 is the destination for replication, not a replication product.
The trap
A reflects a marketing-flavored mental model. NC2 has costs but the economics often beat building physical DR.
A customer has 200 VMs: 30 Tier-1 databases (5-min RPO, 30-min RTO), 70 Tier-2 apps (1-hour RPO, 2-hour RTO), 100 Tier-3 general (4-hour RPO, 4-hour RTO). What replication architecture do you recommend?
Why this answer
Match the replication mode to the RPO target. NearSync handles 5-minute RPO for Tier-1. Async handles the longer RPOs for Tier-2 and Tier-3 with appropriate intervals. Recovery Plans orchestrate failover for each tier.
Why not the others
- A) Metro is overkill (and infeasible at WAN distances) for most workloads. Tier-3's 4-hour RPO does not need Metro.
- C) Manual snapshots are not a DR strategy.
- D) NC2 is a destination option, not a replication mode answer.
The trap
A demonstrates the "use the strongest option" failure mode. Right-sizing to RPO/RTO requirements is the SA-chair discipline.
Which of the following is correct about test failover in Recovery Plans?
Why this answer
This is exactly what test failover is designed for: validate the runbook, in an isolated network, without affecting production. The platform isolates DR-site VMs from the production network during the test.
Why not the others
- A) Test failover is a core platform feature.
- C) Production VMs continue running normally; the test happens at the DR site.
- D) Test and real failovers differ specifically in network isolation and the post-test teardown step.
The trap
C is the misconception that any DR action affects production. Recovery Plans were designed to enable testing without disruption.
Which DR feature is NOT included in Nutanix's baseline platform (without Pro / Ultimate NCM tiers)?
Why this answer
Recovery Plans / NDR with advanced orchestration features sits at NCM Pro tier or higher. Basic replication, snapshots, and Protection Domains are part of the platform baseline.
Why not the others
- A) DSF snapshots are core to the platform.
- B) Async replication via PDs has been included for many AOS versions.
- D) Snapshot scheduling is included.
The trap
A and B sound like they should be tier-gated, but they are platform baseline. Recovery Plans' advanced features are the licensed add-on.
NCX-style design question. There is no single correct answer; there are stronger and weaker frames. Write your reasoning, then click to compare against the strong-answer outline.
A customer is consolidating their VMware environment onto Nutanix. They have:
- Two physical datacenters in the same metro area (50 km apart, 2-3 ms RTT, 10 GbE link)
- One co-located DR datacenter 800 km away (35-40 ms RTT, 1 Gbps WAN)
- 1,200 VMs total: ~50 mission-critical (zero-RPO, financial systems), ~250 Tier-1 production (5-15 min RPO, mostly databases), ~900 general-purpose (4-hour RPO acceptable)
- Existing investment in SRM with about 300 VMs orchestrated through it
- Compliance mandate for monthly DR test attestations
Design the data protection architecture: replication mode per tier, Protection Policy structure, Recovery Plan design, the SRM transition approach, DR test cadence.
A strong answer covers
- Multi-site architecture. Datacenters A and B (metro pair): Metro Availability between them for the 50 mission-critical VMs (zero RPO, 2-3 ms RTT well within Metro's 5 ms budget). Witness VM hosted at the third (DR) datacenter to provide quorum for the metro pair. DR datacenter (800 km): Async replication target for all tiers, NearSync for Tier-1 if bandwidth permits.
- Protection Policy structure (in Prism Central). Policy 1 "Mission-Critical-Metro": match VMs with Tier: Mission-Critical, Metro replication between A and B, hourly snapshots retained 14 days, replication to remote DR via Async. Policy 2 "Tier1-NearSync": match Tier: Tier1, NearSync to DR datacenter (verify capacity), 5-minute RPO target. Policy 3 "GeneralPurpose-Async": match Tier: General, Async replication to DR at 4-hour interval.
- SRM transition. SRM continues on ESXi-on-Nutanix for the 300 currently-orchestrated VMs (no forced migration). As workloads migrate to AHV they move to Recovery Plans. Plan a 12-18 month phased transition, prioritizing simpler workloads first.
- Bandwidth math. Verify 1 Gbps WAN supports sustained replication for the workload mix. NearSync for 250 VMs requires careful sizing; if bandwidth is constrained, drop to Async with 15-min intervals for Tier-1.
- DR test cadence. Monthly mission-critical full failover (compliance attestation), quarterly Tier-1, semi-annual general-purpose (rotating subset). Document each result in the audit log.
- Operational considerations. Witness VM availability is critical. Categories drive policy enrollment; new VMs must be properly tagged at creation. Use Prism Central RBAC to give the DR team appropriate access without broader admin rights.
- What you still need to know. WAN bandwidth profile (full 1 Gbps available, or shared?). Specific compliance framework (PCI DSS? SOX?). Application dependencies for startup order. SRM's current runbook complexity (scripted callouts that don't translate cleanly?). Should NC2 be considered as an additional or alternative DR option?
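The bandwidth-math bullet above can be sanity-checked with a short sketch. The per-VM write rate below is an assumed illustrative figure, not customer data; the shape of the check is the point, and NearSync's sustained bandwidth is treated as roughly proportional to aggregate write rate, as described earlier in this module:

```python
# Illustrative feasibility check for the scenario's 1 Gbps WAN.
# The 0.5 MB/s average write rate per VM is an assumption for the sketch,
# not measured customer data. NearSync sustained bandwidth ~ aggregate write rate.

def nearsync_fits(vm_count: int, avg_write_mb_s: float,
                  link_mbps: float, usable_fraction: float = 0.7):
    """Return (required Mb/s, whether it fits in the usable share of the link)."""
    required_mbps = vm_count * avg_write_mb_s * 8   # MB/s -> Mb/s
    return required_mbps, required_mbps <= link_mbps * usable_fraction

required, fits = nearsync_fits(250, 0.5, 1000)
print(required, fits)   # -> 1000.0 False: saturates the usable share of 1 Gbps
```

With these assumed numbers, 250 Tier-1 VMs on NearSync would consume the entire 1 Gbps link, which is exactly why the strong answer conditions NearSync on measured write rates and keeps Async at 15-minute intervals as the fallback.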
A weak answer misses
- Defaulting to Metro for all mission-critical without acknowledging Witness VM placement.
- Forgetting to plan Tier-1 NearSync bandwidth feasibility on a 1 Gbps WAN.
- Forced SRM migration timeline rather than coexistence.
- Missing the test cadence (compliance-driven monthly is the customer's specific requirement).
- Not naming category hygiene as an operational requirement.
Why this matters for NCX
NCX panels probe multi-site DR designs. The right answer integrates replication topology, orchestration, transition strategy, operational rhythm, and identifies the constraints that need validation. Pure-feature answers fail.
NCX-style architectural defense. Respond to the customer's senior DR architect. He is making a real argument; address it.
You are in front of a customer's senior DR architect, who has run SRM-based DR for 14 years. He says:
"SRM has 14 years of runbook customization, deep VMware integration, mature reporting, and a clear escalation path with VMware. Recovery Plans is younger, less feature-rich, and tied to Nutanix. Why would I move my proven DR practice to a less mature product?"
A strong answer covers
- Acknowledge SRM's maturity directly. SRM is mature. Recovery Plans is younger. Pretending otherwise loses credibility.
- Reframe the comparison precisely. Maturity vs integration. SRM's maturity is real. Recovery Plans' integration is also real: part of the platform, no separate purchase. The comparison is "mature standalone vs integrated platform feature," not "mature vs less mature in isolation."
- Runbook customization. SRM has more advanced scripted-callout customization. For typical enterprise DR, this is rarely the differentiator. Walk through what customization the customer actually uses; if it is simple ordering and IP remapping, Recovery Plans handles it; if there are complex scripted callouts, map those carefully.
- VMware integration. SRM's tight VMware integration is a feature when the entire stack is VMware. As workloads move to AHV the integration value diminishes (the workloads are no longer VMware). For ESXi-on-Nutanix, SRM continues to work; you don't lose the integration where it matters.
- Reporting maturity. SRM has more reporting depth. Prism's DR reporting is improving. For compliance-driven reporting, both can typically meet requirements; for nuanced operational dashboards, SRM has more polish today.
- Escalation path with the vendor. The concern is real. VMware's DR support organization is mature; so is Nutanix's. Encourage the architect to validate this through reference customers.
- Reframe the migration question. "You don't have to migrate. SRM continues to work on ESXi-on-Nutanix; your existing investment is preserved. New workloads on AHV use Recovery Plans. Evaluate Recovery Plans on its merits over time, not under migration pressure."
- Concrete validation step. "Run a test failover with Recovery Plans on a non-critical AHV workload. Compare the experience to SRM. The decision will be informed by hands-on experience, not feature-comparison decks."
- Close with the durable framing. "I am not here to replace what works. The right answer is probably hybrid for the next 12-24 months: SRM continues to handle what it orchestrates today; Recovery Plans handles new AHV workloads."
A weak answer misses
- Claiming Recovery Plans matches SRM in every dimension.
- Dismissing the architect's 14 years as outdated.
- Forcing a migration timeline.
- Not naming the coexistence pattern as the durable answer.
- Not offering the hands-on evaluation step.
Why this matters for NCX
Senior DR architects with deep SRM history are common in enterprise. The skill being tested is acknowledging real expertise, naming real gaps, and reframing to coexistence rather than forced migration. This is also the disposition that wins enterprise DR conversations.
What You Now Have
You can distinguish crash-consistent from application-consistent snapshots and know when each is appropriate. You know NGT's role in providing VSS coordination on Windows.
You know the difference between Protection Domains (legacy, Prism Element, manual membership) and Protection Policies (modern, Prism Central, category-driven). You can recommend Policies for new deployments while respecting existing PDs.
You have the three replication modes mapped to RPO and operational characteristics: Async (1-hour typical, WAN-friendly, low overhead), NearSync (sub-15-minute RPO, low-latency required, moderate-high overhead), Metro (zero RPO, <5 ms RTT, witness required, highest cost).
You can map application tiers to the right replication mode in 15 minutes. The matrix is in your hands.
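That tier-to-mode matrix can be sketched as a small decision function. The thresholds mirror the numbers this module uses (zero RPO requires Metro and <5 ms RTT, NearSync covers sub-15-minute RPO down to a 20-second floor, Async handles the rest); the function itself is an illustrative sketch, not a Nutanix tool, and it deliberately ignores NearSync's own latency and cluster-overhead requirements.

```python
# Decision-tree sketch of the tier-to-replication-mode matrix from this module.
# Illustrative only: real sizing must also check NearSync latency/overhead
# requirements, WAN bandwidth, and Witness VM placement for Metro.

def pick_mode(rpo_seconds: int, rtt_ms: float) -> str:
    """Return the lowest-cost replication mode that meets the RPO target."""
    if rpo_seconds == 0:
        if rtt_ms >= 5:
            raise ValueError("Zero RPO requires Metro, which needs <5 ms RTT")
        return "Metro"
    if rpo_seconds < 20:
        raise ValueError("Sub-20-second RPO is below the NearSync floor")
    if rpo_seconds < 15 * 60:
        return "NearSync"
    return "Async"

print(pick_mode(0, 2))      # mission-critical on a metro-distance link
print(pick_mode(60, 12))    # Tier-1 with a 1-minute RPO target
print(pick_mode(3600, 40))  # general-purpose hourly protection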
You have Recovery Plans (Nutanix Disaster Recovery, formerly Leap) as the SRM equivalent: orchestrated failover, network mapping, IP remapping, startup order, test failover, all integrated into Prism Central.
You can compare Recovery Plans to SRM honestly: SRM is more mature for advanced runbook customization; Recovery Plans is integrated and capable for typical use; coexistence is the durable answer for established SRM customers.
You have NC2 as the cloud DR option that eliminates the need for a second datacenter. The economics often beat physical DR for mid-market customers.
You know test failover is the feature customers neglect and the durable BlueAlly value: quarterly tests in 1-2 hours each instead of multi-day fire drills with disrupted production.
You are now ready for the unified storage layer. Module 8 covers Files, Objects, and Volumes: storage services that sit on top of DSF and replace separate file storage, object storage, and iSCSI block targets the customer is currently buying as separate appliances.
References
Authoritative sources verified during the technical review pass on this module. RPO numbers, latency thresholds, and product-naming history are validated against current Nutanix documentation; reverify before quoting specifics in a customer architecture proposal.
- Nutanix Bible · AOS Backup and DR. Authoritative source for snapshot semantics, replication modes (Async, NearSync, Metro), and the LWS / LWS-store architecture.
- Nutanix Bible · Disaster Recovery Services. Recovery Plans, Protection Policies, runbook orchestration.
- TN-2027 · NearSync replication powered by Light-Weight Snapshots. Authoritative NearSync technical reference; confirms 20-second RPO floor and LWS-on-SSD storage detail.
- TN-2027 · Metro Availability. Metro Availability technical reference.
- BP-2009 Metro Availability Best Practices. 5 ms RTT ceiling, witness placement, failure-handling configurations.
- Metro Cluster Latency: Microbursts and RTT Risk. Production design guidance: ≤3.5 ms RTT target under load with 5 ms as ceiling, P99.9 vs average latency considerations.
- Migrating a Guest VM from a Protection Domain to a Protection Policy (Nutanix Community). PD-to-Policy migration workflow and disruption considerations.
- Disaster Recovery with Nutanix AOS 6.10 and Prism Central 2024.2 (SOSTechBlog). Walkthrough of the current Nutanix Disaster Recovery (formerly Leap) UX.
- AOS 5.17: NearSync 20-second RPO announcement. Original announcement of the 20-second NearSync RPO milestone.
- NC2 on AWS · Product page. Cloud DR architecture reference for the NC2 section.
- Prism Element Data Protection Guide v7.3. Current Protection Domain documentation in Prism Element.
Cross-References
- Glossary: Crash-consistent · Application-consistent · Protection Domain · Protection Policy · Async Replication · NearSync · Metro Availability · LWS · Recovery Plan · Witness VM · NC2 · SRM · RPO · RTO Look up in Appendix A
- Comparison Matrix: Replication Modes · DR Orchestration · Cloud DR Look up in Appendix B
- Objections: #26 "What about SRM?" · #27 "We have array-based replication" · #28 "DR is too complex to migrate" · #29 "Cloud DR isn't for us" · #30 "Test failover is too disruptive" Look up in Appendix D
- Discovery Questions: Q-DR-01 RPO/RTO targets per tier · Q-DR-02 existing DR infrastructure · Q-DR-03 SRM footprint · Q-DR-04 DR test cadence and history · Q-DR-05 compliance / regulatory drivers Look up in Appendix E
- Sizing Rules: Replication bandwidth math · NearSync cluster overhead · Metro witness placement Look up in Appendix F