Guidance for Atlas Disaster Recovery

Disaster recovery (DR) is a resilience strategy that uses manual procedures to recover your deployment when automatic failover cannot help. Unlike high availability, which provides automatic self-healing for infrastructure failures, disaster recovery requires human intervention to restore from backups.

Note

The MongoDB Atlas Shared Responsibility Model defines the complementary duties of MongoDB and its customers in maintaining a secure and resilient data environment. Under this framework, MongoDB manages the security and operational integrity of the underlying platform, while customers are responsible for the configuration, management, and data policies of their specific deployments. For a detailed breakdown of ownership across security and operational excellence, see Shared Responsibility Model.

Important

Prefer high availability when possible. Use disaster recovery as your primary resilience strategy only when high availability is not feasible.

Disaster recovery is appropriate as your primary resilience strategy in the following situations:

Geographic Constraints
You cannot deploy to multiple regions with sufficient availability zones. For example, if you must deploy in Canada, only two regions are available, which is too few to satisfy the quorum requirements for automatic failover.
Cost Optimization
Your budget requires minimizing infrastructure costs. Multi-region or multi-cloud high availability deployments cost more than single-region deployments with backups.
Tolerance for Manual Recovery
Your application can tolerate the time required for manual intervention and backup restoration (RTO in minutes to hours rather than seconds).

Note

Even if you use high availability, you might benefit from disaster recovery procedures for scenarios that automatic failover cannot address, such as data corruption or accidental deletion. To learn more about combining both approaches, see Guidance for Atlas Business Continuity Planning.

Backups are the primary tool for implementing disaster recovery. Atlas provides fully managed, customizable backups that enable manual recovery when automatic failover cannot help.

When you enable Cloud Backup for your cluster, Atlas uses your cloud provider's native snapshot capabilities to create incremental backups that are immutable by default with fast restore times.

How it supports DR:

  • Configure snapshot frequency (hourly, daily, weekly, monthly, yearly) to meet your RPO requirements.

  • Set snapshot retention periods to meet compliance and recovery needs.

  • Restore snapshots to recover from data corruption or deletion.

When to use: Standard disaster recovery scenarios where you can tolerate data loss equal to your snapshot frequency. For example, hourly snapshots result in up to 1 hour of data loss.
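The relationship between snapshot frequency and worst-case data loss can be sketched as follows. This is a minimal illustration, not an Atlas API: the `SNAPSHOT_INTERVALS` table and both function names are hypothetical, and the monthly and yearly intervals are rough approximations.

```python
from datetime import timedelta

# Snapshot frequencies named in the text, mapped to approximate intervals.
# Assumption: one snapshot per period; monthly and yearly are approximations.
SNAPSHOT_INTERVALS = {
    "hourly": timedelta(hours=1),
    "daily": timedelta(days=1),
    "weekly": timedelta(weeks=1),
    "monthly": timedelta(days=30),
    "yearly": timedelta(days=365),
}


def worst_case_data_loss(frequency: str) -> timedelta:
    """Return the data lost if a failure occurs just before the next snapshot."""
    return SNAPSHOT_INTERVALS[frequency]


def meets_rpo(frequency: str, rpo_target: timedelta) -> bool:
    """Check whether a snapshot frequency can satisfy a target RPO."""
    return worst_case_data_loss(frequency) <= rpo_target
```

For example, `meets_rpo("daily", timedelta(hours=4))` is false, which suggests either a more frequent snapshot schedule or Continuous Cloud Backup for that target.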

When you enable Continuous Cloud Backup in addition to Cloud Backup, Atlas stores oplog data along with your Cloud Backup snapshots, allowing Atlas to restore your cluster to a specific moment in time within a configurable Point in Time (PIT) restore window.

How it supports DR:

  • Restore to any moment within a configurable PIT restore window. Atlas retains oplog data for up to the length of the restore window, and replays oplog entries on top of the nearest Cloud Backup snapshot to restore to the precise moment you specify.

  • Achieve an RPO as low as 1 minute.

  • Precisely target the moment before data corruption or deletion occurred.

When to use: When you need minimal data loss (low RPO) or precise recovery to a specific timestamp.
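The point-in-time restore logic described above can be sketched as follows. This is an illustration of the idea only, not how Atlas implements it: the function name and parameters are hypothetical, and it assumes that the "nearest" snapshot is the most recent one taken at or before the target time, because oplog entries are replayed forward.

```python
from datetime import datetime, timedelta
from typing import Sequence, Tuple


def plan_pit_restore(
    snapshots: Sequence[datetime],
    target: datetime,
    now: datetime,
    restore_window: timedelta,
) -> Tuple[datetime, timedelta]:
    """Pick the base snapshot and the span of oplog to replay on top of it."""
    # The target must fall inside the configured PIT restore window.
    if target > now or target < now - restore_window:
        raise ValueError("target is outside the PIT restore window")

    # Assumption: replay starts from the latest snapshot at or before the target.
    candidates = [s for s in snapshots if s <= target]
    if not candidates:
        raise ValueError("no snapshot exists at or before the target")

    base = max(candidates)
    return base, target - base
```

The returned duration is the amount of oplog that would need to be replayed, which gives a rough sense of why restores that start far from a snapshot can take longer.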

When you enable Cloud Backup with a snapshot copy policy, you can configure Atlas to automatically distribute copies of your snapshots and oplogs to additional cloud provider regions in other geographies.

How it supports DR:

  • Meets compliance requirements for geographically distributed backups.

  • Ensures backups remain accessible even if the primary region fails completely.

  • For a multi-region cluster, the Automatically Sync Copy Regions with Cluster Regions option in your snapshot copy policy lets Atlas automatically distribute snapshots to all regions where your cluster is deployed and keeps the policy in sync as you add or remove regions. After a regional outage, Atlas can then restore cluster regions from local snapshot copies using faster, direct-attach restores instead of slower cross-region streaming restores.

When to use: When you need protection against complete regional loss or provider outages that exceed your HA configuration.

Backup strategies protect against data integrity failures that automatic failover cannot address, such as data corruption, accidental deletion, and complete deployment loss. All backup strategies involve some data loss (RPO > 0) and longer recovery times (RTO in minutes to hours). The trade-offs are in how much data you might lose and the operational complexity of each approach.

Backup strategy: Periodic snapshots

  • Primary use cases: Data corruption, accidental deletion, complete deployment loss.

  • Typical RTO: Minutes to hours.

  • RPO: > 0 (up to snapshot interval).

  • Operational considerations: Define schedule/retention. Regularly test restore procedures.

  • Cost considerations: Storage grows with frequency and retention. Restore incurs compute/egress costs.

Backup strategy: Continuous backup with point-in-time restore

  • Primary use cases: Data corruption or deletion where "bad" timestamp is known.

  • Typical RTO: Minutes to hours.

  • RPO: > 0 (near-zero, down to seconds).

  • Operational considerations: Configure Point in Time (PIT) restore windows. Monitor for failures.

  • Cost considerations: Higher storage and backup cost than snapshots alone.

Backup strategy: Multi-region snapshot distribution

  • Primary use cases: Region or provider loss that also affects local backups.

  • Typical RTO: Minutes to hours (can be slower cross-region).

  • RPO: > 0 (tied to snapshot schedule).

  • Operational considerations: Plan and test cross-region restores. Manage retention policy.

  • Cost considerations: Additional storage in multiple regions. Cross-region egress on restore.

The following recommendations apply to all deployment paradigms.

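This section covers disaster recovery procedures for the following scenarios: single-node outages, regional outages, cloud provider outages, Atlas Control Plane outages, resource capacity issues, resource failures, deletion of production data, driver failures, and data corruption.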

If a single node in your replica set fails due to a partial regional outage, your deployment should remain available, provided that you have followed best practices. If you read from secondaries, the failure of a secondary node can degrade performance or cause outages, because the remaining nodes must absorb the additional load and the cluster might then be underprovisioned.

You can test a primary node outage in Atlas using the Atlas UI's Test Primary Failover feature or the Test Failover Atlas Administration API endpoint.
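A request to the Test Failover endpoint might be prepared as in the sketch below. The function name is hypothetical, and the endpoint path and versioned `Accept` header are assumptions based on Atlas Administration API v2 conventions, so verify both against the API reference before use. The sketch only builds the request. It does not send it, and authentication is omitted.

```python
import urllib.request
from urllib.parse import quote

# Assumption: base URL of the Atlas Administration API v2.
ATLAS_BASE_URL = "https://cloud.mongodb.com/api/atlas/v2"


def build_test_failover_request(group_id: str, cluster_name: str) -> urllib.request.Request:
    """Build (but do not send) a POST request that asks Atlas to test primary failover."""
    # Assumption: the Test Failover resource is exposed as .../restartPrimaries.
    url = (
        f"{ATLAS_BASE_URL}/groups/{quote(group_id, safe='')}"
        f"/clusters/{quote(cluster_name, safe='')}/restartPrimaries"
    )
    return urllib.request.Request(
        url,
        method="POST",
        # Assumption: a versioned media type is required by the v2 API.
        headers={"Accept": "application/vnd.atlas.2023-01-01+json"},
    )
```

Run a test failover only against a cluster where a brief primary election is acceptable, and confirm that your application reconnects to the new primary as expected.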

In the event of a regional outage, multi-region clusters automatically hold an election and identify a new primary node if necessary. This topology change is automatically communicated to the application, allowing it to take any necessary corrective action. In order to maintain application uptime in the event of a regional outage, your application must also be deployed with a multi-region topology. This requirement extends to include any third-party service your application may be integrated with. To learn more, see Multi-Region Deployment Paradigm.

If a single region outage or multi-region outage degrades the state of your cluster, follow these steps:

1
2

You can find information about cluster health in the cluster's Overview tab of the Atlas UI.

3

Based on how many nodes are left online, determine how many new nodes you require to restore the replica set to a normal state.

A normal state is a state in which the majority of nodes are available.

4

Depending on the cause of the outage, additional regions might also experience unscheduled outages in the near future. For example, if a natural disaster on the east coast of the United States caused the outage, avoid other regions on the east coast of the United States in case of further issues.

5

Add the required number of nodes for a normal state across regions that are unlikely to be affected by the cause of the outage.

To reconfigure a replica set during an outage by adding regions or nodes, see Reconfigure a Replica Set During a Regional Outage.

6

In addition to adding nodes to restore your replica set to a normal state, you can add additional nodes to match the topology of your replica set before the disaster.
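The node count from step 3 can be worked out with simple majority arithmetic, as in this minimal sketch. The function name is hypothetical. It assumes that new nodes replace offline ones, so the total number of voting members stays the same. If the offline nodes remain in the configuration while you add nodes, the majority threshold rises and more nodes are needed.

```python
def nodes_needed_for_majority(total_voting_nodes: int, healthy_nodes: int) -> int:
    """Return how many healthy nodes must be added to regain a majority.

    Assumption: each added node replaces an offline node, so the size of
    the replica set (total_voting_nodes) does not change.
    """
    if total_voting_nodes < 1 or not 0 <= healthy_nodes <= total_voting_nodes:
        raise ValueError("invalid node counts")

    # A majority is strictly more than half of the voting members.
    majority = total_voting_nodes // 2 + 1
    return max(0, majority - healthy_nodes)
```

For a five-node replica set with two healthy nodes, the majority is three, so one replacement node restores a normal state. Adding the remaining nodes afterwards matches the pre-disaster topology, as described in step 6.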

You can test a region outage in Atlas using the Atlas UI's Simulate Outage feature or the Start an Outage Simulation Atlas Administration API endpoint.

With multi-cloud clusters, you can select electable nodes across cloud providers to maintain high availability. If the provider that hosts your primary node becomes unavailable, Atlas automatically elects a new primary node to ensure continuous operation. For example, you can create electable nodes on AWS, Google Cloud, and Microsoft Azure so that if one cloud provider experiences an outage, an electable node on a separate provider can automatically take over as your cluster's primary node. To learn more, see Multi-Cloud Deployment Paradigm.

Most multi-region Atlas clusters will recover automatically from a single region outage. To learn more, see Guidance for Atlas High Availability and Multi-Region Deployment Paradigm. In the case that regional outages have taken a majority of nodes offline, you must determine how many more nodes you need to add in order for a majority of nodes to be healthy.

In the highly unlikely event that an entire cloud provider is unavailable, follow these steps to bring your deployment back online:

1

You need this information later in this procedure to restore your deployment.

2

For a list of cloud providers and information, see Cloud Providers.

3

To learn how to view your backup snapshots, see View M10+ Backup Snapshots.

4

Your new cluster must have a topology identical to that of the original cluster.

Alternatively, instead of creating a full new cluster, you can also add new nodes hosted by an alternative cloud provider to the existing cluster.

5

To learn how to restore your snapshot, see Restore Your Cluster.

6

To find the new connection string, see Connect via Client Libraries. Review your application stack as you likely need to redeploy it onto the new cloud provider.

In the highly unlikely event that the Atlas Control Plane and the Atlas UI are unavailable, your cluster is still available and accessible. To learn more, see Platform Reliability. Open a high-priority support ticket to investigate this further.

Capacity issues with computational resources (such as disk space, RAM, or CPU) can result from poor planning or unexpected database traffic. These issues are not necessarily the result of a disaster.

If a computational resource reaches the maximum allocated amount and causes a disaster, follow these steps:

1

To view your resource utilization in the Atlas UI, see Monitor Real-Time Performance.

To view metrics with the Atlas Administration API, see Monitoring and Logs.

2
3

Atlas applies this change in a rolling manner, so it should not have a major impact on your applications.

To learn how to allocate more resources, see Edit a Cluster.

4

Important

This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.

If a computational resource fails and causes your cluster to become unavailable, follow these steps:

1
2
3

To learn how to restore your snapshot, see Restore Your Cluster.

4

Production data might be accidentally deleted due to human error or a bug in the application built on top of the database. If the cluster itself was accidentally deleted, Atlas might retain the volume temporarily.

If the contents of a collection or database have been deleted, follow these steps to restore your data:

1
2

You can use mongoexport to create a copy.

3

If the deletion occurred within the last 72 hours, and you configured continuous backup, use Point in Time (PIT) restore to restore from the point in time right before the deletion occurred.

If the deletion did not occur in the past 72 hours, restore the most recent backup from before the deletion occurred into the cluster.

To learn more, see Restore Your Cluster.

4

You can use mongoimport with upsert mode to import your data and ensure that any data that was modified or added is reflected properly in the collection or database.
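The export and import commands mentioned in the steps above might be assembled as in this sketch. The function names, connection strings, and file paths are hypothetical placeholders, and the sketch assumes that the database name is part of the connection string. It builds argument lists only and does not run the tools.

```python
from typing import List


def mongoexport_args(uri: str, collection: str, out_file: str) -> List[str]:
    """Build a mongoexport command that copies a collection to a file."""
    return [
        "mongoexport",
        f"--uri={uri}",
        f"--collection={collection}",
        f"--out={out_file}",
    ]


def mongoimport_upsert_args(uri: str, collection: str, in_file: str) -> List[str]:
    """Build a mongoimport command that uses upsert mode.

    In upsert mode, documents that match an existing document are replaced
    and all other documents are inserted.
    """
    return [
        "mongoimport",
        f"--uri={uri}",
        f"--collection={collection}",
        "--mode=upsert",
        f"--file={in_file}",
    ]
```

Each list can be passed to `subprocess.run(args, check=True)` on a machine where the MongoDB Database Tools are installed. Review the exported file before importing it so that you do not reintroduce unwanted changes.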

If a driver fails, follow these steps:

1

You can work with the technical support team during this step.

Determine whether the issue is related to an outdated driver version or a recently updated driver version.

2
  • If you are using an outdated driver, check if upgrading to a newer version resolves the issue. Most driver problems are fixed in newer releases.

  • If you recently upgraded your driver and suspect the new version introduced the issue, consider reverting to the previous working version.

3

This might include application code or query changes. For example, there might be breaking changes if you are moving between major versions, or new features available if upgrading.

4
5

Ensure that any other changes from the previous step are reflected in the production environment.

Important

This is a temporary solution intended to shorten overall system downtime. Once the underlying issue resolves, merge the data from the newly-created cluster into the original cluster and point all applications back to the original cluster.

If your underlying data becomes corrupted, follow these steps:

1
2
3

To learn how to restore your snapshot, see Restore Your Cluster.

4
5