Test Primary Failover

Note

This feature is not available for M0 Free clusters and Flex clusters. To learn more about which features are unavailable, see Atlas M0 (Free Cluster), M2, and M5 Limits.

Atlas conducts replica set elections when it makes configuration changes, such as patch updates and scaling events, and when failures occur. Your applications should handle replica set elections without any downtime. To learn how to build a resilient application, see Build a Resilient Application with MongoDB Atlas.

You can enable retryable writes by adding retryWrites=true to your Atlas URI connection string. To learn more, see Retryable Writes.
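As an illustration, the following minimal Python sketch adds retryWrites=true to a connection string by using only the standard library. The helper name with_retryable_writes and the host name in the usage note are hypothetical placeholders, not part of Atlas or any driver.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def with_retryable_writes(uri: str) -> str:
    """Return the connection string with retryWrites=true in its options.

    Hypothetical helper for illustration only.
    """
    parts = urlsplit(uri)
    # Keep the existing options, dropping any earlier retryWrites value.
    options = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() != "retrywrites"
    ]
    options.append(("retryWrites", "true"))
    # Connection string options follow "/?", so make sure the "/" is present.
    path = parts.path or "/"
    return urlunsplit((parts.scheme, parts.netloc, path, urlencode(options), ""))
```

For example, passing mongodb+srv://cluster0.example.mongodb.net/?w=majority (a placeholder host) returns the same string with &retryWrites=true appended. Note that the helper re-encodes existing option values, so check the result if your options contain special characters.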

You can use the Atlas UI and API to test the failure of the replica set primary in your Atlas cluster and observe how your application handles a replica set failover.

Required Access

To start a failover test, you must have Organization Owner, Project Owner, Project Cluster Manager, or Project Stream Processing Owner access to the project.

Prerequisites

Before you test the failure of the replica set primary, you must meet the following conditions:

All pending changes to your cluster must be complete.
All members of the cluster must be in a healthy state with up-to-date monitoring data.
Each replica set or shard must have a primary node.
Every member of the cluster must have a replication lag of less than 10 seconds.
All members of your cluster must have at least 5% of available disk space remaining.
All primary node oplogs must have enough space for a three-hour operation.
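The conditions above can be expressed as a simple pre-flight check. The following sketch is an assumption-laden illustration for a single replica set: the MemberStatus fields and the unmet_prerequisites function are hypothetical, and you would need to populate them from your own monitoring data. It does not model the oplog condition, because checking it depends on data this page does not describe.

```python
from dataclasses import dataclass


@dataclass
class MemberStatus:
    """Hypothetical snapshot of one replica set member."""
    name: str
    healthy: bool
    is_primary: bool
    replication_lag_seconds: float
    free_disk_percent: float


def unmet_prerequisites(members, pending_changes=False):
    """Return a list of the prerequisites that the replica set fails."""
    problems = []
    if pending_changes:
        problems.append("cluster has pending changes")
    if not any(member.is_primary for member in members):
        problems.append("replica set has no primary")
    for member in members:
        if not member.healthy:
            problems.append(f"{member.name}: not healthy")
        # Lag must be strictly less than 10 seconds.
        if member.replication_lag_seconds >= 10:
            problems.append(f"{member.name}: replication lag is 10 seconds or more")
        # At least 5% of disk space must remain available.
        if member.free_disk_percent < 5:
            problems.append(f"{member.name}: less than 5% disk space available")
    return problems
```

An empty list means that the modeled conditions are met. Atlas performs its own validation when you submit the request, so treat this only as a way to reason about the list above.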

Test Primary Failover Process

Important

Ensure that your Atlas cluster is healthy before you test primary failover. Otherwise, Atlas might reject your request.

When you submit a request to test primary failover, Atlas simulates a failover event. During this process:

Atlas shuts down the current primary.
The members of the replica set hold an election to choose which of the secondaries will become the new primary.
Atlas brings the original primary back to the replica set as a secondary. When the old primary rejoins the replica set, it syncs with the new primary to catch up on any writes that occurred during its downtime.

The following statements describe Atlas behavior during rollbacks and when testing failover in sharded clusters:

If the original primary accepted write operations that had not been successfully replicated to the secondaries when the primary stepped down, the primary rolls back those write operations when it rejoins the replica set and begins synchronizing. To learn more, see Rollbacks During Replica Set Failover. Contact MongoDB Support for assistance with resolving rollbacks.
Only the mongos processes that are on the same instances as the primaries of the replica sets in the sharded cluster are restarted.
The primaries of the replica sets in the sharded cluster are restarted in parallel.

To start a failover test for the specified cluster in your project using the Atlas CLI, run the following command:

atlas clusters failover <clusterName> [options]

To learn more about the command syntax and parameters, see the Atlas CLI documentation for atlas clusters failover.

You can use the Test Failover API endpoint to simulate a failover event. To learn more about the failover process, see Test Primary Failover Process.
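The following sketch shows how such a request might be built in Python without sending it. The resource path (restartPrimaries), the base URL, and the versioned Accept header are assumptions that do not appear on this page, so confirm them against the Atlas Administration API reference before use. Authentication is intentionally left out, and the function name is hypothetical.

```python
import urllib.request
from urllib.parse import quote

# Assumed base URL of the Atlas Administration API.
BASE_URL = "https://cloud.mongodb.com/api/atlas/v2"


def build_test_failover_request(group_id: str, cluster_name: str) -> urllib.request.Request:
    """Build, but do not send, a request that starts a failover test.

    The path and header below are assumptions; verify them in the API reference.
    """
    url = (
        f"{BASE_URL}/groups/{quote(group_id, safe='')}"
        f"/clusters/{quote(cluster_name, safe='')}/restartPrimaries"
    )
    return urllib.request.Request(
        url,
        method="POST",
        # Assumed versioned media type for the Atlas Administration API.
        headers={"Accept": "application/vnd.atlas.2023-01-01+json"},
    )
```

To send the request, you would add credentials that your organization supports and pass the object to an HTTP client of your choice.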

To perform a Primary Failover test using the Atlas UI:

In Atlas, go to the Clusters page for your project.
Warning
Navigation Improvements In Progress
We're currently rolling out a new and improved navigation experience. If the following steps don't match your view in the Atlas UI, see the preview documentation.
1. If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.
2. If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
3. If it's not already displayed, click Clusters in the sidebar.
  The Clusters page displays.
For the cluster on which you want to test failover, click the ... button.
Click Test Resilience.
On the Test Resilience modal, click the Primary Failover tab. Atlas displays the steps that it takes to simulate a failover event. To learn more, see Test Primary Failover Process.
Click Restart Primary to begin the test. Atlas displays the results of your simulated failover process in the Test Resilience modal.

Verify the Failover

To verify that the failover is successful:

In Atlas, go to the Clusters page for your project.

If it's not already displayed, select the organization that contains your desired project from the Organizations menu in the navigation bar.
If it's not already displayed, select your desired project from the Projects menu in the navigation bar.
If it's not already displayed, click Clusters in the sidebar.
The Clusters page displays.

Observe the nodes.

Click the name of the cluster for which you performed the failover test.
Observe the following changes in the list of nodes in the Overview tab:
- The original PRIMARY node is now a SECONDARY node.
- A former SECONDARY node is now the PRIMARY node.
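The two changes above can also be checked programmatically if you record node roles before and after the test. The following sketch assumes that you have collected the roles yourself into dictionaries that map a host name to PRIMARY or SECONDARY. The function name and the host names in the usage note are hypothetical.

```python
def failover_succeeded(before: dict, after: dict) -> bool:
    """Return True if the primary role moved to a former secondary.

    before and after map host names to "PRIMARY" or "SECONDARY".
    """
    old_primaries = [host for host, role in before.items() if role == "PRIMARY"]
    new_primaries = [host for host, role in after.items() if role == "PRIMARY"]
    # A replica set has exactly one primary at each observation.
    if len(old_primaries) != 1 or len(new_primaries) != 1:
        return False
    old, new = old_primaries[0], new_primaries[0]
    return (
        old != new
        and after.get(old) == "SECONDARY"   # original primary is now a secondary
        and before.get(new) == "SECONDARY"  # new primary was a secondary
    )
```

For example, if host-0 was PRIMARY before the test and host-1 is PRIMARY afterward while host-0 is SECONDARY, the function returns True.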

Troubleshoot Failover Issues

If your application doesn't handle the failover gracefully, ensure the following:

You are using the SRV Connection Format.
You are using the latest version of the driver.
You have implemented appropriate retry logic in your application.
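As one possible shape for the retry logic mentioned above, the following sketch retries an operation with exponential backoff. TransientError is a hypothetical stand-in for whatever exception your driver raises while an election is in progress (with PyMongo, pymongo.errors.AutoReconnect is one candidate), and the attempt count and delays are arbitrary example values.

```python
import time


class TransientError(Exception):
    """Hypothetical stand-in for a driver error raised during an election."""


def call_with_retry(operation, attempts=5, base_delay=0.5, sleep=time.sleep):
    """Call operation, retrying on TransientError with exponential backoff."""
    for attempt in range(attempts):
        try:
            return operation()
        except TransientError:
            # Give up after the final attempt and surface the error.
            if attempt == attempts - 1:
                raise
            # Wait 0.5 s, 1 s, 2 s, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)
```

Retry only operations that are safe to repeat. Retryable writes already cover a single automatic retry for supported write operations, so application-level retries are a complement to that setting, not a replacement for it.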
