Reliability in the Atlas Well-Architected Framework

The Reliability pillar of the Atlas Well-Architected Framework includes features and strategies that minimize downtime and prevent data loss. A reliable workload detects failures as they occur and takes efficient, often automatic, action to regain availability and recover from data loss.

Foundations for Reliability

The following are the foundations for designing a reliable and resilient Atlas deployment:

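High Availability (HA): Deploy architectures that automatically self-heal when infrastructure fails. HA provides automatic failover with a Recovery Point Objective (RPO) of zero when you use majority write concern, and a Recovery Time Objective (RTO) measured in seconds.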
Disaster Recovery (DR): Implement manual recovery procedures using backups for scenarios that automatic failover cannot address, such as data corruption or accidental deletion.
Business Continuity Planning (BCP): Create a comprehensive plan that combines HA architecture, DR procedures, testing, and documentation to meet your RTO and RPO objectives.

Definitions

Recovery Time Objective (RTO) is the maximum acceptable time between a disruption and the point at which the application is restored and serving traffic again.
Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose in an outage, measured in units of time.
Availability is a measure of how reliably your system is accessible and functional when needed. It's often expressed as a percentage representing the proportion of time the system is available over a given period. For example, the gold standard of availability is often cited as 99.999%, or "five nines," which translates to approximately 5 minutes and 15 seconds of potential downtime per year.
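The arithmetic behind an availability target can be sketched in a few lines of Python. This is an illustrative helper, not part of Atlas: the function name `downtime_budget` and the default 365-day period are assumptions made for the example.

```python
from datetime import timedelta


def downtime_budget(
    availability_percent: float,
    period: timedelta = timedelta(days=365),  # assumption: a non-leap year
) -> timedelta:
    """Return the maximum downtime that still meets the availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    # The fraction of the period during which the system may be unavailable.
    unavailable_fraction = 1 - availability_percent / 100
    return period * unavailable_fraction


if __name__ == "__main__":
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% -> {downtime_budget(target)} of downtime per year")
```

Under these assumptions, a 99.999% target allows roughly 5 minutes and 15 seconds of downtime per year, while 99.9% allows roughly 8.76 hours.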

Overview of Atlas Features for Reliability

Atlas provides the following complementary approaches to reliability:

High Availability - Automatic Protection

Atlas deployments use replica sets with automatic failover to provide continuous availability during infrastructure failures. Each cluster deploys a minimum of three database instances spread across different availability zones. When a node or zone fails, automatic failover completes within seconds with zero data loss (when using majority write concern). Scale your deployment across multiple regions or cloud providers for protection against regional or provider outages.
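Because the zero-data-loss guarantee depends on majority write concern, it can help to set that option explicitly in the application's connection string. The following is a minimal sketch that only assembles the string. The host name and database name are hypothetical placeholders, credentials are omitted, and `build_atlas_uri` is a helper invented for this example. `w` and `retryWrites` are standard MongoDB connection string options.

```python
from urllib.parse import urlencode


def build_atlas_uri(host: str, database: str) -> str:
    """Assemble a mongodb+srv URI that requests majority write concern."""
    options = {
        # Acknowledge a write only after a majority of replica set members have it.
        "w": "majority",
        # Let the driver retry a write that is interrupted, for example by a failover.
        "retryWrites": "true",
    }
    return f"mongodb+srv://{host}/{database}?{urlencode(options)}"


if __name__ == "__main__":
    # Hypothetical cluster host and database name, for illustration only.
    print(build_atlas_uri("cluster0.example.mongodb.net", "app"))
```

Setting the option in the connection string applies it to every write from that client, which avoids relying on each call site to request it individually.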

Disaster Recovery - Manual Protection

Backups provide protection for scenarios that automatic failover cannot address, such as data corruption, accidental deletion, or complete deployment loss. Atlas offers fully managed backups with configurable frequency, point-in-time recovery, and multi-region distribution. Restoring from a backup requires manual intervention, but backups protect against data integrity issues that would otherwise replicate to every node.
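One way to reason about backup frequency is to compare it with your RPO. The sketch below uses a deliberately simplified model, which is an assumption of this example rather than a statement of how Atlas behaves: when restoring from the most recent snapshot alone, up to one full snapshot interval of writes can be lost. Point-in-time recovery narrows that window and is not modeled here. The function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(snapshot_interval: timedelta) -> timedelta:
    """Simplified model: a failure just before the next snapshot loses one interval."""
    return snapshot_interval


def meets_rpo(snapshot_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if snapshot-only restores stay within the RPO."""
    return worst_case_data_loss(snapshot_interval) <= rpo


if __name__ == "__main__":
    rpo = timedelta(hours=1)
    for interval in (timedelta(hours=6), timedelta(hours=1)):
        print(f"snapshots every {interval}: meets RPO = {meets_rpo(interval, rpo)}")
```

If the snapshot interval alone cannot satisfy the RPO, that is a signal to shorten the interval or to rely on point-in-time recovery for the affected deployment.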

Your Comprehensive Plan

Combine both approaches in a business continuity plan that documents your recovery objectives, deployment architecture, backup strategy, testing procedures, and response plans for different failure scenarios.

Use the following Atlas Architecture Center resources to learn more about the features and strategies for reliability in Atlas:

High Availability

Create cluster configurations that meet your availability needs and expedite recovery from disasters.

Disaster Recovery

Implement manual recovery using backups for data corruption, accidental deletion, and scenarios that automatic failover cannot address.

Business Continuity Planning

Create a comprehensive resilience plan combining high availability architecture, disaster recovery procedures, testing, and documentation.
