Reliability in the Atlas Well-Architected Framework

The Reliability pillar of the Atlas Well-Architected Framework includes features and strategies that minimize downtime and prevent data loss. A reliable workload is aware of failures as they occur and can take efficient, and often automatic, action to regain availability and recover from data loss.

Foundations for Reliability

Designing a reliable and resilient Atlas deployment rests on two foundations:

Design a deployment architecture with configuration options that ensure high availability in the face of expected infrastructure disruptions.
Create a disaster recovery plan that defines best practices and procedures to recover from disaster scenarios within your established RTO and RPO. This may involve configuring a backup policy to supplement your deployment's availability strategy and provide recovery options for data loss or corruption events.

Definitions

Recovery Time Objective (RTO) is the maximum acceptable downtime before the application is restored and starts serving traffic after a disruption.
Recovery Point Objective (RPO) is the maximum amount of data you can afford to lose in an outage, measured in units of time.
Availability is a measure of how reliably your system is accessible and functional when needed. It's often expressed as a percentage representing the proportion of time the system is available over a given period. For example, the gold standard of availability is often cited as 99.999%, or "five nines," which translates to approximately 5 minutes and 15 seconds of potential downtime per year.
High Availability refers to the ability of a system to remain accessible when faced with individual component failures. A deployment architecture designed for high availability often utilizes redundancy and failover mechanisms to achieve fault tolerance, meaning it can automatically switch to working components when a failure is detected.
Disaster Recovery refers to strategies for creating and managing discrete copies of the workload that can be utilized in disaster situations. A comprehensive disaster recovery plan defines procedures to regain system operation within a specified RTO, and recover data to a timestamp within a specified RPO, following a disaster scenario.
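The relationship between an availability percentage and the downtime it permits is simple arithmetic, and it can help to compute it when choosing a target. The following is a minimal sketch; the function name allowed_downtime and the 365-day default period are illustrative assumptions, not part of Atlas.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float,
                     period: timedelta = timedelta(days=365)) -> timedelta:
    """Return the downtime budget implied by an availability target.

    Hypothetical helper for illustration only: the budget is the share of
    the period during which the system may be unavailable.
    """
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = (100 - availability_percent) / 100
    # timedelta supports multiplication by a float (rounded to microseconds).
    return period * unavailable_fraction


if __name__ == "__main__":
    # Compare a few common availability targets over one 365-day year.
    for target in (99.9, 99.99, 99.999):
        print(f"{target}% availability allows {allowed_downtime(target)} of downtime per year")
```

Passing a shorter period, such as timedelta(days=30), gives the monthly budget for the same target.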

Overview of Atlas Features for Reliability

It's important to combine the correct high availability architecture, disaster recovery plan, and backup policy for your deployment in order to optimize reliability while balancing cost impact.

MongoDB's default deployment architecture is designed for high availability. Atlas deploys each cluster as part of a replica set with a minimum of three database instances (also called nodes), spread automatically across different availability zones. In the event of a single-zone outage, failover between instances is fully automatic and completes within seconds without data loss. If retryable writes are enabled, this protection extends to operations that were in flight at the time of the failure. To improve availability for your most critical applications, you can scale your deployment by adding nodes, regions, or cloud providers to withstand zone, region, or provider outages.
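Automatic failover in a replica set depends on a majority of voting members remaining reachable so that a new primary can be elected. The sketch below shows how node count relates to the number of simultaneous node failures a replica set can absorb. It is a simplified model that assumes every node is a voting member, and the function names are illustrative rather than an Atlas API.

```python
def majority(voting_members: int) -> int:
    """Smallest number of voting members that forms a majority."""
    if voting_members < 1:
        raise ValueError("a replica set needs at least one voting member")
    return voting_members // 2 + 1


def fault_tolerance(voting_members: int) -> int:
    """Number of voting members that can fail while a majority survives."""
    return voting_members - majority(voting_members)


if __name__ == "__main__":
    # Assumption: all nodes vote. Compare a few replica set sizes.
    for nodes in (3, 4, 5, 7):
        print(f"{nodes} nodes: majority={majority(nodes)}, "
              f"tolerates {fault_tolerance(nodes)} node failure(s)")
```

Under this model, adding a fourth node to a three-node set does not raise the number of tolerated failures, which is one reason odd node counts are common.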

Backups are also critical to system reliability. Systems designed for high availability rely less on backups to protect against data loss, but backups remain the best protection against disasters other than infrastructure outages that can cause data loss or corruption, such as cyberattacks or code errors. Robust disaster recovery planning involves deciding whether a backup policy is necessary to satisfy your calculated RPO and RTO.
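One way to reason about whether a backup policy satisfies your objectives is to compare its worst-case figures against your RPO and RTO. The sketch below is a deliberately simplified model: it assumes snapshot-only backups, where the worst-case data loss equals the interval between snapshots, and it takes the restore time as an estimate you supply. All names are hypothetical and do not correspond to Atlas settings.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryObjectives:
    rto: timedelta  # maximum acceptable downtime
    rpo: timedelta  # maximum acceptable data loss, measured in time


def check_backup_policy(objectives: RecoveryObjectives,
                        snapshot_interval: timedelta,
                        estimated_restore_time: timedelta) -> dict:
    """Compare a snapshot-only backup policy against the RTO and RPO.

    Assumption: data written after the latest snapshot is lost on restore,
    so worst-case loss equals the snapshot interval.
    """
    return {
        "rpo_met": snapshot_interval <= objectives.rpo,
        "rto_met": estimated_restore_time <= objectives.rto,
    }


if __name__ == "__main__":
    objectives = RecoveryObjectives(rto=timedelta(hours=1), rpo=timedelta(minutes=15))
    result = check_backup_policy(
        objectives,
        snapshot_interval=timedelta(hours=6),          # hypothetical policy
        estimated_restore_time=timedelta(minutes=45),  # hypothetical estimate
    )
    print(result)
```

In this example the estimated restore time fits within the RTO, but six-hourly snapshots alone cannot meet a 15-minute RPO, so a more frequent or different recovery option would be needed.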

Use the following Atlas Architecture Center resources to learn more about the features and strategies for reliability in Atlas:

High Availability

Create cluster configurations that meet your availability needs and expedite recovery from disasters.

Backups

Configure database backup options in Atlas and get recommendations to meet your RTO and RPO requirements with cluster-wide snapshots.

Disaster Recovery

Create a disaster recovery (DR) plan with the steps to take if you experience an outage, deletion of production data, or another disaster scenario.
