Planning for Chaos with MongoDB Atlas: Using the "Test Failover" Button

Jay Gordon


When building an application, it's smart to consider chaos. Chaos can be introduced into an application in many different ways; some examples are:

  • Running out of disk space
  • Utilizing all connections to the cluster
  • Oversaturating the available IOPS
  • Network connectivity failure

To help you prepare for such an event, MongoDB Atlas has introduced a new feature called "Test Failover" that you can use to introduce some chaos for testing purposes.

Welcome to Chaos Engineering

One of the more popular terms to come out of the open source community has been "Chaos Engineering." On the "Principles of Chaos Engineering" site you'll find the following definition, which captures why the "Test Failover" feature in MongoDB Atlas exists:

Chaos Engineering is the discipline of experimenting on a distributed system in order to build confidence in the system’s capability to withstand turbulent conditions in production...

Chaos Engineering strives to eliminate the pain points in a distributed system by introducing a failure of one of the components in a test environment and reviewing the output. The harder it is to introduce chaos that will cause an application to no longer operate, the more confidence we can place in the infrastructure our app lives on.

The team at Netflix had to ensure that their massive distributed application would still survive if chaos was introduced. Based on the demand of their customer base and the distributed nature of their systems, the engineers at Netflix needed to ensure they could handle a failure of a production system. They created an open source tool called "Chaos Monkey" which you can read more about in this blog post.

The main intent of this chaos is to ensure that if part of production fails, you don't end up with a completely out of service application. For this reason, we've nicknamed the "Test Failover" feature the Chaos Button.

Chaos Checklist

One of the more important concepts of pre-production application architecture testing is ensuring that your application will continue to work during an unplanned outage. I like to create pre-deployment checklists to make sure I have considered all the potential ways my app could fail. These checklists typically consist of things like backups, restore testing and disaster recovery.

Some questions I like to have answered prior to going to production are:

  • Do I know how my app will respond when access to my data is temporarily interrupted?
  • When my database recovers, will the application work as expected?
  • Did I configure my application to utilize the full connection string to ensure failover?
  • If an issue occurs with my data, will I need to do any form of intervention?
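
The connection string question above can be checked mechanically before a deployment. What follows is a minimal sketch, not an official Atlas or driver utility: the function names `seed_hosts` and `lists_full_replica_set`, the default of three expected members, and the example hostnames are all assumptions made for illustration. It only inspects the text of the URI and never connects to a cluster.

```python
from typing import List


def seed_hosts(uri: str) -> List[str]:
    """Return the host[:port] entries listed in a MongoDB connection string."""
    scheme, sep, rest = uri.partition("://")
    if not sep or scheme not in ("mongodb", "mongodb+srv"):
        raise ValueError("not a MongoDB connection string")
    # Drop the database path and options, then any "user:password@" prefix.
    authority = rest.split("/", 1)[0].split("?", 1)[0]
    hosts = authority.rpartition("@")[2]
    return [host for host in hosts.split(",") if host]


def lists_full_replica_set(uri: str, expected_members: int = 3) -> bool:
    """Heuristic: does the URI give the driver enough seeds to ride out a failover?"""
    if uri.startswith("mongodb+srv://"):
        # An SRV name is expanded into the member list through DNS,
        # so a single hostname is sufficient in this form.
        return True
    return len(seed_hosts(uri)) >= expected_members
```

A URI that names only one member can still work once connected, because drivers discover the rest of the replica set, but the first connection fails if that single seed happens to be the node that is down. Listing every member, or using the `mongodb+srv://` form, avoids that case.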

Testing your application before going to production lets you review how your app survives an incident or planned maintenance in which a failover may occur. You adopt the same practice of preparing for chaos that the team at Netflix did.

How "Test Failover" works

The "Test Failover" button will reboot the instance your primary lives on. Your cluster will then hold an election and promote one of your secondaries, generally the one with the most up-to-date oplog, to be your new primary.

Once failover is completed, the former primary instance is placed back into your cluster with the same hostname. Your connection string will not require modification, because MongoDB drivers monitor the replica set and automatically discover which member of your Atlas cluster is the new primary.
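
Even though the driver finds the new primary on its own, an operation issued while the election is still in progress may fail briefly. One common way to absorb that window is a retry with exponential backoff. The helper below is a sketch under stated assumptions: `retry_through_failover` is a hypothetical name, the retry counts and delays are illustrative, and the built-in `ConnectionError` stands in for the driver's own exception. With PyMongo, for example, you would likely pass `pymongo.errors.AutoReconnect` as `retry_on` instead.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_through_failover(
    operation: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...] = (ConnectionError,),
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run operation, retrying with exponential backoff while an election completes."""
    for attempt in range(attempts):
        try:
            return operation()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of retries: surface the error to the caller
            sleep(base_delay * 2 ** attempt)  # waits 0.5s, 1s, 2s, ...
    raise ValueError("attempts must be at least 1")
```

Only wrap operations that are safe to run twice, since a write that reached the old primary just before the reboot could otherwise be applied again on retry.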

Begin your test

Note: In order to test failover, you need to be using a dedicated MongoDB Atlas cluster. This means that clusters on multi-tenant architecture will not have this feature.

To begin adding some "chaos", go to the "Clusters" menu for your organization, then find the project you'd like to work with. In the example shown below, I will use project "jg-MongoDB-Atlas-2017" to perform the chaos test.

Pick your project to perform chaos test

Once you reach the main window that lists your cluster, open the ellipsis menu and select "Test Failover."

Once you select “Test Failover”, you'll be brought to an information box that will inform you of what actions are about to happen:

Information box

Now click "RESTART PRIMARY", which will initiate the failover test as described above. You'll be shown a new window which informs you the test is underway.

You'll be able to tell what is going on by clicking on the cluster's name and reviewing the process as it occurs:

Review process as it occurs

You can see that the primary role has moved to a new node and that the failed-over instance is having its data resynced from the new primary. If you are reviewing an application's stability, this is the time to run a Selenium test or a curl script that hits an endpoint, to confirm that the connection to your database behaves as expected.
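
As an alternative to a one-off curl call, a small polling script can record how long the application was actually unavailable during the test. This is a rough sketch that uses only the Python standard library. The names `http_ok`, `poll`, and `longest_outage` are hypothetical, and it assumes your application exposes some endpoint that returns HTTP 200 only when it can reach the database.

```python
import time
import urllib.request
from typing import Callable, List, Tuple

Sample = Tuple[float, bool]  # (seconds since start, did the check succeed?)


def http_ok(url: str, timeout: float = 2.0) -> bool:
    """Return True if the endpoint answers with HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except OSError:  # URLError, timeouts and refused connections derive from OSError
        return False


def poll(
    check: Callable[[], bool],
    samples: int,
    interval: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> List[Sample]:
    """Call check() a fixed number of times, recording when each result was taken."""
    start = clock()
    results: List[Sample] = []
    for _ in range(samples):
        results.append((clock() - start, check()))
        sleep(interval)
    return results


def longest_outage(samples: List[Sample]) -> float:
    """Longest span in seconds from the first failed check of a run to the next success."""
    longest = 0.0
    outage_start = None
    for elapsed, ok in samples:
        if not ok and outage_start is None:
            outage_start = elapsed
        elif ok and outage_start is not None:
            longest = max(longest, elapsed - outage_start)
            outage_start = None
    if outage_start is not None:
        # The run ended while checks were still failing.
        longest = max(longest, samples[-1][0] - outage_start)
    return longest
```

Started just before you click "RESTART PRIMARY", a call such as `longest_outage(poll(lambda: http_ok("https://app.example.com/health"), samples=120, interval=1.0))` gives an approximate figure for the interruption your users would see. The URL here is a placeholder, and the result is only as precise as the polling interval.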

When completed, you'll see a new primary selected and the continuation of normal service:

New primary selected

That's it — there's no need to modify connection strings or edit your app. Your cluster's backup, replication, and other services will continue with no required intervention from you.

If you’re new to managed MongoDB services, we encourage you to start with our free tier. For existing customers of third party service providers, be sure to check out our migration offerings and learn about how you can get 3 months of free service.