Atlas Update – Faster Scaling

Jay Gordon


MongoDB Atlas has only been around for a few months and already we’ve improved the speed at which you can scale your clusters.

“Scaling should be faster than a speeding bullet” - Jennifer Seelin
Jennifer - Corporate Communications Super Hero @MongoDB.

Some heroes indeed do wear capes, like the Cloud Engineers at MongoDB. OK, they really do not wear capes but they do get coffee and plenty of snacks. Recently it was time to look at the time spent on working with MongoDB Atlas, and specifically scaling to your requirements.

Your time, your data and all things you do are precious. No one needs to spend extra time waiting on larger sized hardware or cloud servers. When we first launched MongoDB Atlas, modifying your cluster was a lot of work on the back end. Our engineering department had to take a hard look at how scaling worked and find a way to make better use of our cloud vendor’s rich APIs and standard UNIX tools.

One of the more common tasks our users have is to scale from one size to another. For our initial implementation we chose a process that was simple and worked for all configuration changes, but was not particularly fast. When a user requested a new configuration we would terminate one server, build a new server in the new configuration, and then wait for MongoDB replication to copy the data and rebuild the indexes on the new server. This was repeated for each server until all servers satisfied the new configuration.

Let’s look at the “old way” from a high level:

While this was an extremely reliable process, it could be time consuming. Each step required waiting on the AWS API to respond with the proper status of the instance while we continued to serve data. Throughout the process, your primary and at least one secondary remained online. The downside was that index builds sometimes took far too long. It was time to make a fundamental change to this process.
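The old rolling replacement described above can be sketched as a simple loop. This is a toy model, not Atlas's actual code: the `Member` class, the node names and the step labels are all hypothetical and only illustrate the order of operations.

```python
from dataclasses import dataclass


@dataclass
class Member:
    # Hypothetical stand-in for one server in a replica set.
    name: str
    instance_size: str


def rolling_replace(members, new_size):
    """Replace each member in turn and return the ordered list of steps."""
    steps = []
    for i, old in enumerate(members):
        steps.append(f"terminate {old.name} ({old.instance_size})")
        # The replacement server starts with an empty data directory.
        members[i] = Member(old.name, new_size)
        steps.append(f"build {old.name} ({new_size})")
        # Initial sync: copy all data, then rebuild every index.
        steps.append(f"initial sync {old.name}: copy data")
        steps.append(f"initial sync {old.name}: rebuild indexes")
    return steps


replica_set = [Member("node-a", "M30"), Member("node-b", "M30"), Member("node-c", "M30")]
for step in rolling_replace(replica_set, "M40"):
    print(step)
```

Only one member is out of service at any point in the loop, which is why the set keeps serving data, but every member pays the full cost of an initial sync.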

To speed up scaling, we have implemented a new and faster process for our M30 through M100 instance types.

Let’s upgrade from an M30 instance to an M40. We’ve gone ahead and accessed our Atlas UI and started our changes:

Atlas receives the details of your new plan and then begins the process of putting it into service. A basic overview of the flow inside Atlas can be seen here:

Now we’ve reached the “Plan Execution” point of our scaling. How we improved speed has a lot to do with the replication of your data.

As mentioned earlier, our previous method (still in use for upgrades from M10/M20 instances to M30 and above, due to resource restrictions by Amazon Web Services) requires MongoDB replication and index creation for the new instances during an initial sync.

“Avoid the Initial-Sync”

Rather than destroy your EC2 instances, we stop them one at a time in a rolling fashion and then notify the AWS API of your new MongoDB Atlas class. If you are also modifying your disk attributes, we avoid the time-consuming initial sync by turning to standard UNIX utilities.
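To make the stop-and-resize step concrete, here is a minimal sketch of what a rolling in-place resize could look like against the EC2 API. It is not Atlas's actual implementation. The function name `resize_in_place` is hypothetical, and the `ec2` argument is assumed to be a boto3 EC2 client (for example `boto3.client("ec2")`); only the client methods and waiter names are real boto3 calls.

```python
def resize_in_place(ec2, instance_ids, new_instance_type):
    """Stop, retype, and restart each instance, one at a time."""
    for instance_id in instance_ids:
        ec2.stop_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

        # EC2 only allows the instance type to change while the instance is stopped.
        ec2.modify_instance_attribute(
            InstanceId=instance_id,
            InstanceType={"Value": new_instance_type},
        )

        ec2.start_instances(InstanceIds=[instance_id])
        ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

A production version would presumably also wait for each replica set member to rejoin and catch up before moving on to the next one. Which EC2 instance type backs each Atlas class is not stated here, so any value passed as `new_instance_type` is an assumption.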

We review the data currently on your deployment’s disks and validate that our checksums match to ensure a complete replica of the data. We also validate once more by reviewing the existing members in your set and ensuring normal replication continues.
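As an illustration of the checksum validation, the sketch below compares two directory trees file by file. The post does not say which checksum or tool is used, so SHA-256 and the helper names (`file_sha256`, `tree_checksums`, `copies_match`) are assumptions made for this example.

```python
import hashlib
from pathlib import Path


def file_sha256(path, chunk_size=1 << 20):
    """Hash one file in chunks so large data files never sit fully in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def tree_checksums(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        path.relative_to(root).as_posix(): file_sha256(path)
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def copies_match(source_dir, copy_dir):
    """True only if both trees hold the same files with identical contents."""
    return tree_checksums(source_dir) == tree_checksums(copy_dir)
```

Comparing the two dictionaries catches missing files, extra files and changed contents in one step.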

As we reach the point of the resize, a job is kicked off that creates a duplicate of your data directory along with all previously existing indexes and options. There’s no need for MongoDB to spend time rebuilding any of this. Here’s where the actual speed of the process comes from: we no longer have to recreate indexes. For larger deployments, rebuilding indexes can actually take longer than copying the data.

Once the data copy completes, MongoDB resumes reading the oplog to apply any writes that occurred during the upgrade process. This is how our standard replication works.
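The catch-up step can be pictured with a toy model of oplog replay. This is only a sketch of the idea: the entry format below (`ts`, `op`, `key`, `value`) is hypothetical and much simpler than MongoDB's real oplog documents, and `catch_up` is a made-up helper.

```python
def catch_up(data, last_applied_ts, oplog):
    """Apply every oplog entry newer than last_applied_ts, in timestamp order."""
    for entry in sorted(oplog, key=lambda e: e["ts"]):
        if entry["ts"] <= last_applied_ts:
            continue  # already reflected in the copied data
        if entry["op"] == "set":
            data[entry["key"]] = entry["value"]
        elif entry["op"] == "delete":
            data.pop(entry["key"], None)
        last_applied_ts = entry["ts"]
    return last_applied_ts


# State of the copied data directory, current as of timestamp 2.
copied = {"a": 1, "b": 2}
oplog = [
    {"ts": 1, "op": "set", "key": "a", "value": 1},
    {"ts": 2, "op": "set", "key": "b", "value": 2},
    {"ts": 3, "op": "set", "key": "a", "value": 10},  # written during the resize
    {"ts": 4, "op": "delete", "key": "b"},            # written during the resize
]
print(catch_up(copied, 2, oplog), copied)
```

Because the copy already contains everything up to its last applied timestamp, the member only has to replay the writes it missed instead of starting from scratch.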

As this upgrade process occurs one member at a time, you will have no downtime, no outages and a continuation of normal service. Your connection string is never modified by this, so there’s no need to make code changes on the application side to reflect that you’ve scaled.

“Eradicate Downtime”

Need to size up because your application is taking off? Finished with a project and need a smaller instance size? This new method of scaling ensures you can keep up with your always-changing environment and remain stable. The ability to change your cluster with no downtime is one of the most powerful features of our offering and a reason developers and companies continue to migrate their workloads to Atlas.