Your Ultimate Guide to Rolling Upgrades

Bryan Reinero

#Technical

No matter what database you use, there’s a variety of maintenance tasks that are periodically performed to keep your system healthy. And no matter what database you use, maintenance work on a production system can be risky. For this reason maintenance work is typically performed during periods of scheduled downtime – the database is taken offline, and normal business operations are suspended. Usually these hours are more convenient for users, but less so for the operations teams (e.g., early morning hours on the weekend).

Scheduled maintenance is a pain, and today’s users are intolerant of prolonged downtime. They expect their applications to be available all of the time. You need a method for making updates while keeping your system available. Enter replication and rolling maintenance.

Replica sets may be familiar to you as the mechanism by which MongoDB provides high availability, assuring the database recovers quickly from node failures or network partitions. But replica sets also give you the ability to perform heavyweight maintenance tasks without affecting database performance or losing availability. By performing maintenance on each secondary, one by one, the primary node is never subject to loss of availability or degraded performance. This is the key to rolling maintenance.

The whole operation is like bringing a team of race cars into the pits one at a time. While one car is in the pits, the teammate continues racing on the track, assuring there is never a moment when a team car isn’t on-track.


The per-node maintenance operation is accomplished by first restarting the node in stand-alone mode, executing the maintenance task (e.g. index build, version upgrade, or compaction), then finally restarting the process in replSet mode once the task is complete.

You can shut down the secondary from the mongo shell with the following command.

replset:SECONDARY> db.adminCommand( "shutdown" )

Restarting the node in stand-alone mode means you simply restart the process without the replSet command line parameter. Or if you use a configuration file, remove the replSet parameter from the config.
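For example, a node normally launched with the replSet parameter can be brought back up for maintenance as a stand-alone simply by leaving that parameter out. The data path, port numbers, and replica set name below are purely illustrative; restarting on a non-standard port also keeps application traffic away from the node while you work on it.

# normal start as a replica set member
$ mongod --dbpath /data/db --port 27017 --replSet replset

# restart for maintenance as a stand-alone, on a different port
$ mongod --dbpath /data/db --port 37017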

This pattern is repeated for each secondary node in the replica set, until the primary is the last node which hasn’t had the maintenance performed. At this point, the primary is “stepped-down” to secondary status with use of the rs.stepDown() command. A new primary will be elected and the last maintenance tasks can be completed on this former primary.

Preparing for Rolling Maintenance

Performing rolling maintenance is a great convenience, but there are some gotchas you should be aware of. You'll need a plan that assures a smooth, predictable process by minimizing human error, misconfiguration, and unnecessary loss of availability. There are four areas your maintenance plan should cover.

  • Pre-check the oplog size
  • Step down the primary predictably
  • Assure sufficient availability during maintenance
  • Declare indexes carefully and consistently

Pre-Check Your Oplog Size

At the heart of MongoDB's replication system is a circular, fixed-size collection called the oplog (short for "operation log"). When the primary node receives a write operation it records it to the oplog. The secondaries can then replicate from the primary by consuming each operation in the primary's oplog and applying that operation to themselves.
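If you'd like to see the oplog for yourself, it lives in the local database as a capped collection named oplog.rs. As a quick illustration, the following shell query returns the most recent operation recorded on the primary:

foo:PRIMARY> use local
foo:PRIMARY> db.oplog.rs.find().sort( { $natural: -1 } ).limit( 1 )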

It’s helpful to think of the oplog like a tape loop. Just as a tape loop lets me record continuously without ever running out of tape, the oplog allows a replica set node to record operations continuously without consuming ever-increasing amounts of disk. And just as the length of the loop determines how much I can record before the tape starts overwriting itself, the size of the oplog determines how many operations it retains before overwriting itself.

If a node failure or network partition temporarily isolates my secondary, it won't be able to replicate. That's OK, since the secondary remembers where it was in the oplog and can catch up once it is able to reconnect. However, if the secondary stays isolated for long enough, the primary could overwrite the secondary's position in the oplog. This is called "falling off the oplog", and when it happens the secondary has fallen too far behind the primary to ever catch up. The secondary will need a full resync to catch up with the primary.

Performing rolling maintenance can mean that we’ll be pulling secondaries out of the replica set long enough to run compactions or build indexes. Because these operations can be time consuming, we need to make sure that the primary’s oplog is big enough to hold all the writes that happen while the secondary is down. This is why it is more helpful to speak in units of time when describing the capacity of the oplog, rather than the amount of actual data it holds. Prior to starting a maintenance task, we need to check that there is enough time in the oplog to complete the operation on the secondary. We can check how much time is on the oplog by executing the following command in the mongo shell:

foo:PRIMARY> rs.printReplicationInfo()
configured oplog size:   192MB
log length start to end: 897secs (0.25hrs)
oplog first event time:  Wed Jul 15 2015 13:21:10 GMT-0700 (PDT)
oplog last event time:   Wed Jul 15 2015 13:36:07 GMT-0700 (PDT)
now:                     Wed Jul 15 2015 13:36:46 GMT-0700 (PDT)

The greater the size of the oplog, the more time your secondaries will have to perform their compactions and index builds without falling off the oplog. I recommend that your oplog be sized to contain 24 hours of writes at a minimum. While a compaction or an index build is not likely to take that much time, the oplog is providing a safety margin for node failures of all types. For instance, data centers can be subject to network failures, power outages, or even fires, which can cause a node to become isolated from the rest of the replica set for several hours. The 24 hour minimum gives the affected node a chance to catch up once the issue is resolved. An oplog covering three to five days of writes is a common size among our users.
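It also pays to confirm how far behind each secondary is before you move on to the next node. The rs.printSlaveReplicationInfo() shell helper reports each member's replication lag relative to the primary; the host name and timestamps below are illustrative output only.

foo:PRIMARY> rs.printSlaveReplicationInfo()
source: node2.example.net:27017
    syncedTo: Wed Jul 15 2015 13:36:07 GMT-0700 (PDT)
    0 secs (0 hrs) behind the primary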

You can increase the size of your oplog, either with a special rolling maintenance procedure of its own or with MongoDB’s Cloud Manager. Cloud Manager is an extremely useful tool when managing clusters, as it simplifies the task through automation. More on that in a moment.

Stepping Down the Primary Predictably

As we continue our rolling maintenance, going from one secondary to the next, it will eventually be the primary's turn. We'll want to step-down the primary gracefully, handing off primary status to a secondary before stopping the mongod process.

Commanding the primary to relinquish its title can be done easily with the rs.stepDown() command from the mongo shell. Here's an example:

replset:PRIMARY> rs.stepDown()
2015-06-16T16:01:25.298-0400 I NETWORK  DBClientCursor::init call() failed
2015-06-16T16:01:25.303-0400 E QUERY    Error: error doing query: failed
    at DBQuery._exec (src/mongo/shell/query.js:83:36)
    at DBQuery.hasNext (src/mongo/shell/query.js:240:10)
    at DBCollection.findOne (src/mongo/shell/collection.js:187:19)
    at DB.runCommand (src/mongo/shell/db.js:58:41)
    at DB.adminCommand (src/mongo/shell/db.js:66:41)
    at Function.rs.stepDown (src/mongo/shell/utils.js:1001:43)
    at (shell):1:4 at src/mongo/shell/query.js:83
2015-06-16T16:01:25.305-0400 I NETWORK  trying reconnect to 127.0.0.1:27017 (127.0.0.1) failed
2015-06-16T16:01:25.305-0400 I NETWORK  reconnect 127.0.0.1:27017 (127.0.0.1) ok
replset:SECONDARY> 

Notice that the mongod process dropped connections as a result of losing its Primary status. This is expected and normal. Dropping connections forces all currently connected clients to reconnect and thus refresh their understanding of the replica set status. This prevents clients from errantly sending writes to a former primary.

We can control which secondary node will become the primary with the use of priority settings in the replica set configuration. The node with the highest priority will become the primary, as long as its replication is up to date.

For example, say we've deployed our replica set across multiple data centers, locating 2 nodes in data center A and one node in data center B. Our application servers are located in center A, and we prefer to keep the primary node in data center A to minimize latency between the primary and application servers. If the nodes were all configured with the same priority, it's possible that the node in data center B could be elected to primary. When we configure the nodes in data center A with a higher priority than the node in data center B, we can assure the primary stays in data center A after a step-down.
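As a sketch, priorities are set by editing the replica set configuration from the shell. The member indexes and priority values below are illustrative for the two-plus-one deployment described above.

replset:PRIMARY> cfg = rs.conf()
replset:PRIMARY> cfg.members[0].priority = 2    // data center A
replset:PRIMARY> cfg.members[1].priority = 2    // data center A
replset:PRIMARY> cfg.members[2].priority = 1    // data center B
replset:PRIMARY> rs.reconfig( cfg )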

One additional note: when you issue rs.stepDown() to the highest-priority node in a replica set, it relinquishes its primary status and stays ineligible to regain primary status for 60 seconds. After 60 seconds the highest-priority node will again become eligible for primary and trigger an election. If for some reason you can’t shut down the server within those 60 seconds, the high-priority node may become primary yet again. The rs.stepDown() command accepts a timeout parameter, in seconds, which you may use to keep the high-priority node ineligible for primary for an explicit amount of time. For instance, the following command

replset:PRIMARY> rs.stepDown( 120 )

tells the primary to relinquish primary status and stay in secondary status for at least two minutes.

Maintaining Fault Tolerance During Maintenance

While a secondary runs in stand-alone mode it is unavailable as a failover node. Secondaries undergoing maintenance are effectively outside the replica set membership and can’t provide any failover support. Ops teams need to plan for this reduced availability of the cluster. Consider the following replica set, configured with two data-bearing nodes and an arbiter. This is a perfectly fine, fault-tolerant production deployment, but it can be considered only minimally available.

Should Murphy’s Law strike and a node happen to fail while a secondary is already in stand-alone mode, the replica set will have lost a majority of its nodes. The remaining node can’t retain Primary status in this situation and must step down to secondary status.

The rules for electing a primary require that a candidate node be in healthy communication with a majority of the nodes in the replica set. If the primary can’t maintain contact with its secondaries, it has to assume that it is isolated and must step down. This rule prevents split-brain scenarios in which there are erroneously two primaries in a set. When a three-node replica set is undergoing maintenance, the loss of any additional node will result in a loss of availability. A larger replica set, of five nodes for instance, won’t have this problem, since that set can lose up to two nodes and still maintain a majority.

For smaller replica sets of just three nodes, consider swapping out the target secondary with a temporary arbiter node. The arbiter replaces the secondary that is undergoing maintenance and preserves the replica set majority required to elect a primary. The swap involves both removing the target secondary from the replica set and adding in the new arbiter, and can be done in a single step with the rs.reconfig() command, as sketched below. Read the docs for more information.
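As a rough sketch (the host name, member index, and _id value here are purely illustrative), the single-step swap might look like this:

replset:PRIMARY> cfg = rs.conf()
replset:PRIMARY> cfg.members.splice( 2, 1 )    // drop the entry for the secondary under maintenance
replset:PRIMARY> cfg.members.push( { _id: 3, host: "arbiter.example.net:27017", arbiterOnly: true } )
replset:PRIMARY> rs.reconfig( cfg )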

Take Care When Building Indexes

Bear in mind that when building indexes in a rolling fashion, you’ll be declaring the index on each node as a set of identical but totally independent commands. It pays to make sure you declare the indexes on each node carefully and consistently. Declaring indexes on a node is easy, but we do see people caught in pitfalls when they inadvertently fat-finger the command. Take this example:

As a DBA I want to declare a unique index on field ‘a’ of the “test” collection. I’ll build my index across the following three-node replica set, in a rolling fashion, to avoid any loss of performance.

I begin the index building process by restarting node C in stand-alone mode and issuing the ensureIndex command.
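The intended declaration looks like this:

> db.test.ensureIndex( { a: 1 }, { unique: true } )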

After the index build on node C is complete, I restart it in replica set mode and allow the node to catch up in replication. Once node C is caught up, I repeat the process on node A. Notice, however, that I’ve forgotten to declare the index with the unique constraint. This means that duplicate values will be allowed on node A, but not on node C. The problems this will cause may not be immediately apparent, but trouble is coming.
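In other words, what actually ran on node A was something like the following, with the unique option missing:

> db.test.ensureIndex( { a: 1 } )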

At this point, I have built my indexes on two of the three nodes in the replica set. The primary is the last node which requires maintenance. I issue an rs.stepDown() to node B to cause it to relinquish its status as Primary. An election follows, resulting in node A becoming primary. Node B is restarted in stand-alone mode and the final index build is started. This time I didn’t forget the unique constraint.

At this point all the indexes are built on each node, but the index on node A lacks the unique constraint. Node A is also the replica set primary, so it will accept insertions of new documents where the field ‘a’ is duplicated. The indexes on nodes B and C do have the unique constraint, which prohibits them from accepting the replicated inserts.

The secondaries can’t replicate from the primary and automatically shut down with the following error in the log:

E11000 duplicate key error index

Once the secondaries shut down, node A will be the last running node in the replica set, and must relinquish its status as primary. Node A falls back into secondary status.

This is clearly an undesirable state to have your cluster in. To recover from this situation, node A will need to be shut down before nodes B and C can be restarted. Once B and C are up, a Primary will be elected from one of the two and the availability of the replica set will be restored. However, getting the cluster back up and operational may not be the biggest headache you’ll face. During the time that node A was primary it accepted writes which could not be replicated on nodes B and C. What should be done with those writes is wholly dependent on your specific use case and the nature of your data, but in any case please remember that this situation can and should be avoided by declaring indexes carefully.

Using scripts can help avoid fat-fingering index declarations. For instance, the following terminal command line can be used on each iterative index build.

$ mongo myDatabase --port 27018 --eval "db.myCollection.ensureIndex( { a: 1 }, { unique: true })"

This declares a new unique index on field ‘a’ in the namespace myDatabase.myCollection. We’ve used the mongo shell’s --eval parameter to pass the index declaration command to the node. However, I strongly suggest putting this command line inside a parameterized shell script, where the index declaration is static code. A simple script like the following example helps prevent getting the declaration wrong.

#! /bin/bash

if [ ! $# -eq 2 ]; then
    echo "USAGE 2 parameters required: 'hostname' 'port'";
    exit -1;
fi

# the index declaration itself is static; only the target host and port vary
mongo myDatabase --host "$1" --port "$2" --eval "db.myCollection.ensureIndex( { a: 1 }, { unique: true })"

Please note that while this script prevents me from getting the index declaration wrong, it doesn’t do any more than that. I still need to make sure that I won’t run out of oplog, that the node has been restarted in stand-alone mode, and that I am not inadvertently executing the index declaration against the primary node.

Make It Easy For Yourself

MongoDB Cloud Manager greatly reduces the complexity of maintenance tasks by performing several of these tasks automatically. Cloud Manager already supports upgrading your cluster’s version of MongoDB, as well as automating the addition or removal of nodes in the cluster. Index builds are planned for future releases of Cloud Manager. Cloud Manager also integrates monitoring, alerting, database backup, and restore services into one easy-to-use interface. Cloud Manager eases the execution of maintenance tasks, boosting confidence and lessening the possibility of human error. Plenty of additional information on Cloud Manager is available here, where you can also sign up for a free 30 day trial.

Just Keep It Rollin’ Along

There you have it. Rolling maintenance will liberate your weekends, keep your application up, and make your life sublime. Just remember to have a proper plan in place and you’ll be able to maintain your cluster with minimal loss of availability and high confidence. There’s a ton of additional information and learning resources to help you implement your maintenance plan.

Further detail on building indexes on replica sets is available.

Additional details on performing maintenance on replica sets are also available.


If you want to learn more about running a MongoDB system in production, please check out our advanced operations online course:

Register now


About the Author - Bryan
Bryan Reinero is US Developer Advocate at MongoDB fostering understanding and engagement in the community. Previously Bryan was a Senior Consulting Engineer at MongoDB, helping users optimize MongoDB for scale and performance and a contributor to the Java Driver for MongoDB. Earlier, Bryan was Software Engineering Manager at Valueclick, building and managing large scale marketing applications for advertising, retargeting, real-time bidding and campaign optimization. Earlier still, Bryan specialized in software for embedded systems at Ricoh Corporation and developed data analysis and signal processing software at the Experimental Physics Branch of Ames Research Center.