Replica set health is more than just replication lag



“Replication lag,” which is a measure of how far a secondary is behind the primary, is the first indicator you should examine to understand the health of your replica sets. Ideally you will have no lag so that should your primary fail, all data will be available when the most up-to-date secondary is elected to take over. If you use MMS, you can set an alert so that you are informed when replication lag exceeds a particular threshold. If you do find that your lag is significant, you need to delve into why your secondaries cannot keep up with the volume.
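To make the lag number concrete, here is a minimal sketch that derives each secondary's lag from a document shaped like the output of the `replSetGetStatus` command, using its `members`, `name`, `stateStr`, and `optimeDate` fields. The function name and the sample data are hypothetical, and this is not necessarily how MMS computes its own chart.

```python
from datetime import datetime, timedelta


def replication_lag(status):
    """Return {member name: lag as a timedelta} for every secondary.

    `status` is assumed to look like the result of replSetGetStatus:
    a dict with a "members" list whose entries carry "name",
    "stateStr", and "optimeDate" (time of the last applied operation).
    """
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: primary["optimeDate"] - m["optimeDate"]
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Hypothetical sample data, not taken from a real deployment.
now = datetime(2014, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "db1:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "db2:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "db3:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(minutes=5)},
    ]
}

lags = replication_lag(sample_status)
for name, lag in lags.items():
    print(name, lag.total_seconds(), "seconds behind")
```

In this made-up example, `db3` is five minutes behind, which is the kind of value a lag alert threshold would be compared against.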

You also need to understand how the size of your oplog and the volume of operations you handle affect your infrastructure. The oplog has a fixed size (you can change it, but doing so requires shutting down MongoDB) and constantly overwrites its oldest entries with the newest. The bigger your oplog, the more operations fit into it; the higher your operation volume, the more quickly the older entries are overwritten.
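The relationship between oplog size and operation volume can be sketched with simple arithmetic. The function below is a rough estimate that assumes a steady write rate measured in megabytes of oplog per hour; the name, the units, and the numbers are all illustrative assumptions.

```python
def estimated_oplog_window_hours(oplog_size_mb, oplog_mb_per_hour):
    """Approximate hours of history the oplog holds at a steady write rate."""
    if oplog_mb_per_hour <= 0:
        raise ValueError("write rate must be positive")
    return oplog_size_mb / oplog_mb_per_hour


# Hypothetical numbers: a 10 GB oplog at two different write rates.
quiet = estimated_oplog_window_hours(10 * 1024, 128)   # light load
busy = estimated_oplog_window_hours(10 * 1024, 1024)   # heavy load

print(f"light load: {quiet:.1f} hours of history")
print(f"heavy load: {busy:.1f} hours of history")
```

With these numbers, the same oplog covers 80 hours under light load but only 10 hours when the write rate is eight times higher.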

This gives rise to a metric MMS calls “replication oplog window” – the time difference between the oldest and newest entries in your oplog. That’s the amount of time an operation will remain in the oplog before being overwritten by a new entry. There are two consequences you must be aware of that emerge from this number.
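The definition translates directly into code. This sketch assumes oplog entries are supplied in insertion order as plain dicts whose `ts` value is seconds since the Unix epoch. Real oplog documents (in the `local.oplog.rs` collection of a replica set) store `ts` as a BSON timestamp, so a driver-based version would need to convert it first. The sample entries are made up.

```python
from datetime import timedelta


def oplog_window(entries):
    """Time span between the oldest and newest oplog entries.

    `entries` is assumed to be in insertion order, each a dict with a
    "ts" key holding seconds since the Unix epoch (a simplification of
    the BSON timestamp used in a real oplog).
    """
    if not entries:
        return timedelta(0)
    oldest, newest = entries[0]["ts"], entries[-1]["ts"]
    return timedelta(seconds=newest - oldest)


# Hypothetical entries: only the first and last matter for the window.
sample_entries = [
    {"ts": 1_400_000_000, "op": "i"},
    {"ts": 1_400_040_000, "op": "u"},
    {"ts": 1_400_086_400, "op": "d"},
]

window = oplog_window(sample_entries)
print(window)  # 1 day, 0:00:00
```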

First, your replication oplog window tells you how long a secondary member can be offline and still catch up to the primary without doing a full resync. Once you exceed that time, oplog entries that have not yet replicated get overwritten and cannot be applied. Since a full DB copy is much slower than catching up from the oplog, knowing that time frame can inform your operations policy on how quickly downed secondaries must be repaired.

The second, and more subtle, issue is that replication oplog window is also the maximum amount of time it can take to perform the initial phase of a full sync (either when adding a new secondary, or fully resyncing a stale one). In this phase, the entire database is copied to the secondary, while the oplog keeps track of operations performed on the primary since the start of the copy. If it takes longer than your replication oplog window to copy the data from the primary to the secondary, then by the time that copy is done, the oplog will have lost track of operations that weren't reflected in the initial copy. This means that you will not be able to *resync* any stale secondaries, or *add* any new secondaries! The only way to recover from this state is to shut down the primary and allocate a larger oplog.

The default oplog size is generally large enough to prevent this from happening, but the larger your database, the longer that initial copy will take, and the higher your operation volume, the less time you have for that copy to complete. By knowing your oplog window and the time it takes to copy your entire database from the primary to a secondary, you can avoid that potentially disastrous pitfall.
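That comparison can be written down as a small check. The sketch below assumes you have measured your data size, an achievable copy rate between members, and your oplog window at peak load. The `safety_factor` parameter is an added assumption (a margin for the window shrinking during the copy), not something MongoDB or MMS prescribes, and all numbers are hypothetical.

```python
def initial_sync_fits(data_size_gb, copy_rate_gb_per_hour, window_hours,
                      safety_factor=2.0):
    """Return (fits, copy_hours) for a full copy against the oplog window.

    The copy is considered safe only if its estimated duration, scaled
    by `safety_factor`, still fits inside the oplog window.
    """
    if copy_rate_gb_per_hour <= 0:
        raise ValueError("copy rate must be positive")
    copy_hours = data_size_gb / copy_rate_gb_per_hour
    return copy_hours * safety_factor <= window_hours, copy_hours


# Hypothetical: 500 GB copied at 50 GB/hour takes about 10 hours.
ok_quiet, hours = initial_sync_fits(500, 50, window_hours=80)
ok_busy, _ = initial_sync_fits(500, 50, window_hours=10)

print(f"copy takes about {hours:.0f} hours")
print("fits with an 80-hour window:", ok_quiet)
print("fits with a 10-hour window:", ok_busy)
```

Here the 10-hour copy fits comfortably in an 80-hour window but fails the check against a 10-hour window, which is the situation to catch before you need to resync.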

Bear in mind that replication oplog window is not a constant value; it fluctuates with the volume of operations your replica set is handling. Under peak traffic, your replication oplog window will shorten, so it is crucial that your capacity planning accounts for the busiest data ingestion times, when your window to recover is at its shortest.

Here is an example of replication oplog window charts fluctuating with load:


Using MMS, you can set alerts on replication oplog window getting too low on your primary. This, combined with monitoring and alerting on replication lag, will ensure your replica sets stay healthy!