Resumable Initial Sync in MongoDB 4.4
Hello, everyone. My name is Nuno and I have been working with MongoDB databases for almost eight years now as a sysadmin and as a Technical Services Engineer.
One of the most common challenges in MongoDB environments is when a replica set member requires a resync and the Initial Sync process is interrupted for some reason.
Interruptions such as network partitions between the sync source and the node doing the initial sync cause the process to fail, forcing it to restart from scratch to ensure database consistency.
This became particularly problematic with large datasets: when the data runs into terabytes, an initial sync can take several days.
You may have already noticed that I am talking in the past tense as this is no longer a problem you need to face. I am very happy to share with you one of the latest enhancements introduced by MongoDB in v4.4: Resumable Initial Sync.
Resumable Initial Sync now enables nodes doing initial sync to survive events like transient network errors or a sync source restart when fetching data from the sync source node.
Recovering replica set members with Initial Sync takes a long time on large datasets, and that exposes the process to two common challenges:
- Falling off the oplog
- Transient network failures
MongoDB became more resilient to these types of failures with MongoDB v3.4 by adding the ability to pull newly added oplog records during the data copy phase, and more recently with MongoDB v4.4 and the ability to resume the initial sync where it left off.
The initial sync process will restart the interrupted or failed command and keep retrying until the command succeeds, a non-resumable error occurs, or the period specified by the parameter initialSyncTransientErrorRetryPeriodSeconds passes (default: 24 hours). These restarts are constrained to use the same sync source, and they do not tolerate rollbacks on the sync source. That is, if the sync source experiences a rollback, the entire initial sync attempt will fail.
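The retry rule above can be pictured as a small loop. The following is a conceptual Python sketch, not the server's implementation: ResumableError, NonResumableError, and run_with_retry are hypothetical names, the one-second pause between attempts is an assumption, and so is the choice to start the clock at the first failure.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Default of initialSyncTransientErrorRetryPeriodSeconds (24 hours).
DEFAULT_RETRY_PERIOD_SECONDS = 24 * 60 * 60


class ResumableError(Exception):
    """Hypothetical stand-in for a transient, retriable failure."""


class NonResumableError(Exception):
    """Hypothetical stand-in for any failure that ends the attempt."""


def run_with_retry(
    command: Callable[[], T],
    retry_period_seconds: float = DEFAULT_RETRY_PERIOD_SECONDS,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    pause_seconds: float = 1.0,  # assumption: fixed pause between attempts
) -> T:
    """Retry `command` until it succeeds, fails non-resumably, or time runs out."""
    first_failure_at = None
    while True:
        try:
            # A NonResumableError raised here propagates to the caller untouched.
            return command()
        except ResumableError:
            now = clock()
            if first_failure_at is None:
                first_failure_at = now  # assumption: clock starts at first failure
            if now - first_failure_at >= retry_period_seconds:
                raise NonResumableError("retry period elapsed") from None
            sleep(pause_seconds)
```

Passing the clock and sleep functions as arguments keeps the sketch easy to exercise without actually waiting for 24 hours.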
Resumable errors include retriable errors, that is, those for which ErrorCodes::isRetriableError returns true. This covers all network errors as well as some other transient errors.
The errors ErrorCodes::NamespaceNotFound, ErrorCodes::OperationFailed, ErrorCodes::CursorNotFound, and ErrorCodes::QueryPlanKilled mean the collection may have been dropped, renamed, or modified in a way that caused the cursor to be killed. These errors will cause ErrorCodes::InitialSyncFailure and will be treated the same as transient retriable errors (except for not killing the cursor), mark ErrorCodes::isRetriableError as true, and will allow the initial sync to resume where it left off.
On ErrorCodes::NamespaceNotFound, the initial sync will skip the entire collection and return success. Even if the collection has been renamed, simply resuming the query is sufficient because the query is by UUID; the name change will be handled during oplog application.
All other errors are non-resumable.
The default retry period is 24 hours (86,400 seconds). A database administrator can increase this period by setting the initialSyncTransientErrorRetryPeriodSeconds parameter.
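As a sketch of what that could look like from a Python client, the snippet below builds the setParameter command document for initialSyncTransientErrorRetryPeriodSeconds. The helper name retry_period_command is hypothetical, and the commented-out call assumes the pymongo driver and a direct connection to the member performing the initial sync (the host name is made up).

```python
def retry_period_command(hours: float) -> dict:
    """Build a setParameter command document for the retry period.

    Hypothetical helper: it only prepares the document and sends nothing.
    """
    seconds = round(hours * 3600)
    if seconds <= 0:
        raise ValueError("retry period must be positive")
    # "setParameter" must stay the first key: it is the command name.
    return {
        "setParameter": 1,
        "initialSyncTransientErrorRetryPeriodSeconds": seconds,
    }


# Example: raise the period from the 24-hour default to 48 hours.
command = retry_period_command(48)

# With pymongo installed, the document could be sent like this
# (assumed, hypothetical host name):
#
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://syncing-member.example.net:27017/?directConnection=true")
#   client.admin.command(command)
```

The same document can be passed to db.adminCommand() in the MongoDB shell.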
Note: The 24-hour value is the default period estimated for a database administrator to detect any ongoing failure and be able to act on restarting the sync source node.
The full resumable behavior will always be available between 4.4 nodes, regardless of Feature Compatibility Version (FCV). Between 4.2 and 4.4 nodes, the initial sync will not be resumable during the query phase of the CollectionCloner (where we are actually reading data from collections), nor will it be resumable after a collection rename, regardless of which node is 4.4. Resuming after transient failures in other commands will be possible when the syncing node is 4.4 and the sync source is 4.2.
During initial sync, the sync source node can become unavailable (due to either a network failure or a process restart), and the initial sync will still be able to resume and complete.
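The version rules for mixed 4.2 and 4.4 replica sets can be summarized in a few lines of Python. This is only a reading aid: resumable_scope and its dictionary keys are hypothetical names, versions are modeled as (major, minor) tuples, and the sketch covers just the 4.2 and 4.4 combinations discussed here. Treating a 4.2 syncing node as never resumable is an inference from the feature having been introduced in 4.4.

```python
from typing import Dict, Tuple

Version = Tuple[int, int]  # (major, minor), e.g. (4, 4)

RESUMABLE_SINCE: Version = (4, 4)


def resumable_scope(syncing_node: Version, sync_source: Version) -> Dict[str, bool]:
    """Return which parts of initial sync can resume for a version pair."""
    syncing_ok = syncing_node >= RESUMABLE_SINCE
    source_ok = sync_source >= RESUMABLE_SINCE
    both_ok = syncing_ok and source_ok
    return {
        # Reading collection data needs 4.4 on both sides.
        "collection_query_phase": both_ok,
        # So does resuming after a collection rename.
        "after_collection_rename": both_ok,
        # Other commands only need a 4.4 syncing node.
        "other_commands": syncing_ok,
    }


for pair in [((4, 4), (4, 4)), ((4, 4), (4, 2)), ((4, 2), (4, 4))]:
    print(pair, resumable_scope(*pair))
```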
Here are examples of what messages to expect in the logs.
Initial Sync attempt successfully started:
Messages caused by network failures (or sync source node restart):
Initial Sync is resumed after being interrupted:
Data cloners resume:
Data cloning phase completes successfully. Oplog cloning phase starts:
Initial Sync completes successfully and statistics are provided:
The new InitialSync statistics from replSetGetStatus.initialSyncStatus can be useful to review the initial sync progress status.
Starting in MongoDB 4.2.1, replSetGetStatus.initialSyncStatus metrics are only available when run on a member during its initial sync (i.e., STARTUP2 state).
The metrics are:
- syncSourceUnreachableSince - The date and time at which the sync source became unreachable.
- currentOutageDurationMillis - The time in milliseconds that the sync source has been unavailable.
- totalTimeUnreachableMillis - The total time in milliseconds that the member has been unavailable during the current initial sync.
- operationsRetried - Total number of all operation retry attempts.
- rollBackId - The sync source's rollback identifier at the start of the initial sync attempt.
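A monitoring script could turn the outage metrics into a short summary. The sketch below works on a plain dictionary shaped like a replSetGetStatus reply; the function name outage_summary and the keys of its result are hypothetical, and it assumes the outage fields are simply absent when the sync source is reachable.

```python
from typing import Any, Dict, Optional


def outage_summary(repl_set_status: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Summarize sync source outages from a replSetGetStatus reply.

    Returns None when the reply has no initialSyncStatus section, which is
    expected once the member has left initial sync.
    """
    sync = repl_set_status.get("initialSyncStatus")
    if sync is None:
        return None
    # Assumption: missing outage fields mean "no outage so far".
    current_ms = sync.get("currentOutageDurationMillis", 0)
    total_ms = sync.get("totalTimeUnreachableMillis", 0)
    return {
        "source_unreachable_since": sync.get("syncSourceUnreachableSince"),
        "current_outage_seconds": current_ms / 1000,
        "total_unreachable_seconds": total_ms / 1000,
    }


# With pymongo installed, the reply could come from (assumed client object):
#
#   status = client.admin.command("replSetGetStatus")
#   print(outage_summary(status))
```

Keeping the function free of any driver calls means it can be fed a saved reply as easily as a live one.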
An example of this output is:
Upgrade your MongoDB database to the new v4.4 and take advantage of the new Resumable Initial Sync feature. Your deployment will now survive transient network errors and sync source restarts.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.