Resumable Initial Sync in MongoDB 4.4
Hello, everyone. My name is Nuno and I have been working with MongoDB databases for almost eight years now as a sysadmin and as a Technical Services Engineer.
Before MongoDB 4.4, interruptions such as network partitions between the sync source and the node doing the initial sync caused the process to fail, forcing it to restart from scratch to ensure database consistency.
This became particularly problematic with large datasets: an initial sync of a dataset measured in terabytes can take up to several days.
Resumable Initial Sync now enables nodes doing initial sync to survive events like transient network errors or a sync source restart when fetching data from the sync source node.
- Falling off the oplog
- Transient network failures
The initial sync process will restart the interrupted or failed command and keep retrying until the command succeeds, a non-resumable error occurs, or the period specified by the retry-period parameter passes (default: 24 hours). These restarts are constrained to use the same sync source and are not tolerant of rollbacks on the sync source. That is, if the sync source experiences a rollback, the entire initial sync attempt will fail.
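The retry rule described above can be sketched in a few lines of Python. This is only an illustration of the three exit conditions (success, non-resumable error, retry period elapsed), not the server's implementation. The names `run_with_retry`, `ResumableError`, `NonResumableError`, and `backoff_secs` are hypothetical, and the sketch assumes the retry period is measured from the first failure of the current outage.

```python
import time
from typing import Callable, Optional


class ResumableError(Exception):
    """Hypothetical marker for transient errors that may be retried."""


class NonResumableError(Exception):
    """Hypothetical marker for errors that must abort the sync attempt."""


def run_with_retry(
    command: Callable[[], str],
    retry_period_secs: float = 24 * 60 * 60,  # default from the article: 24 hours
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
    backoff_secs: float = 1.0,  # assumption: fixed pause between attempts
) -> str:
    """Retry `command` until it succeeds, hits a non-resumable error,
    or the retry period has elapsed since the first failure."""
    outage_started: Optional[float] = None
    while True:
        try:
            return command()  # success ends the loop
        except ResumableError:
            now = clock()
            if outage_started is None:
                outage_started = now  # first failure starts the timer
            if now - outage_started >= retry_period_secs:
                raise  # retry period exhausted: give up
            sleep(backoff_secs)
        # NonResumableError (and anything else) propagates immediately.
```

Passing `clock` and `sleep` as parameters keeps the sketch testable without waiting in real time.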
Resumable errors include retriable errors, which cover all network errors as well as some other transient errors.
Cursor-related errors such as ErrorCodes::QueryPlanKilled mean the collection may have been dropped, renamed, or modified in a way that caused the cursor to be killed. These errors will cause ErrorCodes::InitialSyncFailure and will be treated the same as transient retriable errors (except for not killing the cursor): they are marked as retriable and allow the initial sync to resume where it left off.
On ErrorCodes::NamespaceNotFound, the initial sync will skip the entire collection and return success. Even if the collection has been renamed, simply resuming the query is sufficient since the query is by UUID; the name change will be handled during oplog application.
All other errors are non-resumable.
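As a rough illustration of the classification above, the following Python sketch maps an error to the action the initial sync would take. The `Action` enum, the `classify` function, and the `is_retriable` flag are hypothetical names introduced here for illustration. Only the error code names that appear in the text are used, and the real server logic lives in C++ and is more involved.

```python
from enum import Enum


class Action(Enum):
    RESUME = "resume"          # retry and continue where the clone left off
    SKIP_COLLECTION = "skip"   # collection is gone: skip it and report success
    FAIL = "fail"              # non-resumable: the attempt fails


# Cursor-related code named in the text; other codes of this kind
# would be added to this set.
CURSOR_KILLED_CODES = {"QueryPlanKilled"}


def classify(code_name: str, is_retriable: bool = False) -> Action:
    """Decide what to do with an error seen while cloning.

    `is_retriable` stands in for the server's own notion of a retriable
    error (network errors and some other transient errors).
    """
    if code_name == "NamespaceNotFound":
        return Action.SKIP_COLLECTION
    if is_retriable or code_name in CURSOR_KILLED_CODES:
        return Action.RESUME
    return Action.FAIL
```

For example, `classify("QueryPlanKilled")` yields `Action.RESUME`, while an unrecognized, non-retriable code yields `Action.FAIL`.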
The default retry period is 24 hours (86,400 seconds). A database administrator can choose to increase this period with the following command:
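As a sketch of how this could be done from Python, the snippet below builds a `setParameter` command document. It assumes the retry period is controlled by the `initialSyncTransientErrorRetryPeriodSeconds` server parameter and that it can be changed at runtime; please confirm both against the server parameter reference for your version. The helper name `build_retry_period_command` is hypothetical.

```python
from typing import Any, Dict


def build_retry_period_command(seconds: int) -> Dict[str, Any]:
    """Build a setParameter command that changes the retry period.

    Assumption: the parameter is named
    initialSyncTransientErrorRetryPeriodSeconds (default 86400).
    """
    if seconds <= 0:
        raise ValueError("retry period must be a positive number of seconds")
    # The command name must be the first key; dicts keep insertion order.
    return {
        "setParameter": 1,
        "initialSyncTransientErrorRetryPeriodSeconds": seconds,
    }


if __name__ == "__main__":
    # Example: raise the retry period to 48 hours.
    print(build_retry_period_command(48 * 60 * 60))
    # With PyMongo installed and a node to connect to, the document could be
    # sent to the admin database, for example:
    #   from pymongo import MongoClient
    #   MongoClient("mongodb://localhost:27017").admin.command(
    #       build_retry_period_command(48 * 60 * 60))
```

The connection string in the comment is a placeholder; the command would be run against the node performing the initial sync.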
Note: The 24-hour value is the default period estimated for a database administrator to detect any ongoing failure and be able to act on restarting the sync source node.
The full resumable behavior will always be available between 4.4 nodes regardless of the feature compatibility version (FCV). Between 4.2 and 4.4 nodes, the initial sync will not be resumable during the query phase of the CollectionCloner (where we are actually reading data from collections), nor will it be resumable after a collection rename, regardless of which node is 4.4. Resuming after transient failures in other commands will be possible when the syncing node is 4.4 and the sync source is 4.2.
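The version rules above can be summarized as a small lookup. The following Python sketch is one reading of that paragraph, with a hypothetical function name and dictionary keys. It also assumes that a 4.2 syncing node cannot resume anything, which the text implies but does not state outright.

```python
from typing import Dict


def resumable_phases(syncing_node: str, sync_source: str) -> Dict[str, bool]:
    """Return which parts of initial sync can resume for a version pair.

    Versions are given as "4.2" or "4.4".
    """
    both_44 = syncing_node == "4.4" and sync_source == "4.4"
    # Other commands can resume when the syncing node is 4.4, even if the
    # sync source is still 4.2.
    other_commands = syncing_node == "4.4" and sync_source in ("4.2", "4.4")
    return {
        "collection_query": both_44,         # CollectionCloner query phase
        "after_collection_rename": both_44,  # resume after a rename
        "other_commands": other_commands,    # transient failures elsewhere
    }
```

For a 4.4 node syncing from a 4.2 source, only `other_commands` is `True`.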
During initial sync, the sync source node can become unavailable (due to either a network failure or a process restart) and the initial sync will still be able to resume and complete.
Here are examples of what messages to expect in the logs.
Initial Sync attempt successfully started:
Messages caused by network failures (or sync source node restart):
Initial Sync is resumed after being interrupted:
Data cloners resume:
Data cloning phase completes successfully. Oplog cloning phase starts:
Initial Sync completes successfully and statistics are provided:
The metrics are:
An example of this output is:
Upgrade your MongoDB database to v4.4 and take advantage of the new Resumable Initial Sync feature. Your deployment will then survive transient network errors or a sync source restart during initial sync.