Resumable Initial Sync in MongoDB 4.4

Nuno Costa • Published Dec 16, 2021 • Updated May 16, 2022

Introduction

Hello, everyone. My name is Nuno and I have been working with MongoDB databases for almost eight years now as a sysadmin and as a Technical Services Engineer.
Before MongoDB 4.4, interruptions such as network partitions between the sync source and the node doing the initial sync caused the process to fail, forcing it to restart from scratch to ensure database consistency.
This was particularly problematic with large datasets: an initial sync of terabytes of data can take several days.
You may have already noticed that I am talking in the past tense, as this is no longer a problem you need to face. I am very happy to share with you one of the latest enhancements introduced by MongoDB in v4.4: Resumable Initial Sync.
Resumable Initial Sync now enables nodes doing initial sync to survive events like transient network errors or a sync source restart when fetching data from the sync source node.

Resumable Initial Sync

Recovering replica set members with Initial Sync on large datasets takes a long time, which exposes the procedure to two common challenges:
  • Falling off the oplog
  • Transient network failures
MongoDB became more resilient to these types of failures with MongoDB v3.4, which added the ability to pull newly added oplog records during the data copy phase, and more recently with MongoDB v4.4, which added the ability to resume the initial sync where it left off.

Behavioral Description

The initial sync process will restart the interrupted or failed command and keep retrying until the command succeeds, a non-resumable error occurs, or the period specified by the parameter initialSyncTransientErrorRetryPeriodSeconds passes (default: 24 hours). These restarts are constrained to use the same sync source and do not tolerate rollbacks on the sync source. That is, if the sync source experiences a rollback, the entire initial sync attempt will fail.
Resumable errors include retriable errors, that is, errors for which ErrorCodes::isRetriableError returns true. This covers all network errors as well as some other transient errors.
The errors ErrorCodes::NamespaceNotFound, ErrorCodes::OperationFailed, ErrorCodes::CursorNotFound, and ErrorCodes::QueryPlanKilled mean that the collection may have been dropped, renamed, or modified in a way that caused the cursor to be killed. These errors will cause ErrorCodes::InitialSyncFailure and will be treated the same as transient retriable errors (except for not killing the cursor), mark ErrorCodes::isRetriableError as true, and allow the initial sync to resume where it left off.
On ErrorCodes::NamespaceNotFound, the initial sync will skip the entire collection and return success. Even if the collection has been renamed, simply resuming the query is sufficient, since the query is by UUID; the name change will be handled during oplog application.
All other errors are non-resumable.
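To make the retry rule above more concrete, here is a minimal sketch of the three exit conditions (success, non-resumable error, retry period elapsed). It is not MongoDB's internal implementation: runWithRetries, isResumable, and the injected clock are hypothetical names made up for illustration, and measuring the period from the first failure is an assumption of this sketch.

```javascript
// Illustrative only: hypothetical names, not the server's actual code.
// Retries `command` until it succeeds, a non-resumable error occurs,
// or `retryPeriodSeconds` have passed since the first failure (assumption).
function runWithRetries(command, { isResumable, retryPeriodSeconds = 86400, now = Date.now }) {
  let firstFailureAt = null; // start of the current outage, in milliseconds
  let retries = 0;
  for (;;) {
    try {
      const result = command();
      return { ok: true, result, retries };
    } catch (err) {
      // Any error that is not resumable ends the attempt immediately.
      if (!isResumable(err)) {
        return { ok: false, reason: "non-resumable error", error: err, retries };
      }
      if (firstFailureAt === null) {
        firstFailureAt = now();
      }
      // Give up once the retry period has elapsed.
      if ((now() - firstFailureAt) / 1000 >= retryPeriodSeconds) {
        return { ok: false, reason: "retry period elapsed", error: err, retries };
      }
      retries += 1; // a real loop would wait before trying again
    }
  }
}
```

The clock is passed in as a parameter only so the time-based exit can be exercised without actually waiting.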

Configuring Custom Retry Period

The default retry period is 24 hours (86,400 seconds). A database administrator can choose to increase this period by setting the initialSyncTransientErrorRetryPeriodSeconds server parameter, for example with the setParameter command.
Note: The 24-hour default is an estimate of the time a database administrator needs to detect an ongoing failure and act on it by restarting the sync source node.
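As a sketch of how the retry period could be changed from the shell, the snippet below builds a setParameter command document and sends it with db.adminCommand. The helper name buildRetryPeriodCommand and its input check are my own additions for illustration, the 48-hour value is just an example, and I am assuming the command is run on the member performing the initial sync.

```javascript
// Builds the setParameter command document for the retry period.
// The validation is this sketch's own choice, not a documented server limit.
function buildRetryPeriodCommand(seconds) {
  if (!Number.isInteger(seconds) || seconds <= 0) {
    throw new RangeError("retry period must be a positive whole number of seconds");
  }
  return { setParameter: 1, initialSyncTransientErrorRetryPeriodSeconds: seconds };
}

// Example value: extend the period from 24 hours (86400) to 48 hours.
const retryPeriodCommand = buildRetryPeriodCommand(48 * 60 * 60);

// `db` only exists inside the MongoDB shell; the guard keeps the sketch
// runnable as plain JavaScript as well.
if (typeof db !== "undefined") {
  printjson(db.adminCommand(retryPeriodCommand));
}
```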

Upgrade/Downgrade Requirements and Behaviors

The full resumable behavior will always be available between 4.4 nodes regardless of FCV (Feature Compatibility Version). Between 4.2 and 4.4 nodes, the initial sync will not be resumable during the query phase of the CollectionCloner (where we are actually reading data from collections), nor will it be resumable after collection rename, regardless of which node is 4.4. Resuming after transient failures in other commands will be possible when the syncing node is 4.4 and the sync source is 4.2.
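The version rules above can be written down as a small lookup. This is only my reading of the paragraph, not an official compatibility table: resumableSupport and its field names are hypothetical, and the case of a 4.2 syncing node with a 4.4 sync source is an inference (the text only says other commands are resumable when the syncing node is 4.4).

```javascript
// Encodes the rules from the paragraph above; versions are "4.2" or "4.4".
// Hypothetical helper for illustration, not part of any MongoDB API.
function resumableSupport(syncingNodeVersion, syncSourceVersion) {
  const known = ["4.2", "4.4"];
  if (!known.includes(syncingNodeVersion) || !known.includes(syncSourceVersion)) {
    throw new Error("this sketch only covers versions 4.2 and 4.4");
  }
  const both44 = syncingNodeVersion === "4.4" && syncSourceVersion === "4.4";
  return {
    // Query phase of the CollectionCloner: needs 4.4 on both sides.
    collectionQueryPhase: both44,
    // Resuming after a collection rename: needs 4.4 on both sides.
    afterCollectionRename: both44,
    // Transient failures in other commands: needs a 4.4 syncing node
    // (assumed false when the syncing node is 4.2).
    otherCommands: syncingNodeVersion === "4.4",
  };
}
```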

Diagnosis/Debuggability

During initial sync, the sync source node can become unavailable (due to either a network failure or a process restart), and the initial sync is still able to resume and complete.
The log messages to expect cover the following events, in order:
  • Initial Sync attempt successfully started
  • Messages caused by network failures (or a sync source node restart)
  • Initial Sync is resumed after being interrupted
  • Data cloners resume
  • Data cloning phase completes successfully and the oplog cloning phase starts
  • Initial Sync completes successfully and statistics are provided
The new initial sync statistics from replSetGetStatus.initialSyncStatus can be useful for reviewing the progress of an initial sync.
Starting in MongoDB 4.2.1, replSetGetStatus.initialSyncStatus metrics are only available when the command is run on a member during its initial sync (i.e., in STARTUP2 state).
The new metrics, reported for each initial sync attempt in replSetGetStatus.initialSyncStatus.initialSyncAttempts, are:
  • totalTimeUnreachableMillis
    - The total time in milliseconds that the member has been unavailable during the current initial sync.
  • operationsRetried
    - Total number of all operation retry attempts.
  • rollBackId
    - The sync source's rollback identifier at the start of the initial sync attempt.
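To pull just these three fields out of a status document, something like the following could be used. It is a sketch: summarizeInitialSync is a hypothetical helper of mine, it assumes the document shape described above, and it relies on the fact that a myState value of 5 corresponds to STARTUP2.

```javascript
const STARTUP2 = 5; // replSetGetStatus.myState value for the STARTUP2 state

// Hypothetical helper: extracts the resumable initial sync metrics
// from a replSetGetStatus result document.
function summarizeInitialSync(status) {
  const inInitialSync = status.myState === STARTUP2;
  const sync = status.initialSyncStatus;
  if (!sync) {
    // No metrics: the member is most likely not in initial sync.
    return { inInitialSync, attempts: [] };
  }
  const attempts = (sync.initialSyncAttempts || []).map((attempt) => ({
    totalTimeUnreachableMillis: attempt.totalTimeUnreachableMillis,
    operationsRetried: attempt.operationsRetried,
    rollBackId: attempt.rollBackId,
  }));
  return { inInitialSync, attempts };
}

// `db` only exists inside the MongoDB shell; run this on the syncing member.
if (typeof db !== "undefined") {
  printjson(summarizeInitialSync(db.adminCommand({ replSetGetStatus: 1 })));
}
```

An attempts array in which operationsRetried keeps growing while the member stays in STARTUP2 would suggest that the sync source is still unreachable and the retry period is being consumed.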

Wrap Up

Upgrade your MongoDB database to v4.4 and take advantage of the new Resumable Initial Sync feature. Initial syncs in your deployment will now survive transient network errors or a sync source restart.
If you have questions, please head to our developer community website, where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.
