Errors during initial sync oplog application

Hello, after having had to put a member of my replica set (1 primary, 2 secondaries; version 4.2) through an initial sync, I found this error in the log:

I INITSYNC [replication-1] Finished cloning data: OK. Beginning oplog replay.
I INITSYNC [replication-1] Writing to the oplog and applying operations until {  } before initial sync can complete.
E REPL [repl-writer-worker-24] failed to apply update:
I REPL [repl-writer-worker-24] Fetching missing document:
I REPL [repl-writer-worker-24] Missing document not found on source; presumably deleted later in oplog.

This error (`failed to apply update`) appears about 18,000 times after the cloning stage and is always followed by the other two messages. I would like your opinion on it. Some additional info:

The sync took a little over 25 hours and copied about 1TB of data. I should note that the newly synced member ended up with a data directory 300GB smaller than the other two, despite having the same number of objects, average object size and overall data size as the other two members (per db.stats() on the main database). I’m assuming this is because the other two members have a lot of empty space in their data directories and that space wasn’t copied during the sync. The set seems healthy: the member has become a secondary and isn’t behind on replication.

I believe the ‘failed to apply update’ errors appear because, during the sync, the target member buffered the source’s oplog entries for, say, collection X before actually cloning that collection. So when it started applying the oplog, some of the updates it applied were already obsolete: the documents they referred to had been deleted on the sync source before they could be cloned.

Thus I think these errors can be ignored, but I’m not 100% certain. I hope that someone has seen these errors before and has additional info about them, or knows more about the initial sync process.

I would also appreciate if someone could go into detail about how the oplog is buffered and then applied during the sync, and if the size of the oplog really matters that much during this process. I initially thought it did but recently learned that the source’s oplog is tailed during the cloning phase (into the temp_oplog_buffer collection maybe?).

Thank you

Hi @Marco_101 and welcome to MongoDB community forums!!

Firstly, MongoDB 4.2 is quite outdated now: it reached end of life in April 2023 and will not receive any further updates. Therefore, I would recommend upgrading the deployment to the latest version for new features and bug fixes.

The issue you are observing has been seen in the past with older versions: https://jira.mongodb.org/browse/SERVER-18721.
The recommendation would be to use rs.add() to add a new member to the replica set configuration.
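For example, you could remove the stale member and re-add it from the primary with rs.add(). This is only a sketch; the hostname/port below is a placeholder, and setting priority/votes to 0 during the sync is an optional precaution:

```javascript
// Run in the mongo shell / mongosh, connected to the PRIMARY.
// "mongodb3.example.net:27017" is a placeholder -- use your member's address.

// Optional: remove the member you want to re-sync.
rs.remove("mongodb3.example.net:27017")

// Re-add it with an empty data directory; this triggers a fresh initial sync.
// priority: 0, votes: 0 keeps it out of elections while it catches up.
rs.add({ host: "mongodb3.example.net:27017", priority: 0, votes: 0 })

// Once the member is caught up, restore its priority/votes with rs.reconfig().
```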

As mentioned in the MongoDB documentation for initial sync,

Clones all databases except the local database

which might explain why you are not seeing the same volume of data on disk.
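To confirm that the size difference is just reclaimed free space rather than missing data, you can compare the logical and on-disk sizes on each member. A minimal sketch (the database name "mydb" is a placeholder; on a 4.2 secondary you may need rs.slaveOk(), or rs.secondaryOk() in newer shells, before reading):

```javascript
// Run on each member in the mongo shell / mongosh.
const s = db.getSiblingDB("mydb").stats()  // "mydb" is a placeholder
printjson({
  objects: s.objects,         // should match across members
  dataSize: s.dataSize,       // logical size of the data -- should match
  storageSize: s.storageSize  // on-disk size -- often much smaller on a freshly synced member
})
```

Matching `objects` and `dataSize` with a smaller `storageSize` would support your theory that the older members simply carry free space that a fresh sync does not copy.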

Prior to MongoDB version 4.4, as mentioned in the legacy documentation for the oplog, the oplog can grow beyond its configured size to avoid deleting the majority commit point.
You might also want to take a look at the scenarios that can result in a larger oplog size, for a better understanding.

However, in MongoDB version 4.4 and above, the oplog has a minimum retention period, and an oplog entry is removed only if:

  • The oplog has reached the maximum configured size, and
  • The oplog entry is older than the configured number of hours.
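On 4.4+, the retention window can be set at startup with the storage.oplogMinRetentionHours configuration option or adjusted at runtime with the replSetResizeOplog admin command. A sketch, with illustrative values only:

```javascript
// Run in mongosh against a 4.4+ member (size in MB, retention in hours --
// the values below are examples, not recommendations).
db.adminCommand({ replSetResizeOplog: 1, size: 16000, minRetentionHours: 24 })

// Inspect the current oplog size and replication window:
db.getSiblingDB("local").oplog.rs.stats().maxSize
rs.printReplicationInfo()
```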

Prior to MongoDB version 4.4, oplog entries are retained until the majority commit point is reached, and the oplog otherwise operates in a rolling fashion.

I would recommend upgrading MongoDB and then performing the initial sync again.
Please don’t hesitate to reach out if you are still facing the issue after the version upgrade.

Warm regards
Aasawari

Hello @Aasawari, first of all thank you very much for responding and for the info on the oplog. I learned a couple of things I didn’t know.

The issue that you are observing has been seen in the past with older versions, https://jira.mongodb.org/browse/SERVER-18721.

This is very interesting. I looked at the source’s log during the fetching of missing documents and indeed there was a scary amount of connections being opened during that small window of time.

One more thing: can you tell me whether errors such as ‘failed to apply update’ are normal during this phase of the sync? Since they are followed by ‘Missing document not found on source; presumably deleted later in oplog’, can I assume that the data that was cloned had already undergone these updates and was subsequently deleted on the sync source?

Thank you.