Mongo replication stalls

Fory_Horio · March 15, 2024, 6:06pm

A few weeks ago, we upgraded our sharded, replicated MongoDB environment from 4.2 to 4.4.28. Since the upgrade, replication has been stalling all of a sudden on our main production environment.
I am seeing this error:
{"t":{"$date":"2024-03-13T14:38:13.265+00:00"},"s":"I", "c":"REPL", "id":21275, "ctx":"ReplCoordExtern-1","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1710328506,"i":1899}},"t":99},"attemptsRemaining":1,"error":"CursorNotFound: Error while getting the next batch in the oplog fetcher :: caused by :: cursor id 4877507614192920788 not found"}}
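
In case it helps anyone reading the log line above, here is a minimal Python sketch that pulls out the interesting fields and compares the log time with `lastOpTimeFetched`. The helper name `summarize_fetcher_error` is made up for illustration, and the typographic quotes from the forum rendering are replaced with plain ASCII quotes so the line parses as JSON. The sketch assumes the `t` field of `$timestamp` is seconds since the Unix epoch.

```python
import json
from datetime import datetime, timezone

# Log line from the post, with typographic quotes replaced by ASCII quotes.
LOG_LINE = '{"t":{"$date":"2024-03-13T14:38:13.265+00:00"},"s":"I", "c":"REPL", "id":21275, "ctx":"ReplCoordExtern-1","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1710328506,"i":1899}},"t":99},"attemptsRemaining":1,"error":"CursorNotFound: Error while getting the next batch in the oplog fetcher :: caused by :: cursor id 4877507614192920788 not found"}}'


def summarize_fetcher_error(line: str) -> dict:
    """Extract the fields relevant to an oplog fetcher error (hypothetical helper)."""
    entry = json.loads(line)
    attr = entry["attr"]
    # Wall-clock time at which the secondary wrote the log entry.
    logged_at = datetime.fromisoformat(entry["t"]["$date"])
    # Assumption: "t" inside $timestamp is seconds since the Unix epoch.
    seconds = attr["lastOpTimeFetched"]["ts"]["$timestamp"]["t"]
    last_fetched_at = datetime.fromtimestamp(seconds, tz=timezone.utc)
    return {
        "logged_at": logged_at,
        "last_fetched_at": last_fetched_at,
        "gap_seconds": (logged_at - last_fetched_at).total_seconds(),
        "attempts_remaining": attr["attemptsRemaining"],
        # The error string starts with the error name, e.g. "CursorNotFound".
        "error_name": attr["error"].split(":", 1)[0],
    }


summary = summarize_fetcher_error(LOG_LINE)
print(summary["error_name"], summary["last_fetched_at"].isoformat(), summary["gap_seconds"])
```

For this line the last fetched op time comes out as 11:15:06 UTC, about 3 hours 23 minutes before the log entry was written. The log line alone does not say why the gap is that large.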

After some research, we upgraded Mongo from 4.4.28 to 4.4.29 two days ago, based on the report in

https://jira.mongodb.org/browse/SERVER-70155

That ticket claims the bug is fixed. But this morning, at 4 AM my time, replication stalled again. The issue is marked as fixed, but I don’t think it is.

Fory_Horio · March 15, 2024, 6:11pm

FYI, I was running “compact” against a 3 TB collection. Would that cause the problem? Our Mongo 4.4.28 replication had stalled a few times even when we weren’t running “compact”.
This happens on a secondary, and once replication stalls, CPU and disk usage drop to almost nil.

Kobe_W · March 15, 2024, 8:57pm

Looks like the secondary is not able to continue replicating from the source machine’s oplog, e.g. because the entry it needs has already been removed from the source’s oplog.

Compact is not supposed to insert entries into the oplog (I may be wrong though).

Is your replication source machine handling a lot of writes? Is the oplog window big enough? There are options to configure the oplog retention period.
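
For the retention option, a rough sketch of what I mean: as far as I recall, 4.4 added a `minRetentionHours` field to the `replSetResizeOplog` admin command (and a matching `storage.oplogMinRetentionHours` config setting), and `rs.printReplicationInfo()` in the shell shows the current window. The Python below only builds the command document. The helper name and the 24-hour value are made up for illustration, so please check the docs for your exact version before running anything.

```python
from typing import Optional


def build_resize_oplog_command(min_retention_hours: float,
                               size_mb: Optional[float] = None) -> dict:
    """Build a replSetResizeOplog command document (hypothetical helper)."""
    if min_retention_hours < 0:
        raise ValueError("min_retention_hours must be >= 0")
    # The command name has to be the first key; dicts keep insertion order.
    cmd = {"replSetResizeOplog": 1}
    if size_mb is not None:
        cmd["size"] = float(size_mb)  # oplog size in megabytes
    cmd["minRetentionHours"] = float(min_retention_hours)
    return cmd


# Illustrative value only: keep at least 24 hours of oplog entries.
cmd = build_resize_oplog_command(24)
print(cmd)

# With pymongo (not imported here), the document would be sent to the member
# whose oplog should change, roughly like this:
#   MongoClient(uri).admin.command(cmd)
```

The command applies to the member it is run on, so it would need to go to whichever member the secondary syncs from.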

Is this “stall” issue happening in peak traffic time or low traffic time?

Fory_Horio · March 15, 2024, 9:03pm

We have constant traffic 24/7. There are a lot of writes, but the load is constant. The stall does not happen when Mongo is especially busy; it just happens all of a sudden. We have about 22 hours of oplog. When I restart the mongo process, replication starts again. So we weren’t falling outside the oplog window.
Again, this wasn’t an issue when we were running on 4.2 for 1-2 years; before that we were on 3.6, before that on 3.2, then 2.8. The server has been running for 6-7 years.
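
To spell out the oplog window reasoning, here is a small Python sketch of the check I have in mind. The function name is made up, the 22 hours is our approximate window, and I am assuming the source's newest oplog entry is about as recent as the time the error was logged (14:38:13 UTC on 2024-03-13), which I have not verified.

```python
OPLOG_WINDOW_SECONDS = 22 * 3600  # "about 22 hours" of oplog on the source


def still_in_oplog_window(last_fetched_ts: int,
                          source_newest_ts: int,
                          window_seconds: int = OPLOG_WINDOW_SECONDS) -> bool:
    """Return True if the secondary's last fetched op should still be in the source oplog."""
    # Oldest entry the source still holds, given its window.
    source_oldest_ts = source_newest_ts - window_seconds
    return last_fetched_ts >= source_oldest_ts


last_fetched = 1710328506   # lastOpTimeFetched "t" from the log line
source_newest = 1710340693  # assumption: log time, 2024-03-13T14:38:13 UTC

print(still_in_oplog_window(last_fetched, source_newest))
```

Under those assumptions, the secondary is only a few hours behind and well inside the 22-hour window, which matches replication resuming after a restart.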