Mongo replication stalls

A few weeks ago, we upgraded our MongoDB sharded replication environment from 4.2 to 4.4.28. After the upgrade, replication on our main production cluster started stalling all of a sudden.
I am seeing this error:
{"t":{"$date":"2024-03-13T14:38:13.265+00:00"},"s":"I", "c":"REPL", "id":21275, "ctx":"ReplCoordExtern-1","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1710328506,"i":1899}},"t":99},"attemptsRemaining":1,"error":"CursorNotFound: Error while getting the next batch in the oplog fetcher :: caused by :: cursor id 4877507614192920788 not found"}}
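A quick way to check whether the lastOpTimeFetched from that log line still falls inside the source's oplog window (a rough sketch for the legacy mongo shell run against the replication source; Timestamp field access may differ in mongosh):

```javascript
// Run against the replication source in the mongo shell.
// Compares the oldest oplog entry's timestamp with the
// lastOpTimeFetched reported in the error above.
var oplog = db.getSiblingDB("local").oplog.rs;
var first = oplog.find().sort({ $natural: 1 }).limit(1).next().ts;
var last = oplog.find().sort({ $natural: -1 }).limit(1).next().ts;
// .t is seconds since the epoch on a BSON Timestamp in the legacy shell
print("oplog window (hours): " + ((last.t - first.t) / 3600).toFixed(1));
var fetched = Timestamp(1710328506, 1899); // values from the log line above
print("fetched optime still in oplog: " + (fetched.t >= first.t));
```

If the fetched optime is older than the first oplog entry, the secondary has fallen off the oplog and would need an initial sync; if it is still inside the window, the stall has some other cause.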

After some research, we upgraded MongoDB from 4.4.28 to 4.4.29 two days ago based on the report described in

which claims that the bug is fixed. But this morning, at 4 AM my time, replication stalled again. The issue is marked as fixed, but I don't think it is.

FYI, I was running "compact" against a 3 TB collection. Could that cause the problem? Our 4.4.28 replication had stalled a few times even when we weren't running "compact".
This happens on a secondary, and once replication stalls, CPU and disk usage drop to almost nil.

It looks like the secondary is unable to continue replicating from the source machine's oplog, e.g. the entry it needs has already been removed from the source's oplog.

Compact is not supposed to cause any insertions into the oplog (I may be wrong, though).
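One way to sanity-check that (a sketch for the mongo shell on the source; compact is a local, non-replicated maintenance command, so this query is expected to return nothing):

```javascript
// Look for any replicated "compact" command document in the oplog.
db.getSiblingDB("local").oplog.rs
  .find({ op: "c", "o.compact": { $exists: true } })
  .limit(5)
  .forEach(printjson);
```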

Is your replication source machine handling a lot of writes? Is the oplog window big enough? There are options to configure the oplog retention period.
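For reference, the retention knobs I mean look roughly like this (run with admin privileges on a 4.4+ node; the 51200 MB and 48 h values are only illustrative):

```javascript
// Grow the oplog; size is given in megabytes for this command.
db.adminCommand({ replSetResizeOplog: 1, size: 51200 }); // ~50 GB
// 4.4+ can also enforce a time-based minimum retention:
db.adminCommand({ replSetResizeOplog: 1, minRetentionHours: 48 });
// Verify the new configured maximum:
db.getSiblingDB("local").oplog.rs.stats().maxSize;
```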

Is this "stall" issue happening during peak traffic or at low-traffic times?

We have constant traffic 24/7. There are a lot of writes and the cluster is constantly busy, but the stall doesn't happen when Mongo is at its busiest; it just happens all of a sudden. We have about 22 hours of oplog. When I restart the mongod process, replication resumes, so we weren't falling outside the oplog window.
Again, this wasn't an issue when we were running 4.2 for 1-2 years; before that we were on 3.6, before that 3.2, then 2.8. The server has been running for 6-7 years.

We are still having this problem. Replication starts to fail 2-3 times a week, and restarting mongod fixes it. I am seeing this in the log, if that helps:

{"t":{"$date":"2024-05-05T17:45:48.438+00:00"},"s":"I", "c":"COMMAND", "id":20499, "ctx":"ftdc","msg":"serverStatus was very slow","attr":{"timeStats":{"after basic":0,"after asserts":0,"after connections":0,"after electionMetrics":0,"after extra_info":0,"after featureCompatibilityVersion":0,"after flowControl":0,"after globalLock":0,"after indexBulkBuilder":0,"after locks":0,"after logicalSessionRecordCache":0,"after mirroredReads":0,"after network":0,"after opLatencies":0,"after opReadConcernCounters":0,"after opcounters":0,"after opcountersRepl":0,"after oplog":0,"after oplogTruncation":0,"after repl":0,"after scramCache":0,"after security":0,"after shardingStatistics":0,"after storageEngine":0,"after tcmalloc":0,"after trafficRecording":0,"after transactions":0,"after transportSecurity":0,"after twoPhaseCommitCoordinator":0,"after wiredTiger":1438,"at end":1438}}}
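That "after wiredTiger":1438 suggests the slow part of serverStatus was collecting WiredTiger statistics. If it helps anyone, a rough way to watch the cache counters that tend to correlate with such stalls (a sketch for the mongo shell on the affected secondary; the stat names come from db.serverStatus().wiredTiger and exact names can vary between versions):

```javascript
// Run on the affected secondary in the mongo shell.
var cache = db.serverStatus().wiredTiger.cache;
print("cache bytes in use : " + cache["bytes currently in the cache"]);
print("cache max bytes    : " + cache["maximum bytes configured"]);
print("tracked dirty bytes: " + cache["tracked dirty bytes in the cache"]);
```

Dirty bytes pinned near the cache maximum while a 3 TB compact runs would at least point at cache pressure rather than the oplog window.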