Secondaries randomly stop syncing

Over the past month or so, our sharded MongoDB (v3.2) cluster has had issues with some of our secondary replicas randomly stopping replication.

Our cluster is 3x3: three shards running across three machines. Each machine runs the primary replica for one shard and a secondary for each of the other two shards, as well as a config server replica. We have about 2.2TB of data at the moment (replicated three times, i.e. each machine holds about 2.2TB across its primary plus two secondaries).

The primary symptom is that a secondary simply stops replicating: its optimeDate in rs.status() freezes at a given time. Sometimes restarting the secondary kicks it back to functioning properly, but typically the restart just bumps the optimeDate forward a little (presumably corresponding to one batch from the oplog) before it halts again. Before too long, of course, the secondary drops too far behind the primary and becomes unusable. While in this state the secondary also shows fairly high CPU usage (even though it doesn't appear to actually be replicating anything), typically around 2-3x what our secondaries normally run, close to what the primaries run.
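
For reference, this is roughly how I've been watching the lag from the mongo shell; it's just a quick loop over rs.status(), and the stuck secondary's optimeDate simply stops advancing while the primary's keeps moving:

// Run in the mongo shell against the affected replica set (a rough sketch of what I check).
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  optimeDate: " + m.optimeDate);
});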

Here’s the last bit of logs from the last time this happened:

2021-07-25T22:04:23.705+0000 I COMMAND  [conn68] command local.oplog.rs command: getMore { getMore: 14911775621, collection: "oplog.rs", maxTimeMS: 5000, term: 157, lastKnownCommittedOpTime: { ts: Timestamp 1627250639000|149, t: 157 } } cursorid:14911775621 keyUpdates:0 writeConflicts:0 exception: operation exceeded time limit code:50 numYields:0 reslen:126 locks:{ Global: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 7211294 } }, Database: { acquireCount: { r: 1 } }, oplog: { acquireCount: { r: 1 } } } protocol:op_command 7352ms
2021-07-25T22:04:42.992+0000 I NETWORK  [conn67] end connection 10.240.0.111:40148 (10 connections now open)
2021-07-25T22:04:51.447+0000 I NETWORK  [conn26] end connection 10.240.0.111:35184 (9 connections now open)
2021-07-25T22:05:04.939+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:05:12.208+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:05:17.179+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 4972ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:06:40.799+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:07:39.384+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:07:39.431+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 50ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:09:17.637+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:09:47.330+0000 I NETWORK  [conn68] end connection 10.240.0.111:40302 (8 connections now open)
2021-07-25T22:10:56.862+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:10:56.882+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 20ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:11:56.966+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:12:04.575+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:12:04.607+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 32ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:13:04.607+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:13:34.753+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:13:34.828+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 75ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:14:34.827+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:15:00.168+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:15:00.243+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 76ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:16:00.247+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:16:21.278+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:16:21.344+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 66ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:17:43.956+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:17:55.600+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:17:55.610+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 10ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:21:34.256+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:21:50.873+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:21:50.884+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 12ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:22:50.884+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open
2021-07-25T22:24:12.166+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Connecting to mongodb-production-coredata-shard-3:27023
2021-07-25T22:24:12.183+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Successfully connected to mongodb-production-coredata-shard-3:27023, took 17ms (2 connections now open to mongodb-production-coredata-shard-3:27023)
2021-07-25T22:25:46.021+0000 I ASIO     [NetworkInterfaceASIO-BGSync-0] Ending idle connection to host mongodb-production-coredata-shard-3:27023 because the pool meets constraints; 1 connections to that host remain open

The timeout on the oplog query (presumably due to either a network hiccup or excessive load) roughly coincides with the time replication halted. Shouldn't the secondary then retry? As you can see, there are no further oplog-related log entries; I actually shut down this instance a few hours later, and there were no logs at all for the last couple of hours. I'm also not sure what to make of the "Ending idle connection" messages. They may not be related, since we've been getting them lately even on functioning secondaries, but I'm fairly sure they're new; I don't recall seeing them (at least not to this degree) in the past.
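
For what it's worth, one thing I plan to try the next time this happens is raising the replication log verbosity on the affected secondary, in case the oplog fetcher logs something more useful; roughly:

// In the mongo shell on the stuck secondary:
db.setLogLevel(2, "replication")   // more detail from the oplog fetcher / applier
// ...wait for (or reproduce) the hang, then revert:
db.setLogLevel(0, "replication")   // back to the default verbosity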

Is there perhaps a known issue (possibly with 3.2 specifically) related to oplog resumption? I searched for anything like this but couldn't find anything that looked quite the same. If not, where should I start looking to continue investigating? Any advice would be much appreciated.

Welcome to the MongoDB Community Forums @Nick_Shelton!

I don’t see mention of a specific release version of MongoDB 3.2, but if you aren’t running the final 3.2.22 release I would definitely upgrade to it. Minor releases within the same production release series (e.g. 3.2.x) do not introduce any backward-breaking compatibility changes.
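
If you aren't sure of the exact patch release on each member, you can check it from the mongo shell, for example:

// In the mongo shell, connected to each member:
db.version()                               // e.g. "3.2.22"
// or equivalently via the buildInfo command:
db.adminCommand({ buildInfo: 1 }).version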

MongoDB 3.2 reached End of Life (EOL) in 2018 and is no longer maintained or supported. There have been many changes in successive major releases since 3.2’s debut in 2015, but the behaviour you describe is not a common issue that I’m aware of in modern versions of MongoDB.

I strongly recommend planning an upgrade to a supported version of MongoDB (currently 4.0 or later) as there have been significant improvements in features & stability.

Regards,
Stennie