Production Environment Faults:failed to get connections from primary

We run 3 shards of each 3 “mongod” servers each:
image
MongoDB version: 4.2.10
Docker version: 19.03.8, build afacb8b7f0
OS: Ubuntu 18.04.4

Everything worked well but some days ago, a failure occurred: witout any sign , mongo-1(primay) could’t return connections to clients . The logs of mongos and Secondary printed lots of “NetworkInterfaceExceededTimeLimit” exception.

The logs of mongo-1(primary), during the problem, it only printed logs about connection:

2021-09-29T10:08:55.921+08:00 2021-09-29T02:08:55.921+0000 I NETWORK [listener] connection accepted from ip:57192 #42365 (178 connections now open)
2021-09-29T10:08:55.968+08:00 2021-09-29T02:08:55.968+0000 I NETWORK [listener] connection accepted from ip:36974 #42366 (179 connections now open)
2021-09-29T10:08:55.968+08:00 2021-09-29T02:08:55.968+0000 I ACCESS [conn42359] Successfully authenticated as principal __system on local from client ip:36962
2021-09-29T10:08:55.969+08:00 2021-09-29T02:08:55.968+0000 I NETWORK [listener] connection accepted from ip:36976 #42367 (180 connections now open)
2021-09-29T10:08:55.969+08:00 2021-09-29T02:08:55.969+0000 I NETWORK [conn42366] received client metadata from ip:36974 conn42366: { driver: { name: “NetworkInterfaceTL”, version: “4.2.10” }, os: { type: “Linux”, name: “Ubuntu”, architecture: “x86_64”, version: “18.04” } }

2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [listener] connection accepted from ip:41604 #59742 (16272 connections now open)
2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [listener] connection accepted from ip:41602 #59741 (16271 connections now open)
2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [conn59741] received client metadata from ip:41602 conn59741: { driver: { name: “NetworkInterfaceTL”, version: “4.2.10” }, os: { type: “Linux”, name: “Ubuntu”, architecture: “x86_64”, version: “18.04” } }

It kept receiving connections form clients but could’t handle them.
I tried to connect to mongod with MongoDB shell by ‘docker exec -it mongodb-shard mongo --port 27018’ but it doesn’t work, could’t connect to it.

Logs of mongo-rs-1 and mongo-rs-sec-1 (secondary) :

2021-09-29T10:09:05.217+08:00 2021-09-29T02:09:05.217+0000 I REPL [replication-1505] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: Request 30396207 timed out, deadline was 2021-09-29T02:09:05.207+0000, op was RemoteCommand 30396207 – target:[mongo-1:27018] db:local expDate:2021-09-29T02:09:05.207+0000 cmd:{ getMore: 6531183952488528211, collection: “oplog.rs”, batchSize: 13981010, maxTimeMS: 5000, term: 81, lastKnownCommittedOpTime: { ts: Timestamp(1632881335, 6), t: 81 } }. Last fetched optime: { ts: Timestamp(1632881335, 6), t: 81 }. Restarts remaining: 1
2021-09-29T10:09:07.818+08:00 2021-09-29T02:09:07.818+0000 I REPL [replication-1505] Error returned from oplog query (no more query restarts left): MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
2021-09-29T10:09:07.820+08:00 2021-09-29T02:09:07.820+0000 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
2021-09-29T10:09:10.317+08:00 2021-09-29T02:09:10.317+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-1:27018: InvalidSyncSource: Sync source was cleared. Was mongo-1:27018
2021-09-29T10:09:37.822+08:00 2021-09-29T02:09:37.822+0000 I REPL [replication-1504] Blacklisting mongo-1:27018 due to error: ‘NetworkInterfaceExceededTimeLimit: Request 30396224 timed out, deadline was 2021-09-29T02:09:37.822+0000, op was RemoteCommand 30396224 – target:[mongo-1:27018] db:local expDate:2021-09-29T02:09:37.822+0000 cmd:{ find: “oplog.rs”, limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 } }’ for 10s until: 2021-09-29T02:09:47.822+0000

Logs of mongos:

2021-09-29T10:09:28.351+08:00 2021-09-29T02:09:28.350+0000 I COMMAND [conn63809] command tr.Client command: find { find: “Client”, filter: { applicationId: 5836984, clientId: “139226756”, clientType: “LegacyQualifiedWorker” }, limit: 2, $db: “tr”, $clusterTime: { clusterTime: Timestamp(1632881348, 2), signature: { hash: BinData(0, EE561BD26B273CFB69233B17D400D94C165D433B), keyId: 6981363876726046721 } }, lsid: { id: UUID(“100a0796-a4f2-40d7-b70c-6d2aeb9766be”) }, $readPreference: { mode: “primaryPreferred” } } numYields:0 ok:0 errMsg:“Encountered non-retryable errorduring query :: caused by :: Couldn’t get a connection within the time limit” errName:NetworkInterfaceExceededTimeLimit errCode:202 reslen:342 protocol:op_msg 20067ms

It showed a lot of NetworkInterfaceExceededTimeLimit error.

I connected to mongo-rs-1 and rs.status() showed that all members of replicaSet were online, one PRIMARY and two SECONDARYs.

I restarted mongo-1 and it recovered for a while , then it went wrong again. About 20 minutes later I restarted mongo-1 again and it revovered completely this time, everything worked well, no more error.

Was it a network problem or something else? I have no idea what happened in mongo-1. Could anyone gives some advice?

Hi @Yi_Liang and welcome in the MongoDB Community :muscle: !

Humm looks like something was wrong in the Network indeed. But I’m wondering if you have enough RAM on this system? Each connections required about 1MB of RAM. If you don’t have enough, this might create this kind of unexpected behaviour I think.

That’s the only idea I have :confused:. I’m glad it’s back in shape now.
Can you share some information about your sizing maybe? Like total DB size, total index sizes, RAM, data distribution, …

Cheers,
Maxime.

Thank you

Some information of servers and db:
mongo-1: 64-core , 128 GB RAM
mongo-rs-1: 16-core , 32 GB RAM
mongo-rs-sec-1: 16-core , 32 GB RAM

mongo-1 memory used:

db.stats():

{
	"raw" : {
		"shardRS0/mongo-0:27018,mongo-rs-0:27018,mongo-rs-sec-0:27018" : {
			"db" : "turnright",
			"collections" : 52,
			"views" : 0,
			"objects" : 210884012,
			"avgObjSize" : 5346.779377935963,
			"dataSize" : 1127550286498,
			"storageSize" : 247213907968,
			"numExtents" : 0,
			"indexes" : 225,
			"indexSize" : 42863820800,
			"scaleFactor" : 1,
			"fsUsedSize" : 333694676992,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		},
		"shardRS2/mongo-2:27018,mongo-rs-2:27018,mongo-rs-sec-2:27018" : {
			"db" : "turnright",
			"collections" : 53,
			"views" : 0,
			"objects" : 153948239,
			"avgObjSize" : 7035.0821376592685,
			"dataSize" : 1083038506313,
			"storageSize" : 240862732288,
			"numExtents" : 0,
			"indexes" : 228,
			"indexSize" : 39764934656,
			"scaleFactor" : 1,
			"fsUsedSize" : 324289576960,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		},
		"shardRS1/mongo-1:27018,mongo-rs-1:27018,mongo-rs-sec-1:27018" : {
			"db" : "turnright",
			"collections" : 94,
			"views" : 0,
			"objects" : 167557765,
			"avgObjSize" : 6594.570785054337,
			"dataSize" : 1104971541878,
			"storageSize" : 245721223168,
			"numExtents" : 0,
			"indexes" : 307,
			"indexSize" : 41785835520,
			"scaleFactor" : 1,
			"fsUsedSize" : 332308983808,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		}
	},
	"objects" : 532390016,
	"avgObjSize" : 6227.178557621562,
	"dataSize" : 3315560334689,
	"storageSize" : 733797863424,
	"numExtents" : 0,
	"indexes" : 760,
	"indexSize" : 124414590976,
	"scaleFactor" : 1,
	"fileSize" : 0,
	"ok" : 1,
	"operationTime" : Timestamp(1634175643, 14),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1634175643, 14),
		"signature" : {
			"hash" : BinData(0,"gBtl/mH5w+1a3LzyX7cpprxQuJM="),
			"keyId" : 7007827695677472770
		}
	}
}

All the members of a RS should be identical. It’s a good practice to ensure that you are actually HA and that any node can handle the workload.

Also, remember all the nodes are doing the write operations (using & maintaining indexes, etc). The secondaries work almost as much as the primary - especially if you are using readPreference options like nearest or SecondaryPreferred.

If the secondaries start to lag being the primary because they can’t keep up, this could be a big problem for the cluster because you won’t be able to use readConcern or writeConcern “majority” and this will start building cache pressure on the primary as the majority commit point will lag behind.

Do you have a single DB? Look like the DB turnright already has indexes for about 39 to 42 GB. Given that indexes need to be in memory, that’s already too much to handle for the secondaries with less RAM.

RAM = OS RAM + Connections + Indexes + Working Set + extra RAM for queries (aggregation, in memory sort, everything else). Looks like your “main” node has enough for the quantity of data I see here but the secondaries are too small.

I’m assuming that mongo-0, mongo-1 and mongo-2 are config with priority = 2 so they have a greater chance to be primary than the 2 other nodes.

I’m not sure if it’s possible or not but maybe there isn’t enough RAM on the secondaries to accept new connections from the primary when something needs to be done (replication, etc) as the secondaries are overloaded.

Cheers,
Maxime.

1 Like

I found something.
I searched the logs with keyword ‘command’, and before 02:08 , mongo-1 printed ‘command’ logs normally , such as :

After 02:08 no more ‘command’ logs printed until 02:12 , the logs showed that all the operations were interrupted:

2021-09-29T02:12:17.831+0000 I COMMAND [conn46920] command local.$cmd command: find { find: "oplog.rs", limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1632881507, 5), signature: { hash: BinData(0, E6081E7F2BCB714FEFCDFDC93D95F32DCBB8F924), keyId: 6981363876726046721 } }, $db: "local" } numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 29999ms
...
2021-09-29T02:12:45.478+0000 I  COMMAND  [conn48365] command turnright.$cmd command: find { find: "Legacy", filter: { $and: [ { _id  *********  ), t: 40 } }, $db: "turnright" } numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 15999ms
...
2021-09-29T02:12:45.818+0000 I  COMMAND  [conn44043] command turnright.$cmd command: find { find: "Identity", filter: { legacyId: 134089364 **********} numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 165999ms

Unable to acquire lock:

2021-09-29T02:09:16.733Z	command turnright.Bill command: find { find: "Bill", filter: { orderId: 12766990}, limit: 1, singleBatch: true, $db: "turnright", $clusterTime: { clusterTime: Timestamp(1632881337, 8), signature: { hash: BinData(0, B29AC9F8A2F54C96E1B7AAE0671F7A2DA2740E2C), keyId: 6981363876726046721 } }, lsid: { id: UUID("915c8453-1e57-448f-8ff5-e89cf281709b") }, txnNumber: 900, startTransaction: true, autocommit: false } numYields:0 ok:0 errMsg:"Encountered non-retryable error during query :: caused by :: Unable to acquire lock '{9934848871295096691: Database, 711476834440320883, turnright}' within 5ms' milliseconds." errName:LockTimeout errCode:24 reslen:437 protocol:op_msg 19422ms
2021-09-29T02:09:25.686Z	command turnright.Bill command: find { find: "Bill", filter: { orderId: 12766990}, limit: 1, singleBatch: true, $db: "turnright", $clusterTime: { clusterTime: Timestamp(1632881337, 7), signature: { hash: BinData(0, B29AC9F8A2F54C96E1B7AAE0671F7A2DA2740E2C), keyId: 6981363876726046721 } }, lsid: { id: UUID("08e68564-7c10-4b36-9fe3-8b78e361d303") }, txnNumber: 516, startTransaction: true, autocommit: false } numYields:0 ok:0 errMsg:"Encountered non-retryable error during query :: caused by :: Unable to acquire lock '{9934848871295096691: Database, 711476834440320883, turnright}' within 5ms' milliseconds." errName:LockTimeout errCode:24 reslen:437 protocol:op_msg 28376ms

It seems that all the operations were blocked by something until clients couldn’t wait anymore and disconnected. Could it be a lock or something else?