Production Environment Faults：failed to get connections from primary

Yi_Liang · October 12, 2021, 3:35am

We run 3 shards of each 3 “mongod” servers each：

MongoDB version: 4.2.10
Docker version: 19.03.8, build afacb8b7f0
OS: Ubuntu 18.04.4

Everything worked well but some days ago, a failure occurred: witout any sign , mongo-1(primay) could’t return connections to clients . The logs of mongos and Secondary printed lots of “NetworkInterfaceExceededTimeLimit” exception.

The logs of mongo-1(primary), during the problem, it only printed logs about connection:

2021-09-29T10:08:55.921+08:00 2021-09-29T02:08:55.921+0000 I NETWORK [listener] connection accepted from ip:57192 #42365 (178 connections now open)
2021-09-29T10:08:55.968+08:00 2021-09-29T02:08:55.968+0000 I NETWORK [listener] connection accepted from ip:36974 #42366 (179 connections now open)
2021-09-29T10:08:55.968+08:00 2021-09-29T02:08:55.968+0000 I ACCESS [conn42359] Successfully authenticated as principal __system on local from client ip:36962
2021-09-29T10:08:55.969+08:00 2021-09-29T02:08:55.968+0000 I NETWORK [listener] connection accepted from ip:36976 #42367 (180 connections now open)
2021-09-29T10:08:55.969+08:00 2021-09-29T02:08:55.969+0000 I NETWORK [conn42366] received client metadata from ip:36974 conn42366: { driver: { name: “NetworkInterfaceTL”, version: “4.2.10” }, os: { type: “Linux”, name: “Ubuntu”, architecture: “x86_64”, version: “18.04” } }

2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [listener] connection accepted from ip:41604 #59742 (16272 connections now open)
2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [listener] connection accepted from ip:41602 #59741 (16271 connections now open)
2021-09-29T10:21:05.976+08:00 2021-09-29T02:21:05.976+0000 I NETWORK [conn59741] received client metadata from ip:41602 conn59741: { driver: { name: “NetworkInterfaceTL”, version: “4.2.10” }, os: { type: “Linux”, name: “Ubuntu”, architecture: “x86_64”, version: “18.04” } }

It kept receiving connections form clients but could’t handle them.
I tried to connect to mongod with MongoDB shell by ‘docker exec -it mongodb-shard mongo --port 27018’ but it doesn’t work, could’t connect to it.

Logs of mongo-rs-1 and mongo-rs-sec-1 (secondary) :

2021-09-29T10:09:05.217+08:00 2021-09-29T02:09:05.217+0000 I REPL [replication-1505] Restarting oplog query due to error: NetworkInterfaceExceededTimeLimit: error in fetcher batch callback :: caused by :: Request 30396207 timed out, deadline was 2021-09-29T02:09:05.207+0000, op was RemoteCommand 30396207 – target:[mongo-1:27018] db:local expDate:2021-09-29T02:09:05.207+0000 cmd:{ getMore: 6531183952488528211, collection: “oplog.rs”, batchSize: 13981010, maxTimeMS: 5000, term: 81, lastKnownCommittedOpTime: { ts: Timestamp(1632881335, 6), t: 81 } }. Last fetched optime: { ts: Timestamp(1632881335, 6), t: 81 }. Restarts remaining: 1
2021-09-29T10:09:07.818+08:00 2021-09-29T02:09:07.818+0000 I REPL [replication-1505] Error returned from oplog query (no more query restarts left): MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
2021-09-29T10:09:07.820+08:00 2021-09-29T02:09:07.820+0000 W REPL [rsBackgroundSync] Fetcher stopped querying remote oplog with error: MaxTimeMSExpired: error in fetcher batch callback :: caused by :: operation exceeded time limit
2021-09-29T10:09:10.317+08:00 2021-09-29T02:09:10.317+0000 I REPL [SyncSourceFeedback] SyncSourceFeedback error sending update to mongo-1:27018: InvalidSyncSource: Sync source was cleared. Was mongo-1:27018
2021-09-29T10:09:37.822+08:00 2021-09-29T02:09:37.822+0000 I REPL [replication-1504] Blacklisting mongo-1:27018 due to error: ‘NetworkInterfaceExceededTimeLimit: Request 30396224 timed out, deadline was 2021-09-29T02:09:37.822+0000, op was RemoteCommand 30396224 – target:[mongo-1:27018] db:local expDate:2021-09-29T02:09:37.822+0000 cmd:{ find: “oplog.rs”, limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 } }’ for 10s until: 2021-09-29T02:09:47.822+0000

Logs of mongos:

2021-09-29T10:09:28.351+08:00 2021-09-29T02:09:28.350+0000 I COMMAND [conn63809] command tr.Client command: find { find: “Client”, filter: { applicationId: 5836984, clientId: “139226756”, clientType: “LegacyQualifiedWorker” }, limit: 2, $db: “tr”, $clusterTime: { clusterTime: Timestamp(1632881348, 2), signature: { hash: BinData(0, EE561BD26B273CFB69233B17D400D94C165D433B), keyId: 6981363876726046721 } }, lsid: { id: UUID(“100a0796-a4f2-40d7-b70c-6d2aeb9766be”) }, $readPreference: { mode: “primaryPreferred” } } numYields:0 ok:0 errMsg:“Encountered non-retryable errorduring query :: caused by :: Couldn’t get a connection within the time limit” errName:NetworkInterfaceExceededTimeLimit errCode:202 reslen:342 protocol:op_msg 20067ms

It showed a lot of NetworkInterfaceExceededTimeLimit error.

I connected to mongo-rs-1 and rs.status() showed that all members of replicaSet were online, one PRIMARY and two SECONDARYs.

I restarted mongo-1 and it recovered for a while , then it went wrong again. About 20 minutes later I restarted mongo-1 again and it revovered completely this time, everything worked well, no more error.

Was it a network problem or something else? I have no idea what happened in mongo-1. Could anyone gives some advice?

MaBeuLux88 · October 13, 2021, 2:53pm

Hi @Yi_Liang and welcome in the MongoDB Community !

Humm looks like something was wrong in the Network indeed. But I’m wondering if you have enough RAM on this system? Each connections required about 1MB of RAM. If you don’t have enough, this might create this kind of unexpected behaviour I think.

That’s the only idea I have . I’m glad it’s back in shape now.
Can you share some information about your sizing maybe? Like total DB size, total index sizes, RAM, data distribution, …

Cheers,
Maxime.

Yi_Liang · October 14, 2021, 5:17am

Thank you

Some information of servers and db:
mongo-1: 64-core , 128 GB RAM
mongo-rs-1: 16-core , 32 GB RAM
mongo-rs-sec-1: 16-core , 32 GB RAM

mongo-1 memory used:

db.stats():

{
	"raw" : {
		"shardRS0/mongo-0:27018,mongo-rs-0:27018,mongo-rs-sec-0:27018" : {
			"db" : "turnright",
			"collections" : 52,
			"views" : 0,
			"objects" : 210884012,
			"avgObjSize" : 5346.779377935963,
			"dataSize" : 1127550286498,
			"storageSize" : 247213907968,
			"numExtents" : 0,
			"indexes" : 225,
			"indexSize" : 42863820800,
			"scaleFactor" : 1,
			"fsUsedSize" : 333694676992,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		},
		"shardRS2/mongo-2:27018,mongo-rs-2:27018,mongo-rs-sec-2:27018" : {
			"db" : "turnright",
			"collections" : 53,
			"views" : 0,
			"objects" : 153948239,
			"avgObjSize" : 7035.0821376592685,
			"dataSize" : 1083038506313,
			"storageSize" : 240862732288,
			"numExtents" : 0,
			"indexes" : 228,
			"indexSize" : 39764934656,
			"scaleFactor" : 1,
			"fsUsedSize" : 324289576960,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		},
		"shardRS1/mongo-1:27018,mongo-rs-1:27018,mongo-rs-sec-1:27018" : {
			"db" : "turnright",
			"collections" : 94,
			"views" : 0,
			"objects" : 167557765,
			"avgObjSize" : 6594.570785054337,
			"dataSize" : 1104971541878,
			"storageSize" : 245721223168,
			"numExtents" : 0,
			"indexes" : 307,
			"indexSize" : 41785835520,
			"scaleFactor" : 1,
			"fsUsedSize" : 332308983808,
			"fsTotalSize" : 528311816192,
			"ok" : 1
		}
	},
	"objects" : 532390016,
	"avgObjSize" : 6227.178557621562,
	"dataSize" : 3315560334689,
	"storageSize" : 733797863424,
	"numExtents" : 0,
	"indexes" : 760,
	"indexSize" : 124414590976,
	"scaleFactor" : 1,
	"fileSize" : 0,
	"ok" : 1,
	"operationTime" : Timestamp(1634175643, 14),
	"$clusterTime" : {
		"clusterTime" : Timestamp(1634175643, 14),
		"signature" : {
			"hash" : BinData(0,"gBtl/mH5w+1a3LzyX7cpprxQuJM="),
			"keyId" : 7007827695677472770
		}
	}
}

MaBeuLux88 · October 14, 2021, 12:15pm

All the members of a RS should be identical. It’s a good practice to ensure that you are actually HA and that any node can handle the workload.

Also, remember all the nodes are doing the write operations (using & maintaining indexes, etc). The secondaries work almost as much as the primary - especially if you are using readPreference options like nearest or SecondaryPreferred.

If the secondaries start to lag being the primary because they can’t keep up, this could be a big problem for the cluster because you won’t be able to use readConcern or writeConcern “majority” and this will start building cache pressure on the primary as the majority commit point will lag behind.

Do you have a single DB? Look like the DB turnright already has indexes for about 39 to 42 GB. Given that indexes need to be in memory, that’s already too much to handle for the secondaries with less RAM.

RAM = OS RAM + Connections + Indexes + Working Set + extra RAM for queries (aggregation, in memory sort, everything else). Looks like your “main” node has enough for the quantity of data I see here but the secondaries are too small.

I’m assuming that mongo-0, mongo-1 and mongo-2 are config with priority = 2 so they have a greater chance to be primary than the 2 other nodes.

I’m not sure if it’s possible or not but maybe there isn’t enough RAM on the secondaries to accept new connections from the primary when something needs to be done (replication, etc) as the secondaries are overloaded.

Cheers,
Maxime.

Yi_Liang · October 21, 2021, 10:15am

I found something.
I searched the logs with keyword ‘command’, and before 02:08 , mongo-1 printed ‘command’ logs normally , such as :

After 02:08 no more ‘command’ logs printed until 02:12 , the logs showed that all the operations were interrupted:

2021-09-29T02:12:17.831+0000 I COMMAND [conn46920] command local.$cmd command: find { find: "oplog.rs", limit: 1, sort: { $natural: 1 }, projection: { ts: 1, t: 1 }, $readPreference: { mode: "secondaryPreferred" }, $clusterTime: { clusterTime: Timestamp(1632881507, 5), signature: { hash: BinData(0, E6081E7F2BCB714FEFCDFDC93D95F32DCBB8F924), keyId: 6981363876726046721 } }, $db: "local" } numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 29999ms
...
2021-09-29T02:12:45.478+0000 I  COMMAND  [conn48365] command turnright.$cmd command: find { find: "Legacy", filter: { $and: [ { _id  *********  ), t: 40 } }, $db: "turnright" } numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 15999ms
...
2021-09-29T02:12:45.818+0000 I  COMMAND  [conn44043] command turnright.$cmd command: find { find: "Identity", filter: { legacyId: 134089364 **********} numYields:0 ok:0 errMsg:"operation was interrupted because a client disconnected" errName:ClientDisconnect errCode:279 reslen:422 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } } } protocol:op_msg 165999ms

Unable to acquire lock:

2021-09-29T02:09:16.733Z	command turnright.Bill command: find { find: "Bill", filter: { orderId: 12766990}, limit: 1, singleBatch: true, $db: "turnright", $clusterTime: { clusterTime: Timestamp(1632881337, 8), signature: { hash: BinData(0, B29AC9F8A2F54C96E1B7AAE0671F7A2DA2740E2C), keyId: 6981363876726046721 } }, lsid: { id: UUID("915c8453-1e57-448f-8ff5-e89cf281709b") }, txnNumber: 900, startTransaction: true, autocommit: false } numYields:0 ok:0 errMsg:"Encountered non-retryable error during query :: caused by :: Unable to acquire lock '{9934848871295096691: Database, 711476834440320883, turnright}' within 5ms' milliseconds." errName:LockTimeout errCode:24 reslen:437 protocol:op_msg 19422ms
2021-09-29T02:09:25.686Z	command turnright.Bill command: find { find: "Bill", filter: { orderId: 12766990}, limit: 1, singleBatch: true, $db: "turnright", $clusterTime: { clusterTime: Timestamp(1632881337, 7), signature: { hash: BinData(0, B29AC9F8A2F54C96E1B7AAE0671F7A2DA2740E2C), keyId: 6981363876726046721 } }, lsid: { id: UUID("08e68564-7c10-4b36-9fe3-8b78e361d303") }, txnNumber: 516, startTransaction: true, autocommit: false } numYields:0 ok:0 errMsg:"Encountered non-retryable error during query :: caused by :: Unable to acquire lock '{9934848871295096691: Database, 711476834440320883, turnright}' within 5ms' milliseconds." errName:LockTimeout errCode:24 reslen:437 protocol:op_msg 28376ms

It seems that all the operations were blocked by something until clients couldn’t wait anymore and disconnected. Could it be a lock or something else?

Debalina_Saha · April 4, 2022, 1:26pm

We faced a very similar issue where VMs were accessible. But MongoDb connection was failing and no application was able to connect during that duration. We had to restart one of the VM shard. Though the issue resolved after an hour , we are trying to investigate the issue.

After checking we found this “NetworkInterfaceExceededTimeLimit” error in mongos log, which I believe has been taken cared in Mongo 4.0 version. We are using Mongo version 4.4.7 currently. Also to mention , we have enough RAM : 400 GB RAM

1.Error message received in application:
A.
com.mongodb.MongoBulkWriteException:Bulk write operation error on server . Write errors: [BulkWriteError{index=0, code=202, message=‘Write results unavailable from :: caused by :: Couldn’t get a connection within the time limit of 181ms’, details={ }}].

B.
'message ‘Encountered non-retryable error during query :: caused by :: Couldn’t get a connection within the time limit of 650ms’ on server

2.Error message in Mongos logs:

{
“s”: “I”,
“c”: “NETWORK”,
“id”: 4712102,
“ctx”: “conn11471844”,
“msg”: “Host failed in replica set”,
“attr”: {
“replicaSet”: “rs1”,
“host”: “host:port”,
“error”:

{ “code”: 202, “codeName”: “NetworkInterfaceExceededTimeLimit”, “errmsg”: “Couldn’t get a connection within the time limit of 16ms” }
,
“action”: {
“dropConnections”: false,
“requestImmediateCheck”: false,
“outcome”:

{ “host”: “host:port”, “success”: false, “errorMessage”: “NetworkInterfaceExceededTimeLimit: Couldn’t get a connection within the time limit of 16ms” }
}
}
}

MaBeuLux88 · April 12, 2022, 4:51pm

This could also be due to a misconfiguration of the host. For linux, the first thing that comes to mind is the number of open files.

More information here:

I would also recommend to check the Production Notes to make sure that all the recommendations where correctly implemented:

Cheers,
Maxime.