Mongo primary stopped accepting connections

We faced an issue in our production environment where the primary VM was reachable, but it was not accepting any connections. All applications that tried to connect to MongoDB were failing, attempts to log in to the primary member with the mongo shell were unsuccessful, and a manual attempt to start MongoDB on the VM also failed.
Since the mongod process did not go down completely, no election happened, and the problematic VM was still shown as "PRIMARY" in rs.status(). We had to restart the server, after which the issue was resolved. We need to find the root cause (RCA) for this.

We are using MongoDB 4.4.7 Community Edition with the configuration below (a sketch for confirming this topology follows the list):
config replica set - 1 primary, 2 secondaries
shard1 - 1 primary, 2 secondaries
shard2 - 1 primary, 2 secondaries
shard3 - 1 primary, 2 secondaries
2 query routers (mongos)
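
For reference, a topology like this can be confirmed from the shell. This is only a sketch using the shard names listed above; run the first two commands against a mongos and the last against a replica set member:

// On a mongos: list the shards registered in the cluster (shard1, shard2, shard3 expected)
db.adminCommand({ listShards: 1 })

// On a mongos: summary of the sharded cluster (config servers, shards, balancer state)
sh.status()

// On any replica set member (config or shard): show members and who is PRIMARY
rs.status().members.forEach(m => print(m.name + " : " + m.stateStr))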

errorMessage":"NetworkInterfaceExceededTimeLimit: Couldn’t get a connection within the time limit .

We checked the available connections on both query routers:
QR1:
"current" : 7654,
"available" : 43546,
"totalCreated" : 134309236,
"active" : 2890,
"exhaustIsMaster" : 487,
"exhaustHello" : 229,
"awaitingTopologyChanges" : 716

QR2:
"current" : 7746,
"available" : 43454,
"totalCreated" : 134299931,
"active" : 2997,
"exhaustIsMaster" : 487,
"exhaustHello" : 229,
"awaitingTopologyChanges" : 716
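
For reference, counters like the above can be read on each query router with serverStatus (a sketch, assuming the figures were taken from the connections section of that output):

// Run on each mongos: returns the connection counters shown above
// (current, available, totalCreated, active, ...)
db.serverStatus().connections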

We also checked the logs thoroughly: connections were being accepted until 07:09 UTC, and the "NetworkInterfaceExceededTimeLimit" error was still not present at 07:11 UTC. The error started suddenly at exactly 2023-03-29T07:12 UTC.
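
As a sketch of one way to narrow this down from the shell, the most recent in-memory log lines can be pulled and filtered for the error (the on-disk mongod/mongos log files cover a longer window and are the authoritative source):

// Fetch the most recent in-memory log lines and keep only those that
// mention the connection-timeout error.
var res = db.adminCommand({ getLog: "global" });
res.log
  .filter(line => line.indexOf("NetworkInterfaceExceededTimeLimit") !== -1)
  .forEach(line => print(line));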

Hi @Debalina_Saha, and welcome to the MongoDB community forums!

As mentioned in the MongoDB documentation:

Each sharded cluster must have its own config servers. Do not use the same config servers for different sharded clusters.

It's important to note that administrative operations can affect the performance of the sharded cluster. In this particular case, we would suggest considering more than one config server to absorb any performance impact that may arise from the deployment. This could help mitigate potential issues and keep the cluster running smoothly.

Also, the forum post mentions a workaround for a similar issue.

If the above recommendations do not work, could you share the output of the following (a sketch for collecting these follows the list):

  1. sh.status() during the timeout.
  2. rs.status() from the primary shard server.
  3. Details of the VM setup and how the deployment was created.
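
A sketch of how these outputs could be collected from the shell:

// 1. On a mongos, while the timeouts are occurring:
sh.status()

// 2. On the primary of the affected shard (connect to that member directly):
rs.status()

// Optionally, the same connection counters as shared above, for comparison:
db.serverStatus().connections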

Regards
Aasawari