Primary mongod hit 1M active sessions in cache

Overview

We’re running a sharded MongoDB cluster on Community Edition v4.4.

The primary mongod in one of the data shards (‘shardX’ for short) has hit the active-sessions-in-cache limit of 1M, and new sessions cannot be created. The secondary mongods in ‘shardX’ also had a high number of stale config errors that caused all read requests to be blocked; the mitigation was to hide the secondaries.

Current Situation

Right now all the secondary mongods in ‘shardX’ are hidden, and only the primary mongod is serving traffic. We are facing two problems.

  1. Unhiding a secondary mongod or running moveChunk causes that secondary to be unable to find the mongo config servers, throw stale config errors, and bring the site down.

Stale config log: "ok":0,"errMsg":"epoch mismatch detected for collection-xxx","errName":"StaleConfig","errCode":13388,

Cannot find config servers log: {"t":{"$date":"2023-12-22T22:17:08.689-08:00"},"s":"I", "c":"SH_REFR", "id":24100, "ctx":"ShardServerCatalogCacheLoader-19","msg":"Error refreshing cached database entry","attr":{"db":"sweeper","durationMillis":20000,"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"nearest\" } for set main-config-server"}}

Once we unhide a secondary mongod and let it serve traffic, that secondary cannot find the mongo config servers and throws stale config errors. All queries (including scatter-gather queries) that hit this secondary fail, and as a result the site goes down.
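
For context, the config server connection string each shard member has cached is stored in its shardIdentity document in admin.system.version; it can be inspected like this (a minimal diagnostic sketch, run in mongosh against the affected mongod):

// read the cached shard identity on a member of 'shardX'
var identity = db.getSiblingDB("admin").getCollection("system.version").findOne({ _id: "shardIdentity" });
printjson(identity); // compare identity.configsvrConnectionString with the actual 'main-config-server' members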

  2. The number of cached active sessions on the primary mongod has reached the upper limit of 1 million, so it cannot accept any new sessions. To connect to this shard, we have to connect with implicit sessions disabled:
mongosh primary-host-of-shardX --quiet --tls --tlsCertificateKeyFile /etc/mongo_secrets/mongodb_keystore.pem --tlsCAFile /etc/mongo_secrets/root_ca.crt --authenticationMechanism 'MONGODB-X509' --authenticationDatabase '$external' --disableImplicitSessions

All connections that do not disable implicit sessions are rejected.
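
Once connected this way, the size of the session cache can be checked with serverStatus (a minimal sketch; treat the exact field layout as an assumption for your version):

// run in mongosh after connecting with --disableImplicitSessions
var cache = db.serverStatus().logicalSessionRecordCache;
printjson({ activeSessionsCount: cache.activeSessionsCount }); // currently around 1M on the primary of 'shardX'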

Solutions We Tried

So far, we have tried the following:

  • Added 2 new replicas to the ‘shardX’ shard.
  • Rolling-restarted all the mongos.
  • Restarted all the config servers.
  • Added 2 new mongo config servers.
  • Removed the backup node for the ‘shardX’ shard.
  • Ran moveChunk.
  • Killed all sessions on all the replicas in the ‘shardX’ shard (sketched below).
  • Updated the outdated config server address in admin.system.version on the primary instance.

But none of the above seemed to work.
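
For reference, the session kill we ran was along these lines (a sketch of the kind of command used; killAllSessions with an empty array targets all users’ sessions and needs the appropriate privileges):

// run against each member of 'shardX' (and the mongos) to drop all in-memory sessions
db.adminCommand({ killAllSessions: [] });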

Supplementary Info

When looking at the mongod.log of the primary mongod, we found many warnings like this:

{"t":{"$date":"2023-12-27T19:07:09.352-08:00"},"s":"W",  "c":"NETWORK",  "id":23019,   "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"DNS resolution while connecting to peer was slow","attr":{"peer":"hostname-of-shardY:27017","durationMillis":563392569}}

We verified on the primary node of ‘shardX’ that DNS resolution appears to be fine, using nslookup and dig on ‘hostname-of-shardY’ and ‘hostname-of-main-config-server’, and we can connect to ‘shardY’ and ‘main-config-server’ with mongosh.
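
Beyond DNS, the node’s own view of the ‘main-config-server’ set can be checked from mongosh via connPoolStats, which reports the replica set monitor’s host list and health (a sketch; the exact output layout here is our assumption):

// run on a node that reports FailedToSatisfyReadPreference for main-config-server
var stats = db.adminCommand({ connPoolStats: 1 });
printjson(stats.replicaSets["main-config-server"]); // lists each monitored host and whether it is considered reachable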

Further Plan and Help

Since this is a critical production system, we’re very concerned that stepping down the primary or restarting the primary node could take the whole data shard offline (e.g. being unable to reach the config servers could result in election failures).

Has anyone met such issues before? Does anyone have any comments or advice on how to isolate the issue, or on next steps? Any comments/advice are much appreciated!

So all reads and writes are being served by the primary only? What’s the QPS for that host?
Based on my understanding, each operation requires a session on the server side, so if QPS is high enough you can easily reach 1 million active sessions (sessions have an idle timeout).

That being said, it’s possible that there’s a session leak, which is more difficult to find.
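
If you want to look for one, the sessions currently cached in memory on a node can be counted with the $listLocalSessions aggregation stage; a rough sketch (assuming a user allowed to list all users’ sessions):

// count the sessions cached in memory on this mongod
db.getSiblingDB("admin").aggregate([
  { $listLocalSessions: { allUsers: true } },
  { $count: "localSessions" }
]);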

So you have confirmed (via nslookup/dig) that the primary can successfully connect to any secondary in the same shard and to the config servers. But given the errors in your post, it seems those secondaries are not able to connect to the config servers. Did you verify that?

Hi Kobe,

Thanks for the reply! Please see my comments inline below:

So all reads and writes are being served by the primary only? What’s the QPS for that host?

Yes, all reads and writes are being served by the primary only.

The QPS is about:

1.7k commands/sec
1.6k getMore/sec
<1k writes/sec

Data Disk Size:
1.36TB
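
For completeness, an equivalent way to sample these rates directly in mongosh is to diff serverStatus opcounters over a short window (a sketch):

// sample per-second op rates on this mongod over a 10-second window
var before = db.serverStatus().opcounters;
sleep(10 * 1000);
var after = db.serverStatus().opcounters;
["command", "getmore", "query", "insert", "update", "delete"].forEach(function (op) {
  print(op + ":", ((after[op] - before[op]) / 10).toFixed(1), "per sec");
});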

That being said, it’s possible that there’s a session leak, which is more difficult to find.

Right. We did notice that the background job “LogicalSessionCacheReap” keeps writing the log below to mongod.log:

{"t":{"$date":"2023-12-27T18:19:13.506-08:00"},"s":"I",  "c":"CONTROL",  "id":20712,   "ctx":"LogicalSessionCacheReap","msg":"Sessions collection is not set up; waiting until next sessions reap interval","attr":{"error":"FailedToSatisfyReadPreference: Could not find host matching read preference { mode: \"nearest\" } for set main-config-server"}}

So I suspect there might be connection issues between the primary mongod of ‘shardX’ and ‘main-config-server’. That’s why I ran nslookup/dig and mongosh from the primary host against the config server, and the results looked good.
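
As far as we understand, that message refers to config.system.sessions, which the session refresh/reap jobs reach through the config servers. A quick sanity check from a mongos looks roughly like this (a sketch):

// run from a mongos: confirm the sessions collection exists and is reachable
var sessions = db.getSiblingDB("config").getCollection("system.sessions");
print("session records:", sessions.countDocuments({}));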

So you have confirmed (via nslookup/dig) that the primary can successfully connect to any secondary in the same shard and to the config servers.

Yes, from the primary to the config servers.

We didn’t verify from the primary to the secondaries, because replication is OK within the shard.

it seems those secondaries are not able to connect to the config servers. Did you verify that?

Not verified from the secondaries yet. We will try that later and get back soon.

OK, the numbers are indeed low, so high QPS may not be the cause.

Verifying from the secondaries is definitely worth a try, based on the information in your post.

From the log messages, it looks very much like the primary instance is not able to connect to any config server (check which host is “nearest”). Here “connect” may be at the network level and/or the application (MongoDB protocol) level; I’m not sure which.
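
To see what the config server side itself looks like, it may also be worth connecting to one member of ‘main-config-server’ and checking member health directly; a minimal sketch:

// run after connecting mongosh to any member of 'main-config-server'
rs.status().members.forEach(function (m) {
  print(m.name, m.stateStr, "health:", m.health);
});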

Quick update: we ended up force-stepping down the problematic primary and removing it from the shard, then updated the record in the config database (db.shards.update) to reflect the new primary host and port. The issues are gone now.
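
For anyone hitting the same situation, the recovery was roughly along these lines (a sketch with placeholder hosts and ports; editing the config database by hand should be a last resort, and the exact commands we ran may have differed slightly):

// 1. on the problematic primary of 'shardX': force it to step down
db.adminCommand({ replSetStepDown: 120, force: true });

// 2. on the new primary of 'shardX': remove the old member from the replica set
rs.remove("old-primary-host-of-shardX:27017");

// 3. on the config server primary: update the shard's host string in config.shards
db.getSiblingDB("config").shards.updateOne(
  { _id: "shardX" },
  { $set: { host: "shardX/new-primary-host-of-shardX:27017,secondary-host-of-shardX:27017" } }
);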

Thanks for the support, Kobe!