ReplicaSet in a sharded cluster locked, DROP BLOCKED

Alexandre_Rico · November 22, 2023, 7:46am

Hello,

I need help to understand a problem on our production sharded cluster.

I have a sharded cluster of 12 replica sets.
RS12 is the primary shard for a database (so all of its unsharded collections are created on it).
For some reason, RS12 is locked: all drop/rename/… operations time out and never take effect.
When I connect with mongosh directly to the RS12 primary, the drop works and the collection is removed.
The problem only occurs when I connect through mongos, on the sharded cluster.

The error I get when dropping from mongos is:

Mongo Server error (MongoCommandException): Command failed with error 202 (NetworkInterfaceExceededTimeLimit): 'While recovering the dist lock manager for term 35 :: caused by :: Request 111842 timed out, deadline was

I’ve tried everything I know: rebooted everything, switched the primary on RS12. Nothing works.

I see messages like these:
{"t":{"$date":"2023-11-21T17:16:50.378+01:00"},"s":"W", "c":"SHARDING", "id":570180, "ctx":"replSetDistLockPinger","msg":"Error recovering dist lock manager","attr":

{"t":{"$date":"2023-11-21T14:37:39.616+01:00"},"s":"I", "c":"NETWORK", "id":6006301, "ctx":"ReplicaSetMonitor-TaskExecutor","msg":"Replica set primary server change detected","attr":{"replicaSet":"rsserver","topologyType":"ReplicaSetNoPrimary","primary":"Unknown","durationMillis":30000}}
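
To see how often the monitor reports a replica set without a primary, log lines like the one above can be filtered with a small script. This is only a minimal sketch: the function name `no_primary_events` is made up for illustration, it assumes one JSON document per line with straight quotes, and it silently skips lines that are truncated or not valid JSON.

```python
import json


def no_primary_events(lines):
    """Yield (timestamp, replica_set, duration_ms) for entries whose
    attr.topologyType is "ReplicaSetNoPrimary".

    Hypothetical helper: assumes one structured JSON log entry per line.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or non-JSON line, ignore it
        if not isinstance(entry, dict):
            continue
        attr = entry.get("attr")
        if not isinstance(attr, dict):
            continue
        if attr.get("topologyType") == "ReplicaSetNoPrimary":
            timestamp = (entry.get("t") or {}).get("$date")
            yield timestamp, attr.get("replicaSet"), attr.get("durationMillis")


# The log line quoted above, with straight quotes.
sample = (
    '{"t":{"$date":"2023-11-21T14:37:39.616+01:00"},"s":"I", "c":"NETWORK", '
    '"id":6006301, "ctx":"ReplicaSetMonitor-TaskExecutor",'
    '"msg":"Replica set primary server change detected",'
    '"attr":{"replicaSet":"rsserver","topologyType":"ReplicaSetNoPrimary",'
    '"primary":"Unknown","durationMillis":30000}}'
)

events = list(no_primary_events([sample]))
for timestamp, replica_set, duration_ms in events:
    print(timestamp, replica_set, duration_ms)
```

In practice the lines would come from the log file of a mongod or mongos process, for example by passing an open file object instead of the one-element list.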

Alexandre_Rico · November 23, 2023, 1:22pm

UPDATE:

We were running MongoDB 6.0.2; upgrading to 7.0.1 fixed the issue.
