Unable to acquire X lock on Collection during moveChunk

On a MongoDB v4.4.6-ent sharded cluster with 5 shards, the balancer's moveChunk operations keep failing.

sh.status() result:

  balancer:
        Currently enabled:  no
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "XX",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("b5bccab7-b960-47ed-81c1-d72a7f90dd21"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
                        too many chunks to print, use verbose if you want to force print

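For reference, the per-shard chunk counts above can also be checked directly against the config database. This is a sketch run on a mongos, assuming the namespace X.A shown in the output (on v4.4, config.chunks still keys chunks by the ns field):

```javascript
// Count chunks of X.A per shard from the config database (run on a mongos).
db.getSiblingDB("config").chunks.aggregate([
  { $match: { ns: "X.A" } },
  { $group: { _id: "$shard", nChunks: { $sum: 1 } } },
  { $sort: { nChunks: -1 } }
])
```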
Clearly, the chunks are unevenly distributed across the shards. I then found many moveChunk entries in the logs (sensitive information masked):

{
    "t": {
        "$date": "2021-09-26T20:05:19.443+00:00"
    },
    "s": "I",
    "c": "COMMAND",
    "id": 51803,
    "ctx": "conn660488",
    "msg": "Slow query",
    "attr": {
        "type": "command",
        "ns": "admin.$cmd",
        "command": {
            "moveChunk": "X.A",
            "shardVersion": [{
                    "$timestamp": {
                        "t": 514,
                        "i": 0
                    }
                }, {
                    "$oid": "5ff83e4ba85a6bd465831542"
                }
            ],
            "epoch": {
                "$oid": "5ff83e4ba85a6bd465831542"
            },
            "configdb": "",
            "fromShard": "mongo-1",
            "toShard": "mongo-3",
            "min": {
                "Uuid": {
                    "$minKey": 1
                }
            },
            "max": {
                "Uuid": "XXX"
            },
            "maxChunkSizeBytes": 67108864,
            "waitForDelete": false,
            "forceJumbo": 0,
            "takeDistLock": false,
            "writeConcern": {},
            "$clusterTime": {
                "clusterTime": {
                    "$timestamp": {
                        "t": 1632686717,
                        "i": 2
                    }
                },
                "signature": {
                    "hash": {
                        "$binary": {
                            "base64": "=",
                            "subType": "0"
                        }
                    },
                    "keyId": 6952576454697680898
                }
            },
            "$configServerState": {
                "opTime": {
                    "ts": {
                        "$timestamp": {
                            "t": 1632686717,
                            "i": 2
                        }
                    },
                    "t": 34
                }
            },
            "$db": "admin"
        },
        "numYields": 0,
        "ok": 0,
        "errMsg": "Unable to acquire X lock on '{12443225184803746029: Collection, 914010138735276269, X.A}' within 500ms. opId: 662866950, op: MoveChunk, connId: 0.",
        "errName": "LockTimeout",
        "errCode": 24,
        "reslen": 546,
        "locks": {},
        "protocol": "op_msg",
        "durationMillis": 2392
    }
}

Because the error occurs on the first chunk, the entire balancing process is blocked, leaving shard mongo-1 holding much more data than the other shards.
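Since LockTimeout (code 24) means some other operation held the collection's exclusive lock for longer than the 500ms acquisition window, one diagnostic step is to look for the conflicting operation on the source shard's primary. A sketch, assuming the namespace X.A from above (the createIndexes match is an example of one common lock holder, not something confirmed in this thread):

```javascript
// Run on the mongo-1 primary: list operations that have been running
// for a while and touch X.A, which may be holding the collection lock.
db.currentOp({
  $or: [
    { ns: "X.A" },
    { "command.createIndexes": "A" }  // index builds take strong collection locks
  ],
  secs_running: { $gte: 1 }
})
```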

I checked the chunk sizes; no jumbo chunks were found.
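The jumbo check was done along these lines (config.chunks flags oversized, unsplittable chunks with a jumbo field; run on a mongos):

```javascript
// List any chunks of X.A flagged as jumbo.
db.getSiblingDB("config").chunks.find({ ns: "X.A", jumbo: true })
```

In this case the query returned no documents.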

I tried moving chunks manually, but got a similar error message:

MongoDB Enterprise mongos> sh.moveChunk("X.A", {"Uuid": "XX"}, "mongo-3" )
{
        "ok" : 0,
        "errmsg" : "Unable to acquire X lock on '{13328793763114131844: Collection, 1799578717045662084, X.A}' within 500ms. opId: 719103624, op: MoveChunk, connId: 0.",
        "code" : 24,
        "codeName" : "LockTimeout",
        "operationTime" : Timestamp(1632890816, 39),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1632890816, 39),
                "signature" : {
                        "hash" : BinData(0,"/="),
                        "keyId" : NumberLong("6952576454697680898")
                }
        }
}

How can I resolve the LockTimeout issue above?

Was this ever resolved?

Simply restarting the failing pod to trigger a primary switch resolved this issue.
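If the shard's replica set has secondaries, a less disruptive way to force the same primary switch is to step down the primary instead of restarting the pod. A sketch (the 60-second value is an arbitrary choice for how long the node should refuse re-election):

```javascript
// On the current primary of the source shard's replica set:
rs.stepDown(60)
```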

Detailed explanation here:

@finisky: Yes, I had checked the link you provided and even tried restarting the source shard. I restarted the entire sharded cluster per https://www.mongodb.com/docs/v4.4/tutorial/restart-sharded-cluster/ and the issue was still there.

I also restarted only the source shard, with no luck.

We have only one node in each shard, without any secondaries (no replication).

Even though migration randomly gets stuck at some chunks, it continues after a while; it's just that a lot of time is wasted on this. If I'm not wrong, chunk migration failures won't cause any data loss. If there is no data loss, that's acceptable.

But I think this has to be fixed, or at least the root cause analyzed.

To my understanding, a chunk migration failure won't lead to data loss.

Unfortunately, we haven't found the root cause yet. You could submit a bug to the MongoDB Issue Tracker if you can reproduce the issue.