Unable to acquire X lock on Collection during moveChunk

On a MongoDB v4.4.6-ent sharded cluster with 5 shards, the balancer's moveChunk operations keep failing.

sh.status() result:

  balancer:
        Currently enabled:  no
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        {  "_id" : "XX",  "primary" : "mongo-1",  "partitioned" : true,  "version" : {  "uuid" : UUID("b5bccab7-b960-47ed-81c1-d72a7f90dd21"),  "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0       231
                                mongo-1       327
                                mongo-2       230
                                mongo-3       208
                        too many chunks to print, use verbose if you want to force print

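For reference, the per-shard chunk counts above can also be checked directly against the config database. This is a sketch run on a mongos, assuming the namespace X.A shown in the output (on v4.4, config.chunks still keys chunks by the ns field):

```javascript
// Count chunks of X.A per shard from the config database (run on a mongos).
db.getSiblingDB("config").chunks.aggregate([
  { $match: { ns: "X.A" } },
  { $group: { _id: "$shard", nChunks: { $sum: 1 } } },
  { $sort: { nChunks: -1 } }
])
```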
Clearly, the chunks are unevenly distributed across the shards. I then found many moveChunk entries in the logs (sensitive information masked):

{
    "t": {
        "$date": "2021-09-26T20:05:19.443+00:00"
    },
    "s": "I",
    "c": "COMMAND",
    "id": 51803,
    "ctx": "conn660488",
    "msg": "Slow query",
    "attr": {
        "type": "command",
        "ns": "admin.$cmd",
        "command": {
            "moveChunk": "X.A",
            "shardVersion": [{
                    "$timestamp": {
                        "t": 514,
                        "i": 0
                    }
                }, {
                    "$oid": "5ff83e4ba85a6bd465831542"
                }
            ],
            "epoch": {
                "$oid": "5ff83e4ba85a6bd465831542"
            },
            "configdb": "",
            "fromShard": "mongo-1",
            "toShard": "mongo-3",
            "min": {
                "Uuid": {
                    "$minKey": 1
                }
            },
            "max": {
                "Uuid": "XXX"
            },
            "maxChunkSizeBytes": 67108864,
            "waitForDelete": false,
            "forceJumbo": 0,
            "takeDistLock": false,
            "writeConcern": {},
            "$clusterTime": {
                "clusterTime": {
                    "$timestamp": {
                        "t": 1632686717,
                        "i": 2
                    }
                },
                "signature": {
                    "hash": {
                        "$binary": {
                            "base64": "=",
                            "subType": "0"
                        }
                    },
                    "keyId": 6952576454697680898
                }
            },
            "$configServerState": {
                "opTime": {
                    "ts": {
                        "$timestamp": {
                            "t": 1632686717,
                            "i": 2
                        }
                    },
                    "t": 34
                }
            },
            "$db": "admin"
        },
        "numYields": 0,
        "ok": 0,
        "errMsg": "Unable to acquire X lock on '{12443225184803746029: Collection, 914010138735276269, X.A}' within 500ms. opId: 662866950, op: MoveChunk, connId: 0.",
        "errName": "LockTimeout",
        "errCode": 24,
        "reslen": 546,
        "locks": {},
        "protocol": "op_msg",
        "durationMillis": 2392
    }
}

Because the error occurs on the first chunk, the entire balancing process is blocked, leaving shard mongo-1 holding much more data than the other shards.
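Since LockTimeout (code 24) means some other operation held the collection's exclusive lock for longer than the 500ms acquisition window, one diagnostic step is to look for the conflicting operation on the source shard's primary. A sketch, assuming the namespace X.A from above (the createIndexes match is an example of one common lock holder, not something confirmed in this thread):

```javascript
// Run on the mongo-1 primary: list operations that have been running
// for a while and touch X.A, which may be holding the collection lock.
db.currentOp({
  $or: [
    { ns: "X.A" },
    { "command.createIndexes": "A" }  // index builds take strong collection locks
  ],
  secs_running: { $gte: 1 }
})
```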

I checked the chunk sizes; no jumbo chunks were found.
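The jumbo check was done along these lines (config.chunks flags oversized, unsplittable chunks with a jumbo field; run on a mongos):

```javascript
// List any chunks of X.A flagged as jumbo.
db.getSiblingDB("config").chunks.find({ ns: "X.A", jumbo: true })
```

In this case the query returned no documents.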

I tried moving chunks manually, but got a similar error message:

MongoDB Enterprise mongos> sh.moveChunk("X.A", {"Uuid": "XX"}, "mongo-3" )
{
        "ok" : 0,
        "errmsg" : "Unable to acquire X lock on '{13328793763114131844: Collection, 1799578717045662084, X.A}' within 500ms. opId: 719103624, op: MoveChunk, connId: 0.",
        "code" : 24,
        "codeName" : "LockTimeout",
        "operationTime" : Timestamp(1632890816, 39),
        "$clusterTime" : {
                "clusterTime" : Timestamp(1632890816, 39),
                "signature" : {
                        "hash" : BinData(0,"/="),
                        "keyId" : NumberLong("6952576454697680898")
                }
        }
}

How can I resolve the LockTimeout issue above?

Was this ever resolved?

Simply restarting the failing pod to trigger a primary switch resolved this issue.
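If the shard's replica set has secondaries, a less disruptive way to force the same primary switch is to step down the primary instead of restarting the pod. A sketch (the 60-second value is an arbitrary choice for how long the node should refuse re-election):

```javascript
// On the current primary of the source shard's replica set:
rs.stepDown(60)
```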

Detailed explanation here:

@finisky: Yes, I had checked the link you provided and even tried restarting the source shard. I restarted the entire sharded cluster per https://www.mongodb.com/docs/v4.4/tutorial/restart-sharded-cluster/ and the issue was still there.

I also restarted only the source shard, with no luck.

We have only one node in each shard, without any secondaries (no replication).

Even though migration randomly gets stuck at some chunks, it continues after a while; it's just that a lot of time is wasted on this. If I'm not wrong, chunk migration failures won't cause any data loss. If there is no data loss, that's acceptable.

But I think this has to be fixed, or at least the root cause analyzed.

To my understanding, a chunk migration failure won't lead to data loss.

Unfortunately, we haven't found the root cause yet. You could submit a bug to the MongoDB Issue Tracker if you can reproduce the issue.