I have a MongoDB sharded cluster (v4.4.6-ent) with 5 shards, and the balancer's moveChunk operations keep failing.
Output of sh.status():
  balancer:
        Currently enabled:  no
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                7 : Failed with error 'aborted', from mongo-1 to mongo-3
                7208 : Failed with error 'aborted', from mongo-1 to mongo-4
  databases:
        { "_id" : "XX", "primary" : "mongo-1", "partitioned" : true, "version" : { "uuid" : UUID("b5bccab7-b960-47ed-81c1-d72a7f90dd21"), "lastMod" : 1 } }
                X.A
                        shard key: { "Uuid" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                mongo-0 231
                                mongo-1 327
                                mongo-2 230
                                mongo-3 208
                        too many chunks to print, use verbose if you want to force print
The chunks are clearly unbalanced across the shards. I then found many moveChunk log entries like the following (sensitive information masked):
{
"t": {
"$date": "2021-09-26T20:05:19.443+00:00"
},
"s": "I",
"c": "COMMAND",
"id": 51803,
"ctx": "conn660488",
"msg": "Slow query",
"attr": {
"type": "command",
"ns": "admin.$cmd",
"command": {
"moveChunk": "X.A",
"shardVersion": [{
"$timestamp": {
"t": 514,
"i": 0
}
}, {
"$oid": "5ff83e4ba85a6bd465831542"
}
],
"epoch": {
"$oid": "5ff83e4ba85a6bd465831542"
},
"configdb": "",
"fromShard": "mongo-1",
"toShard": "mongo-3",
"min": {
"Uuid": {
"$minKey": 1
}
},
"max": {
"Uuid": "XXX"
},
"maxChunkSizeBytes": 67108864,
"waitForDelete": false,
"forceJumbo": 0,
"takeDistLock": false,
"writeConcern": {},
"$clusterTime": {
"clusterTime": {
"$timestamp": {
"t": 1632686717,
"i": 2
}
},
"signature": {
"hash": {
"$binary": {
"base64": "=",
"subType": "0"
}
},
"keyId": 6952576454697680898
}
},
"$configServerState": {
"opTime": {
"ts": {
"$timestamp": {
"t": 1632686717,
"i": 2
}
},
"t": 34
}
},
"$db": "admin"
},
"numYields": 0,
"ok": 0,
"errMsg": "Unable to acquire X lock on '{12443225184803746029: Collection, 914010138735276269, X.A}' within 500ms. opId: 662866950, op: MoveChunk, connId: 0.",
"errName": "LockTimeout",
"errCode": 24,
"reslen": 546,
"locks": {},
"protocol": "op_msg",
"durationMillis": 2392
}
}
Because this error occurs on the first chunk, the whole balancing process is blocked, which leaves shard mongo-1
holding much more data than the other shards.
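For anyone who wants to pull entries like the one above out of a mongod log, here is a minimal sketch. It assumes the log is in the 4.4 structured format with one JSON document per line; the function name failedMoveChunks and the summary field names are hypothetical, picked only for illustration.

```javascript
// Sketch: collect failed moveChunk commands from structured (JSON) log lines.
// Assumes one JSON document per line, shaped like the entry shown above.
function failedMoveChunks(logLines) {
  const failures = [];
  for (const line of logLines) {
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (err) {
      continue; // skip lines that are not valid JSON
    }
    if (entry === null || typeof entry !== "object") {
      continue; // skip JSON values that are not log documents
    }
    const attr = entry.attr || {};
    const command = attr.command || {};
    // Keep only moveChunk commands that reported ok: 0.
    if (command.moveChunk === undefined || attr.ok !== 0) {
      continue;
    }
    failures.push({
      time: entry.t ? entry.t.$date : undefined,
      ns: command.moveChunk,
      from: command.fromShard,
      to: command.toShard,
      errName: attr.errName,
      durationMillis: attr.durationMillis,
    });
  }
  return failures;
}
```

Grouping the returned objects by errName or by the from/to pair should show whether every failure is the same LockTimeout on the same chunk range.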
I checked the chunk sizes and found no jumbo chunks.
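A check along these lines can be sketched as follows. It assumes the 4.4 layout of config.chunks, where each chunk document carries ns and shard fields and a jumbo: true field only when the chunk has been flagged; summarizeChunks is a hypothetical helper name, not a built-in.

```javascript
// Sketch: count chunks per shard and collect chunks flagged as jumbo.
// chunkDocs is an array of documents in the 4.4 config.chunks shape.
function summarizeChunks(chunkDocs, ns) {
  const perShard = {};
  const jumbo = [];
  for (const chunk of chunkDocs) {
    if (chunk.ns !== ns) {
      continue; // ignore chunks of other collections
    }
    perShard[chunk.shard] = (perShard[chunk.shard] || 0) + 1;
    if (chunk.jumbo === true) {
      jumbo.push(chunk);
    }
  }
  return { perShard, jumbo };
}

// Possible use in the mongo shell, connected through mongos:
//   const docs = db.getSiblingDB("config").chunks.find({ ns: "X.A" }).toArray();
//   const summary = summarizeChunks(docs, "X.A");
//   printjson(summary.perShard);   // should match the counts from sh.status()
//   print(summary.jumbo.length);   // 0 means no chunk carries the jumbo flag
```

Note that this only looks at the jumbo flag; it does not measure the actual size of each chunk.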
I also tried to move a chunk manually, but got a similar error:
MongoDB Enterprise mongos> sh.moveChunk("X.A", {"Uuid": "XX"}, "mongo-3" )
{
"ok" : 0,
"errmsg" : "Unable to acquire X lock on '{13328793763114131844: Collection, 1799578717045662084, X.A}' within 500ms. opId: 719103624, op: MoveChunk, connId: 0.",
"code" : 24,
"codeName" : "LockTimeout",
"operationTime" : Timestamp(1632890816, 39),
"$clusterTime" : {
"clusterTime" : Timestamp(1632890816, 39),
"signature" : {
"hash" : BinData(0,"/="),
"keyId" : NumberLong("6952576454697680898")
}
}
}
How can I resolve this LockTimeout error?