Slow chunk movements when there are already MANY chunks in config database

Rob_De_Langhe · December 16, 2020, 3:02pm

I notice that many of these ‘moveChunk’ commands seem to have been queued but never executed:

the “sh.status()” command shows that the chunk movements have not completed for all shards (some shards are missing chunks, while the default shard has excess chunks)
any further attempt to run “moveChunk” appears to be silently ignored: none of them causes any change in the “sh.status()” overview
the logs of the ‘mongos’ router show many “moveChunk” commands that are still being processed although we are no longer initiating them via “adminCommand”; these logs seem to indicate long-running queued moveChunk commands that are now being aborted with a “cursorExhausted=true” message:

Log sample:

{"t":{"$date":"2020-12-16T15:49:21.492+01:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1736393","msg":"Slow query","attr": {"type":"command","ns":"config.changelog","appName":"MongoDB Shell","command":{"aggregate":"changelog","pipeline":[{"$match":{"time":{"$gt":{"$date":"2020-12-15T14:49:20.088Z"}},"what":"moveChunk.from","$or":[{"details.errmsg":{"$exists":true}},{"details.note":{"$ne":"success"}}]}},{"$group":{"_id":{"msg":"$details.errmsg","from":"$details.from","to":"$details.to"},"count":{"$sum":1.0}}},{"$project": {"_id":{"$ifNull":["$_id.msg","aborted"]},"from":"$_id.from","to":"$_id.to","count":"$count"}}],"cursor":{},"lsid":{"id":{"$uuid":"e1c5cc31-476e-44bb-be55-db1687d3b7e4"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1608130173,"i":28}},"signature":{"hash":{"$binary":{"base64":"DYpcel5ZTCQ5+Sn+iFO/V5Pvz8g=","subType":"0"}},"keyId":6880911790037794833}},"$db":"config"},"nShards":1,"cursorExhausted":true,"numYields":0,"nreturned":0,"reslen":230,"protocol":"op_msg","durationMillis":802}}

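To make such an entry easier to read, here is a minimal sketch (plain JavaScript, runnable in Node) that pulls the key fields out of one structured log line. The entry in the sketch is an abbreviated copy of the sample above (the pipeline, lsid and $clusterTime fields are left out), and the helper name summarizeSlowQuery is purely illustrative, not part of any MongoDB tool.

```javascript
// Abbreviated copy of the log entry above; the long "pipeline", "lsid" and
// "$clusterTime" fields are omitted here only to keep the sketch short.
const sampleLine = JSON.stringify({
  t: { $date: "2020-12-16T15:49:21.492+01:00" },
  s: "I",
  c: "COMMAND",
  id: 51803,
  ctx: "conn1736393",
  msg: "Slow query",
  attr: {
    type: "command",
    ns: "config.changelog",
    appName: "MongoDB Shell",
    command: { aggregate: "changelog", $db: "config" },
    nShards: 1,
    cursorExhausted: true,
    numYields: 0,
    nreturned: 0,
    reslen: 230,
    protocol: "op_msg",
    durationMillis: 802,
  },
});

// Hypothetical helper: summarize one structured "Slow query" log line.
function summarizeSlowQuery(line) {
  const entry = JSON.parse(line);
  const attr = entry.attr || {};
  const command = attr.command || {};
  // The first key of the command document is the name of the command that ran.
  const commandName = Object.keys(command)[0];
  return {
    message: entry.msg,
    commandName: commandName,
    namespace: attr.ns,
    appName: attr.appName,
    cursorExhausted: attr.cursorExhausted,
    nreturned: attr.nreturned,
    durationMillis: attr.durationMillis,
  };
}

console.log(summarizeSlowQuery(sampleLine));
```

Run against the sample, this reports the logged command name as "aggregate" on the namespace "config.changelog", issued by the application "MongoDB Shell".
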
These log entries keep coming, as if there is an awfully long queue of chunk-movement attempts that stay queued for a long time and then abort for some reason (“cursorExhausted=true”?).
Any new attempt to move another chunk terminates within a few milliseconds, but nothing has actually changed when I check “sh.status()”.
So to me it looks like chunk movements get queued, but that queue is not processed (blocked for some reason?) and the attempts are eventually aborted.
=> How can I get rid of this queue?
=> How can I find the reason why the moveChunk commands don’t get executed?

rgds
Rob