How can I investigate slow moveChunk performance?

hi,

We are using MongoDB 4.4.2 on Ubuntu 18.04 (XFS filesystems, 36 bare-metal servers):

We have had to adopt a scheme where the balancer is disabled permanently and a new collection is created every week (holding that week’s data). We then pre-split its initial chunk into 3600 chunks and pre-migrate those 3600 chunks evenly across our 90 shards using the “moveChunk” command. Only when this is complete can we safely start using the collection.
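
For reference, the weekly preparation is roughly equivalent to the sketch below, run through a mongos (the 0-3600 key range and the round-robin placement are illustrative, and “_waitForDelete” is my assumption for the “waitForDelete”: true that shows up in the logs further down):

    // Rough sketch of the weekly pre-split / pre-migrate step (illustrative values).
    // The balancer has already been stopped permanently with sh.stopBalancer().
    var ns = "cdrarch.cdr_icx_isup_20201220";
    var shards = db.getSiblingDB("config").shards.find().map(function (s) { return s._id; });

    sh.shardCollection(ns, { SHARD_MINSEC: 1 });

    // Pre-split the single initial chunk into 3600 chunks.
    for (var i = 1; i < 3600; i++) {
        sh.splitAt(ns, { SHARD_MINSEC: i });
    }

    // Pre-migrate the chunks round-robin across the 90 shards.
    for (var i = 0; i < 3600; i++) {
        db.adminCommand({
            moveChunk: ns,
            find: { SHARD_MINSEC: i },          // a point inside chunk [i, i+1)
            to: shards[i % shards.length],
            _waitForDelete: true                // assumed; matches "waitForDelete": true in the logs
        });
    }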

This rarely goes fast (around 1 chunk movement per second); typically it takes 10-20 seconds per chunk, even on a fully idle cluster. At 20 seconds x 3600 chunks, that means … 20 hours.

The only information about what is happening can be found in the logs (the examples below are for the collection “cdr_icx_isup_20201220” in the database “cdrarch”):

  • logs of the config server:

{"t":{"$date":"2021-01-28T13:57:52.235+01:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn82106","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_configsvrMoveChunk":1,"ns":"cdrarch.cdr_icx_isup_20201220","min":{"SHARD_MINSEC":67.0},"max":{"SHARD_MINSEC":68.0},"shard":"db_rs030","lastmod":{"$timestamp":{"t":836,"i":1}},"lastmodEpoch": {"$oid":"6012685e0a3dbbba22c99476"},"toShard":"db_rs068","maxChunkSizeBytes":67108864,"secondaryThrottle":{},"waitForDelete":true,"forceJumbo":false,"writeConcern":{"w":"majority","wtimeout":15000},"lsid":{"id":{"$uuid":"0addce49-3078-42ed-bef6-32016ea86c73"},"uid":{"$binary":{"base64":"YtJ8CVGJPpojGlBhlVfpmkB+TWiGCwPUvkGEjp5tty0=","subType":"0"}}},"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1611838654,"i":5}},"signature":{"hash":{"$binary":{"base64":"mUIAh6c1xe1e5RmLCS2d02ecNgI=","subType":"0"}},"keyId":6921625911045390337}},"$audit":{"$impersonatedUsers":[{"user":"mongo-admin","db":"admin"}],"$impersonatedRoles":[{"role":"root","db":"admin"}]},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.4.2"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ndecar01:30004","client":"10.100.22.4:49376","version":"4.4.2"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1611838654,"i":4}},"t":41}},"$db":"admin"},"numYields":0,"reslen":537,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":3}},"ReplicationStateTransition":{"acquireCount":{"w":5}},"Global":{"acquireCount":{"r":2,"w":3}},"Database":{"acquireCount":{"r":2,"w":3}},"Collection":{"acquireCount":{"r":2,"w":3}},"Mutex":{"acquireCount":{"r":5}}},"flowControl":{"acquireCount":3,"timeAcquiringMicros":3},"writeConcern":{"w":"majority","wtimeout":15000,"provenance":"clientSupplied"},"storage":{},"protocol":"op_msg","durationMillis":17631}}

(note the last field: “durationMillis” is 17631, i.e. roughly 17.6 seconds)

  • logs on a source shard: I see many successive (repeating) log entries about data transfer progress, in which “docsRemainingToClone” is always zero (as expected, because this collection is new and thus still entirely empty; so why these repetitive, seemingly useless ‘data transfers’?):
    {"t":{"$date":"2021-01-28T14:34:50.150+01:00"},"s":"I", "c":"SHARDING", "id":21993, "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"db_rs030_db_rs047_6012bd570a3dbbba22e1ec76","ns":"cdrarch.cdr_icx_isup_20201220","from":"db_rs030/ndecar12:30303,ndecar14:30301,ndemec01:30302","fromShardId":"db_rs030","min":{"SHARD_MINSEC":46.0},"max":{"SHARD_MINSEC":47.0},"shardKeyPattern":{"SHARD_MINSEC":1.0},"state":"ready","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1.0,"$gleStats":{"lastOpTime":{"ts":{"$timestamp":{"t":1611840856,"i":2}},"t":1},"electionId":{"$oid":"7fffffff0000000000000001"}},"lastCommittedOpTime":{"$timestamp":{"t":1611840856,"i":2}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1611840887,"i":32}},"t":41}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1611840888,"i":62}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1611840888,"i":1}}},"memoryUsedBytes":0,"docsRemainingToClone":0}}

Thanks in advance for any suggestions!
Rob

An example where the chunk movement went fast (approximately 1 second, for the chunk covering shard-key range 579-580):

  • the chunk donation starts at 14:51:30.876:
    {"t":{"$date":"2021-01-28T14:51:30.876+01:00"},"s":"I", "c":"MIGRATE", "id":22016, "ctx":"MoveChunk","msg":"Starting chunk migration donation","attr":{"requestParameters":"ns: cdrarch.cdr_icx_isup_20201220, [{ SHARD_MINSEC: 579.0 }, { SHARD_MINSEC: 580.0 }), fromShard: db_rs030, toShard: db_rs040","collectionEpoch":{"$oid":"6012685e0a3dbbba22c99476"}}}
  • the moveChunk.start changelog event is logged at 14:51:30.882:
    {"t":{"$date":"2021-01-28T14:51:30.882+01:00"},"s":"I", "c":"SHARDING", "id":22080, "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ndecar14:30301-2021-01-28T14:51:30.882+01:00-6012c1620a3dbbba22e39230","server":"ndecar14:30301","shard":"db_rs030","clientAddr":"","time":{"$date":"2021-01-28T13:51:30.882Z"},"what":"moveChunk.start","ns":"cdrarch.cdr_icx_isup_20201220","details":{"min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0},"from":"db_rs030","to":"db_rs040"}}}}
  • the data transfer (no documents exist in the collection, so it should finish quickly) starts at 14:51:31.303:
    {"t":{"$date":"2021-01-28T14:51:31.303+01:00"},"s":"I", "c":"SHARDING", "id":21993, "ctx":"MoveChunk","msg":"moveChunk data transfer progress","attr":{"response":{"waited":true,"active":true,"sessionId":"db_rs030_db_rs040_6012c1620a3dbbba22e39231","ns":"cdrarch.cdr_icx_isup_20201220","from":"db_rs030/ndecar12:30303,ndecar14:30301,ndemec01:30302","fromShardId":"db_rs030","min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0},"shardKeyPattern":{"SHARD_MINSEC":1.0},"state":"steady","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0},"ok":1.0,"$gleStats":{"lastOpTime":{"ts":{"$timestamp":{"t":1611841888,"i":24}},"t":21},"electionId":{"$oid":"7fffffff0000000000000015"}},"lastCommittedOpTime":{"$timestamp":{"t":1611841891,"i":3}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1611841890,"i":66}},"t":41}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1611841891,"i":3}},"signature":{"hash":{"$binary":{"base64":"AAAAAAAAAAAAAAAAAAAAAAAAAAA=","subType":"0"}},"keyId":0}},"operationTime":{"$timestamp":{"t":1611841891,"i":3}}},"memoryUsedBytes":0,"docsRemainingToClone":0}}
  • only one data-transfer progress entry, as expected; the range deletion is submitted right after it, at 14:51:31.424:
    {"t":{"$date":"2021-01-28T14:51:31.424+01:00"},"s":"I", "c":"MIGRATE", "id":22026, "ctx":"range-deleter","msg":"Submitting range deletion task","attr":{"deletionTask":{"_id":{"$uuid":"382b9f30-c249-4b6f-8361-e258353eb532"},"nss":"cdrarch.cdr_icx_isup_20201220","collectionUuid":{"$uuid":"ac286098-d15b-4797-aaba-a399c3d7757b"},"donorShardId":"db_rs030","range":{"min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0}},"whenToClean":"now"},"migrationId":{"uuid":{"$uuid":"382b9f30-c249-4b6f-8361-e258353eb532"}}}}
  • the deletion itself appears to be scheduled, at 14:51:31.424:
    {"t":{"$date":"2021-01-28T14:51:31.424+01:00"},"s":"I", "c":"MIGRATE", "id":21990, "ctx":"range-deleter","msg":"Scheduling deletion of the collection's specified range","attr":{"namespace":"cdrarch.cdr_icx_isup_20201220","range":"[{ SHARD_MINSEC: 579.0 }, { SHARD_MINSEC: 580.0 })"}}
  • the moveChunk.commit changelog event, at 14:51:31.424:
    {"t":{"$date":"2021-01-28T14:51:31.424+01:00"},"s":"I", "c":"SHARDING", "id":22080, "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ndecar14:30301-2021-01-28T14:51:31.424+01:00-6012c1630a3dbbba22e392be","server":"ndecar14:30301","shard":"db_rs030","clientAddr":"","time":{"$date":"2021-01-28T13:51:31.424Z"},"what":"moveChunk.commit","ns":"cdrarch.cdr_icx_isup_20201220","details":{"min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0},"from":"db_rs030","to":"db_rs040","counts":{"cloned":0,"clonedBytes":0,"catchup":0,"steady":0}}}}}
  • a wait for migration cleanup after the chunk commit, at 14:51:31.428:
    {"t":{"$date":"2021-01-28T14:51:31.428+01:00"},"s":"I", "c":"MIGRATE", "id":22019, "ctx":"MoveChunk","msg":"Waiting for migration cleanup after chunk commit","attr":{"namespace":"cdrarch.cdr_icx_isup_20201220","range":"[{ SHARD_MINSEC: 579.0 }, { SHARD_MINSEC: 580.0 })","migrationId":{"uuid":{"$uuid":"382b9f30-c249-4b6f-8361-e258353eb532"}}}}
  • the moveChunk.from changelog event with the per-step timings, at 14:51:31.761:
    {"t":{"$date":"2021-01-28T14:51:31.761+01:00"},"s":"I", "c":"SHARDING", "id":22080, "ctx":"MoveChunk","msg":"About to log metadata event","attr":{"namespace":"changelog","event":{"_id":"ndecar14:30301-2021-01-28T14:51:31.761+01:00-6012c1630a3dbbba22e392e1","server":"ndecar14:30301","shard":"db_rs030","clientAddr":"","time":{"$date":"2021-01-28T13:51:31.761Z"},"what":"moveChunk.from","ns":"cdrarch.cdr_icx_isup_20201220","details":{"min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0},"step 1 of 6":0,"step 2 of 6":6,"step 3 of 6":387,"step 4 of 6":33,"step 5 of 6":31,"step 6 of 6":426,"to":"db_rs040","from":"db_rs030","note":"success"}}}}
  • even so, the moveChunk command as a whole is reported as a slow query (“durationMillis”: 899), logged at 14:51:31.770:
    {"t":{"$date":"2021-01-28T14:51:31.770+01:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn940","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"moveChunk":"cdrarch.cdr_icx_isup_20201220","shardVersion":[{"$timestamp":{"t":1962,"i":1}},{"$oid":"6012685e0a3dbbba22c99476"}],"epoch":{"$oid":"6012685e0a3dbbba22c99476"},"configdb":"configReplSet/ndecar01:30001,ndecar02:30002,ndemec01:30003","fromShard":"db_rs030","toShard":"db_rs040","min":{"SHARD_MINSEC":579.0},"max":{"SHARD_MINSEC":580.0},"maxChunkSizeBytes":67108864,"waitForDelete":true,"forceJumbo":0,"takeDistLock":false,"writeConcern":{},"lsid":{"id":{"$uuid":"0addce49-3078-42ed-bef6-32016ea86c73"},"uid":{"$binary":{"base64":"YtJ8CVGJPpojGlBhlVfpmkB+TWiGCwPUvkGEjp5tty0=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1611841890,"i":65}},"signature":{"hash":{"$binary":{"base64":"qi1oG+Z170E479Tzh38zpe9KbVk=","subType":"0"}},"keyId":6921625911045390337}},"$audit":{"$impersonatedUsers":[{"user":"mongo-admin","db":"admin"}],"$impersonatedRoles":[{"role":"root","db":"admin"}]},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.4.2"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"ndecar01:30004","client":"10.100.22.4:49376","version":"4.4.2"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1611841890,"i":65}},"t":41}},"$db":"admin"},"numYields":0,"reslen":333,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}}},"protocol":"op_msg","durationMillis":899}}

So the main difference seems to be that the data-transfer progress check happens only once here, whereas in the (unfortunately many) other cases it is repeated tens of times, obviously slowing down the whole process.
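
For reference, the per-step timings can also be pulled straight from the config database’s changelog (the same “moveChunk.from” event shown above), roughly like this:

    // Rough sketch: per-step timings of recent migrations of this collection,
    // taken from config.changelog (the "moveChunk.from" events shown above).
    // Can be run through a mongos; the 20-entry limit is arbitrary.
    db.getSiblingDB("config").changelog.find(
        { what: "moveChunk.from", ns: "cdrarch.cdr_icx_isup_20201220" },
        { _id: 0, time: 1, "details.from": 1, "details.to": 1, "details.min": 1,
          "details.note": 1,
          "details.step 1 of 6": 1, "details.step 2 of 6": 1, "details.step 3 of 6": 1,
          "details.step 4 of 6": 1, "details.step 5 of 6": 1, "details.step 6 of 6": 1 }
    ).sort({ time: -1 }).limit(20)

In the fast example above, those six steps sum to 883 ms (0 + 6 + 387 + 33 + 31 + 426), which is consistent with the 899 ms “durationMillis” of the moveChunk command itself.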