Chunk movements stopped going quickly

hi gurus,

  • we were using v4.2.8 and have now upgraded to 4.4.0
  • we disabled the automatic chunk balancer
  • we pre-create collections, split them into 3600 chunks, then do manual chunk movements across 36 shards (each a replica set of 3 servers)

We had a situation where manual chunk movements were started while data loading was busy at the same time. The MongoDB cluster ground to a halt…
So we interrupted the “mongo” client from which the chunk movements were initiated.

Now, with a completely idle cluster (no data loading, nada happening on it), we restarted those chunk movements on an entirely new, thus empty (zero docs), collection. And it goes slowly! One chunk moved every 15 mins or so.
And no, the servers underneath are not busy doing other things. It’s dedicated hardware (12 servers), local storage, plenty of RAM. None of their resources are exhausted. Network is 1 Gbps NICs, ping round-trips are sub-millisecond.

How can we find the cause of those super-slow chunk movements? Sure, we can raise the logging verbosity to any desired level, but then the log contains tons of messages about things going well… so that is not of great help unless we know in advance where to look for ‘the’ problem.
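(Side note, in case it helps: instead of raising the global verbosity, the verbosity can be bumped for just the sharding component, so the noise from healthy components stays at level 0. A sketch, to be run against the shard primaries involved in the migration:)

```javascript
// Raise only the SHARDING component to debug level 2 on this node;
// all other components keep their current verbosity.
db.adminCommand({
  setParameter: 1,
  logComponentVerbosity: { sharding: { verbosity: 2 } }
})

// Put it back to the default afterwards:
db.adminCommand({
  setParameter: 1,
  logComponentVerbosity: { sharding: { verbosity: 0 } }
})
```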

E.g. the messages we get in the “mongod.log” of the primary server of the receiving shard look like this (one every second):

{"t":{"$date":"2020-08-27T10:44:02.290+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1258","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_recvChunkStatus":"cdrarch.cdr_af_20200830","waitForSteadyOrDone":true,"sessionId":"db_rs002_db_rs035_5f476ff3da305f92f9c94d31","$clusterTime":{"clusterTime":{"$timestamp":{"t":1598517840,"i":2}},"signature":{"hash":{"$binary":{"base64":"ih0J82H8GKLsFMBDFj7Xoa0DJho=","subType":"0"}},"keyId":6821045628173287455}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1598517840,"i":1}},"t":9}},"$db":"admin"},"numYields":0,"reslen":734,"locks":{},"protocol":"op_msg","durationMillis":1001}}
{"t":{"$date":"2020-08-27T10:44:03.294+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn1258","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"_recvChunkStatus":"cdrarch.cdr_af_20200830","waitForSteadyOrDone":true,"sessionId":"db_rs002_db_rs035_5f476ff3da305f92f9c94d31","$clusterTime":{"clusterTime":{"$timestamp":{"t":1598517841,"i":1}},"signature":{"hash":{"$binary":{"base64":"rd++AdPb36F7OEENJuunnvxzLEA=","subType":"0"}},"keyId":6821045628173287455}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1598517841,"i":1}},"t":9}},"$db":"admin"},"numYields":0,"reslen":734,"locks":{},"protocol":"op_msg","durationMillis":1001}}

and so on.
These lines appear because of the “slow query” logging. What is the query then? I see “_recvChunkStatus”, whose value is the name of the collection (“cdrarch.cdr_af_20200830”).
The sending replicaset/shard is named “db_rs002”, the receiving replicaset/shard is named “db_rs035”.
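To get more visibility while a migration is in flight, one option (a sketch, using the collection name from the logs above) is to look at the live operation on the donor shard’s primary, and at the migration history the config servers keep in config.changelog:

```javascript
// On the donor shard's primary: is a moveChunk running right now,
// and for how long? (see secs_running in the currentOp output)
db.currentOp({ "command.moveChunk": { $exists: true } })

// On a mongos: the config servers record every migration step
// (moveChunk.start, moveChunk.commit, moveChunk.from, ...) with timings.
db.getSiblingDB("config").changelog
  .find({ what: /moveChunk/, ns: "cdrarch.cdr_af_20200830" })
  .sort({ time: -1 })
  .limit(10)
```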

Any help to get more visibility on what’s going slow under the hood would be greatly appreciated !

thx in advance !


I tried to find any documentation about how chunk movements can be traced at any verbosity, but this seems to be documented nowhere.
The chunk migrations (empty chunks, zero documents in them) of newly created collections still go horribly slowly, nearly ground to a halt (one migration every 15 mins or so).
Still, there is zero other activity on my cluster, we cannot use it until those new collections are ready with their chunks on the right shard servers.
MongoDB server version is 4.4.0

The logs of the router servers show:

{"t":{"$date":"2020-08-31T06:25:09.557+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"_KEYHOLE_88800.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847907,"i":4}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"7ks/HUkgj+LvEKYINCIfalrKvVU=","subType":"0"}}}},"$db":"_KEYHOLE_88800"},"numYields":0,"reslen":11324,"protocol":"op_query","durationMillis":1216}}
{"t":{"$date":"2020-08-31T06:25:09.794+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847908,"i":1}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"n4DO3e0Xx2Fv0taEA1k7CqkR97U=","subType":"0"}}}},"$db":"admin"},"numYields":0,"reslen":11630,"protocol":"op_query","durationMillis":234}}
{"t":{"$date":"2020-08-31T06:25:12.726+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"cdrarch.$cmd","command":{"dbStats":1 ,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847909,"i":3}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"RaCGq/SXVR7wYB7Nb0L/G+R2OfI=","subType":"0"}}}},"$db":"cdrarch"},"numYields":0,"reslen":11718,"protocol":"op_query","durationMillis":2930}}
{"t":{"$date":"2020-08-31T06:25:14.738+02:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"Uptime-reporter","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
{"t":{"$date":"2020-08-31T06:25:15.631+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"config.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847912,"i":1}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"XJgwP/q5znnKbcpFEEaobUrBeF4=","subType":"0"}}}},"$db":"config"},"numYields":0,"reslen":11666,"protocol":"op_query","durationMillis":2902}}
{"t":{"$date":"2020-08-31T06:25:16.460+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn357","msg":"Slow query","attr":{"type":"command","ns":"_KEYHOLE_88800.$cmd","command":{"dbStats":1,"lsid":{"id":{"$uuid":"b8998ee1-a8d9-46be-ba84-d0d1c40b19bd"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847915,"i":2}},"signature":{"keyId":6821045628173287455,"hash":{"$binary":{"base64":"7M14WAvT4WlyqOnFwnuTbCiPLdQ=","subType":"0"}}}},"$db":"_KEYHOLE_88800"},"numYields":0,"reslen":11324,"protocol":"op_query","durationMillis":120}}
{"t":{"$date":"2020-08-31T06:25:24.746+02:00"},"s":"I",  "c":"SHARDING", "id":20997,   "ctx":"Uptime-reporter","msg":"Refreshed RWC defaults","attr":{"newDefaults":{}}}
{"t":{"$date":"2020-08-31T06:35:19.076+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn142636","msg":"Slow query","attr":{"type":"command","ns":"cdrarch.cdr_af_20200830","appName":"MongoDB Shell","command":{"moveChunk":"cdrarch.cdr_af_20200830","find":{"SHARD_MINSEC":608.0},"to":"db_rs033","_waitForDelete":true,"lsid":{"id":{"$uuid":"f734929b-2bc9-4c58-ab1b-a5027730cb7e"}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1598847531,"i":2}},"signature":{"hash":{"$binary":{"base64":"AvmYRG2AnH4Ui6mH7I22bm4c64w=","subType":"0"}},"keyId":6821045628173287455}},"$db":"admin"},"numYields":0,"reslen":175,"protocol":"op_msg","durationMillis":987521}}
{"t":{"$date":"2020-08-31T06:35:19.485+02:00"},"s":"I",  "c":"SH_REFR",  "id":24104,   "ctx":"ConfigServerCatalogCacheLoader-1495","msg":"Refreshed cached collection","attr":{"namespace":"cdrarch.cdr_af_20200830","newVersion":{"0":{"$timestamp":{"t":3910,"i":1}},"1":{"$oid":"5f46d8b6da305f92f9b46bdf"}},"oldVersion":" from version 3909|1||5f46d8b6da305f92f9b46bdf","durationMillis":7}}

So you can see that the “moveChunk” command took nearly 1000 seconds (durationMillis: 987521).

In many of my shard servers, I see the following WiredTiger errors in the log:

{"t":{"$date":"2020-08-29T22:05:27.033+02:00"},"s":"E",  "c":"STORAGE",  "id":22435,   "ctx":"thread1","msg":"WiredTiger error","attr":{"error":95,"message":"[1598731527:33588][86:0x7fa900293700], log-server: __posix_std_fallocate, 58: /var/lib/mongodb/journal/WiredTigerTmplog.0000000007: fallocate:: Operation not supported"}}

Are these “fallocate” errors harmless or not ?

I can see that this file does exist:

# pwd
# ls -ltr
total 20560
-rw------- 1 mongodb mongodb 104857600 Aug 30 15:59 WiredTigerPreplog.0000000007
-rw------- 1 mongodb mongodb 104857600 Sep  1 09:52 WiredTigerPreplog.0000000010
-rw------- 1 mongodb mongodb 104857600 Sep  1 17:45 WiredTigerLog.0000000997
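A quick way to answer the fallocate question directly is to probe the journal filesystem from a shell. A sketch; the directory path is our journal directory, adjust it to your dbPath:

```shell
# Probe whether a filesystem supports fallocate(2) by pre-allocating
# a small scratch file, then cleaning it up again.
check_fallocate() {
  dir="$1"
  if fallocate -l 1M "$dir/.fallocate_probe" 2>/dev/null; then
    rm -f "$dir/.fallocate_probe"
    echo "fallocate supported on $dir"
  else
    echo "fallocate NOT supported on $dir"
  fi
}

check_fallocate /var/lib/mongodb/journal
```

On ZFS this probe is expected to report NOT supported, which matches the `__posix_std_fallocate` errors above.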

Hi @Rob_De_Langhe

What are the vital statistics: OS, MongoDB version? And what is the filesystem type of /var/lib/mongodb/journal?

df -T /var/lib/mongodb/journal

hi Chris,
thx a lot for your feedback.
The servers run Ubuntu 18, MongoDB 4.4.0, all inside LXC containers, 36 shards of 3 servers each.
The servers are nearly idling (some 10-15% busy CPU, several 10’s of GB RAM free).
The filesystem underneath the LXC containers (including their MongoDB directories) is ZFS.
I read about an issue with ZFS and the “fallocate” system call used to pre-allocate big files like those journal files.
So I am now gradually converting my servers away from ZFS and towards LVM/mdadm, very unfortunately, because we will miss the flexibility and ease of working with ZFS.
But I still suspect that the chunk inventory on my config servers is somehow corrupted, and no longer being updated efficiently with new moves (hence the dramatic delays, I think). Also, new move requests take a very long time and then seem to be suddenly cancelled or skipped: they terminate without any message but still are not executed (as shown by “sh.status()”).
Is this assumption possible? Is there some check that I could run to verify that all internal collections (like the chunk locations) are still consistent?
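As a partial sanity check (a sketch, not a full consistency checker), the chunk metadata can at least be counted per shard straight from the config database and compared with what sh.status() prints:

```javascript
// On a mongos: count chunks per shard for one collection, directly
// from the config metadata (in 4.4, config.chunks is keyed by "ns").
db.getSiblingDB("config").chunks.aggregate([
  { $match: { ns: "cdrarch.cdr_af_20200830" } },
  { $group: { _id: "$shard", chunkCount: { $sum: 1 } } },
  { $sort: { chunkCount: -1 } }
])
```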

thx again for any further feedbacks, mostly appreciated !

Hi Rob,

Have you made any progress since your last post?

I’m facing the same kind of problem (on a lighter cluster: 2 replicated shards).
Due to massive deletes and maybe a “bad” shard key, one of the shards was full of empty chunks.
After a merging procedure, the balancer started its job to balance the 2 shards.
For 24h, the moving time of one (~32->64 MB) chunk was not quick, but acceptable (something like 30").
Suddenly, in a couple of hours, the performance degraded, and now we need thousands of seconds to move a single chunk.
Freeing the cluster from its heavy RW load doesn’t change anything.

The only related traces are the “_recvChunkStatus” messages that you see too.
We don’t see “fallocate” errors.

hi Franck, Chris,

still facing the same issue: chunk migrations run terribly slowly, so slowly that they do not complete entirely, and thus we cannot start loading data into MongoDB because the chunks are still all on a single shard…

Performance-wise, the underlying physical servers are nearly fully idle.
We migrated all the filesystems away from ZFS and towards BTRFS, so ZFS (which is not supported by MongoDB) should no longer be a possible cause.
No bad or suspicious logs are reported; it is just terribly slow (30 mins or more for 1 chunk).

After some hours, the shell session gets disconnected (busy, or idle, for too long). I re-launch the same chunk movements, and the same chunks that were reported as moved during the previous attempt are being moved again?!

-> how can we check what is happening during the chunk movements, which is delaying the progress?

By the way: everything is now running version 4.4.1

Typical log messages on our config servers (a replica set of 3 servers) report slow queries, but I am not sure whether that is the cause or a consequence of the problem:

(the node name “mongop_rtr1” is one of our router servers)

{"t":{"$date":"2020-10-02T02:59:27.634+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn3845","msg":"Slow query","attr":{"type":"command","ns":"config.$cmd","command":{"update":"mongos","bypassDocumentValidation":false,"ordered":true,"updates":[{"q":{"_id":"mongop_rtr1:27017"},"u":{"$set":{"_id":"mongop_rtr1:27017","ping":{"$date":"2020-10-02T00:59:27.533Z"},"up":6361,"waiting":true,"mongoVersion":"4.4.1","advisoryHostFQDNs":[""]}},"multi":false,"upsert":true}],"writeConcern":{"w":"majority","wtimeout":60000},"maxTimeMS":30000,"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1601600365,"i":1}},"signature":{"hash":{"$binary":{"base64":"dlGdgrHNtIx9THiJXv/LoISsF64=","subType":"0"}},"keyId":6821045628173287455}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1601600365,"i":1}},"t":12}},"$db":"config"},"numYields":0,"reslen":619,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":3}},"ReplicationStateTransition":{"acquireCount":{"w":3}},"Global":{"acquireCount":{"r":2,"w":1}},"Database":{"acquireCount":{"r":1,"w":1}},"Collection":{"acquireCount":{"w":1}},"Mutex":{"acquireCount":{"r":2}},"oplog":{"acquireCount":{"r":1}}},"flowControl":{"acquireCount":1,"timeAcquiringMicros":1},"writeConcern":{"w":"majority","wtimeout":60000},"storage":{},"protocol":"op_msg","durationMillis":100}}

Another one from the logs of config server that might be relevant (or not, who can tell?) :

{"t":{"$date":"2020-10-02T12:14:25.743+02:00"},"s":"I",  "c":"SHARDING", "id":4608400, "ctx":"PeriodicShardedIndexConsistencyChecker","msg":"Skipping sharded index consistency check because feature compatibility version is not fully upgraded"}

Checking further on this “feature compatibility version” log, I ran the following command against the primary server of the config replica set:

db.adminCommand( { getParameter: 1, featureCompatibilityVersion: 1 } )

which returned me the following output:

connecting to: mongodb://mongop_cfg1:27019/cdrarch?authSource=admin&compressors=disabled&gssapiServiceName=mongodb
Implicit session: session { "id" : UUID("6fa9ef61-2a3e-4c4e-b50f-616ab9631119") }
MongoDB server version: 4.4.1
{
    "featureCompatibilityVersion" : {
            "version" : "4.2"
    },
    "ok" : 1,
    "$gleStats" : {
            "lastOpTime" : Timestamp(0, 0),
            "electionId" : ObjectId("7fffffff000000000000000c")
    },
    "lastCommittedOpTime" : Timestamp(1601634848, 67),
    "$clusterTime" : {
            "clusterTime" : Timestamp(1601634848, 67),
            "signature" : {
                    "hash" : BinData(0,"7JA+LfjqaFLdPtYykM/0jUkVCUY="),
                    "keyId" : NumberLong("6821045628173287455")
            }
    },
    "operationTime" : Timestamp(1601634848, 67)
}

Notice this version=4.2; is this as expected, given that we are running MongoDB binaries of version 4.4.1?
I have run this same command against the primaries of all my replica sets, and they all reply with version=4.2, while running MongoDB binaries of version 4.4.1.
-> is that OK, or not?

I raised the log verbosity for the “QUERY” component from level 0 to 1 on the config servers, and got very many of these logs:

{"t":{"$date":"2020-10-02T13:04:54.533+02:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn463","msg":"Received interrupt request for unknown op","attr":{"opId":7396919}}
{"t":{"$date":"2020-10-02T13:04:54.615+02:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn317","msg":"Received interrupt request for unknown op","attr":{"opId":7396921}}
{"t":{"$date":"2020-10-02T13:04:54.623+02:00"},"s":"D1", "c":"QUERY",    "id":22790,   "ctx":"conn20","msg":"Received interrupt request for unknown op","attr":{"opId":7396923}}

-> normal, or not ?

And finally, the report of the terribly slow queries for the chunk moves:

{"t":{"$date":"2020-10-02T13:07:26.576+02:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn3834","msg":"Slow query","attr":{"type":"command","ns":"admin.$cmd","appName":"MongoDB Shell","command":{"_configsvrMoveChunk":1,"ns":"cdrarch.cdr_mobi_20200830","min":{"SHARD_MINSEC":2015.0},"max":{"SHARD_MINSEC":2016.0},"shard":"db_rs002","lastmod":{"$timestamp":{"t":3692,"i":1}},"lastmodEpoch":{"$oid":"5f53fc15da305f92f9cee0be"},"toShard":"db_rs036","maxChunkSizeBytes":67108864,"secondaryThrottle":{},"waitForDelete":true,"forceJumbo":false,"writeConcern":{"w":"majority","wtimeout":15000},"lsid":{"id":{"$uuid":"236d3eab-d1d8-4757-a5f7-0bea4c46f8a3"},"uid":{"$binary":{"base64":"YtJ8CVGJPpojGlBhlVfpmkB+TWiGCwPUvkGEjp5tty0=","subType":"0"}}},"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1601635925,"i":17}},"signature":{"hash":{"$binary":{"base64":"dsDAd22QCBRQKtavuAirYaWi3bI=","subType":"0"}},"keyId":6821045628173287455}},"$audit":{"$impersonatedUsers":[{"user":"mongo-admin","db":"admin"}],"$impersonatedRoles":[{"role":"root","db":"admin"}]},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.4.1"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"mongop_rtr2:27017","client":"","version":"4.4.1"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1601635925,"i":17}},"t":12}},"$db":"admin"},"numYields":0,"reslen":537,"locks":{"ParallelBatchWriterMode":{"acquireCount":{"r":11}},"ReplicationStateTransition":{"acquireCount":{"w":14}},"Global":{"acquireCount":{"r":11,"w":3}},"Database":{"acquireCount":{"r":6,"w":3}},"Collection":{"acquireCount":{"r":2,"w":3}},"Mutex":{"acquireCount":{"r":9}},"oplog":{"acquireCount":{"r":4}}},"flowControl":{"acquireCount":3,"timeAcquiringMicros":3},"writeConcern":{"w":"majority","wtimeout":15000},"storage":{},"protocol":"op_msg","durationMillis":921059}}

And to confirm again that not a single document needs to be moved by these chunk movements: there are 0 (zero) docs in this collection:

 data : 0B docs : 0 chunks : 3601
 Shard db_rs028 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs034 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs036 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs002 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs031 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs029 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs030 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs035 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs033 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B
 Shard db_rs032 contains 0% data, 0% docs in cluster, avg obj size on shard : 0B

That’s fine. It looks like the FCV upgrade from 4.2 to 4.4 is still left to be done. This is the last step of an upgrade, and it is recommended to run in this mode for a while before performing that final operation:
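The final step would be, as a sketch (run against a mongos, and only once you are confident you will not need to downgrade, since this is hard to undo):

```javascript
// Finalize the upgrade by raising the feature compatibility version:
db.adminCommand({ setFeatureCompatibilityVersion: "4.4" })

// Verify that it took effect:
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 })
```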

XFS and ext4 are the recommended filesystems.

Check Keepalives:

  • Ensure that the system default TCP keepalive is set correctly. A value of 300 often provides better performance for replica sets and sharded clusters. See: Does TCP keepalive time affect MongoDB Deployments? in the Frequently Asked Questions for more information.
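On Linux hosts the keepalive setting can be checked (and lowered at runtime) like this; a sketch, with the persistent setting belonging in /etc/sysctl.conf:

```shell
# Show the current TCP keepalive idle time, in seconds (Linux default: 7200):
cat /proc/sys/net/ipv4/tcp_keepalive_time

# Lower it to 300 on the running system (needs root; not persistent across
# reboots -- persist it via net.ipv4.tcp_keepalive_time in /etc/sysctl.conf):
# sysctl -w net.ipv4.tcp_keepalive_time=300
```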


Reading back to your original deployment description: you have 36 shards of 3-node replica sets on 12 servers? That is 9 mongod instances per server, not to mention the 3 mongod for the config replica set.

thx Chris for your feedback, highly appreciated !

  1. I did indeed try applying the feature-compatibility version “4.4”, and things suddenly started to move in my chunk-movements progress window;
    however, when I checked the “sh.status()” output for that collection, the chunks were still not being moved.
    So I went looking at the logs of the primary config server, and I noticed messages like this:

    blabla slow query blabla "errMsg":"received command without explicit writeConcern on an internalClient connection { moveChunk: \"cdrarch.cdr_af_20200906\", blabla...

  2. this made me have a look at the write-concern options of the “moveChunk” command, and I changed my command options to this:
    db.adminCommand({moveChunk: "...", find:{...}, to:"...", _secondaryThrottle: true, writeConcern:{w:2}, _waitForDelete: true })

These two changes did the trick !

  1. yes, we run on average 8-9 mongod instances per server; the servers are a mixture of HP DL380 Gen8 and Gen9, typically with some 64-128 GB RAM and 6-12 HDDs of 1 or 2 TB each; the servers with more resources obviously run more mongod instances.
    CPU load is minimal; RAM usage is of course near 100% (as it should be); minimal swap usage but no swap activity (I think this is for files mapped in virtual memory).

Unless you are virtualising, the benefits from sharding may not eventuate. Even assuming you are manually configuring the WiredTiger cache appropriately, the mongod instances will experience contention on the filesystem cache, which would lead to more disk fetches.

Overall cpu/ram/network/io/filehandles etc are all going to be contended for.

hi again Chris,

many thanks for expressing your concerns :wink:

Unless you are virtualising, the benefits from sharding may not eventuate.

yes indeed, we are running all our mongo (router, config, shard) instances in individual LXC containers (Ubuntu 18).

Even assuming you are manually configuring the WiredTiger cache appropriately, the mongod instances will experience contention on the filesystem cache, which would lead to more disk fetches.

Each of the containers runs on its own (BTRFS) filesystem, so each filesystem has its own cache handling. We did indeed configure the WiredTiger cache explicitly for each mongod instance, e.g.:

storage:
  dbPath: /var/lib/mongodb
  journal:
    enabled: true
  wiredTiger:
    engineConfig:
      cacheSizeGB: 4
      configString: eviction=(threads_min=15,threads_max=15),eviction_dirty_trigger=2,eviction_dirty_target=1,eviction_trigger=90,eviction_target=80

Overall cpu/ram/network/io/filehandles etc are all going to be contended for.

With the separate containers, this is not an issue. All is segregated.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.