We run the community version 4.4.1 and split our data over weekly collections.
We create 4 new collections per week, and some 1200M records are stored in them every week.
Each collection is pre-split into 3600 chunks, which we move evenly over our 108 shards (of 3 shard servers each) while the collection is freshly created and thus entirely empty (no documents). The auto-balancer is of course disabled.
So per week we have 4x3600 = 14400 chunks.
We now have some 19 weeks of 4 collections/week, and the target is to keep 52 weeks of collections.
So at the moment we have some 19x14400 = 273K chunks.
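For reference, below is roughly how we create and distribute each week's empty collections. This is only a sketch: the namespace "weekdb.w50_c1", the shard key "sk", the shard names and the chunkBound() helper are placeholders for our real naming scheme.

```js
// Sketch: pre-split a freshly created, empty collection into 3600 chunks
// and spread them round-robin over the 108 shards. chunkBound(i) stands for
// our real function that computes the i-th shard-key boundary.
sh.shardCollection("weekdb.w50_c1", { sk: 1 });
for (let i = 1; i < 3600; i++) {
  sh.splitAt("weekdb.w50_c1", { sk: chunkBound(i) }); // 3599 splits -> 3600 chunks
}
for (let i = 0; i < 3600; i++) {
  db.adminCommand({
    moveChunk: "weekdb.w50_c1",
    find: { sk: chunkBound(i) },                      // a point inside chunk i
    to: "shard" + String(i % 108).padStart(3, "0")    // "shard000" ... "shard107"
  });
}
```

It is these moveChunk commands that have become slow.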
But while our chunk movements went very fast when we started with this setup, they now get increasingly slow, even when there is no other activity at all on our cluster (no loading, no queries):
Every chunk movement takes roughly 40 seconds.
When I check the logs of our config servers for some indication of this slowness, I see entries like the following:
{"t":{"$date":"2020-12-10T11:08:27.462+01:00"},"s":"I", "c":"COMMAND", "id":51803, "ctx":"conn1217790","msg":"Slow query",
"attr":{"type":"command","ns":"config.changelog","appName":"MongoDB Shell","command":{"aggregate":"changelog","pipeline":[{"$match":{"time":{"$gt":{"$date":"2020-12-09T10:08:26.775Z"}},
"what":"moveChunk.from","details.errmsg":{"$exists":false},"details.note":"success"}},{"$group":{"_id":{"msg":"$details.errmsg"},"count":{"$sum":1.0}}},{"$project":{"_id":{"$ifNull":["$_id.msg","Success"]},"count":"$count"}}],"fromMongos":false,"cursor":{"batchSize":101},"readConcern":{"provenance":"implicitDefault"},"writeConcern":{"w":1,"wtimeout":0,"provenance":"implicitDefault"},"lsid":{"id":{"$uuid":"5bad02ed-7a8d-419e-8b94-9750be30558d"},"uid":{"$binary":{"base64":"YtJ8CVGJPpojGlBhlVfpmkB+TWiGCwPUvkGEjp5tty0=","subType":"0"}}},"$clusterTime":{"clusterTime":{"$timestamp":{"t":1607594920,"i":1}},"signature":{"hash":{"$binary":{"base64":"F/KZAbai8D8Qcs/qQWKjLV0BTvM=","subType":"0"}},"keyId":6880911790037794833}},"$audit":{"$impersonatedUsers":[{"user":"mongo-admin", "db":"admin"}],"$impersonatedRoles":[{"role":"root","db":"admin"}]},"$client":{"application":{"name":"MongoDB Shell"},"driver":{"name":"MongoDB Internal Client","version":"4.4.1"},"os":{"type":"Linux","name":"Ubuntu","architecture":"x86_64","version":"18.04"},"mongos":{"host":"mongop_rtr1:27017","client":"10.100.22.100:36242","version":"4.4.1"}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1607594919,"i":55}},"t":7}},
"$db":"config"},
"planSummary":"COLLSCAN",
"keysExamined":0,
"docsExamined":473718,
"cursorExhausted":true,"numYields":473,"nreturned":1,"queryHash":"ABD7C887","planCacheKey":"5B43D47C","reslen":359,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":475}},"Global":{"acquireCount":{"r":475}},"Database":{"acquireCount":{"r":475}},"Collection":{"acquireCount":{"r":475}},"Mutex":{"acquireCount":{"r":2}}},"readConcern":{"provenance":"implicitDefault"},"writeConcern":{"w":1,"wtimeout":0,"provenance":"implicitDefault"},"storage":{},"protocol":"op_msg",
"durationMillis":685}}
In this entry I recognize the “moveChunk” actions that we initiate, the “config.changelog” namespace, a “COLLSCAN” (collection scan?), “docsExamined”=473718, and, most suspicious of all, “keysExamined”=0.
Finally there is “durationMillis”=685.
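To double-check that plan, I can reproduce the query myself with explain. A sketch, abbreviated to just the $match stage of the pipeline and run via our mongos as the same admin user:

```js
// Sketch: replay the changelog aggregation from the slow-query log with
// executionStats to see the winning plan (COLLSCAN) and docsExamined.
db.getSiblingDB("config").changelog.explain("executionStats").aggregate([
  { $match: {
      time: { $gt: new Date(Date.now() - 24 * 3600 * 1000) }, // last 24h, as in the log
      what: "moveChunk.from",
      "details.errmsg": { $exists: false },
      "details.note": "success"
  } }
]);
```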
Does this mean there is no optimal configuration for scanning this chunk-movement metadata in the config database?
“keysExamined”=0 sounds to me like no index is being used, if “keys” is to be interpreted as index keys.
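If “keys” indeed means index keys, then this is what I use to check which indexes exist on config.changelog and how much data such a COLLSCAN has to walk (as far as I understand, config.changelog is a capped collection into which the config servers write sharding events):

```js
// Sketch: inspect config.changelog directly.
const cl = db.getSiblingDB("config").changelog;
cl.getIndexes();    // which indexes exist on the changelog
cl.stats().count;   // number of documents a COLLSCAN must examine
cl.stats().maxSize; // size limit of the capped collection, in bytes
```

The count there is in the same ballpark as the “docsExamined”=473718 from the log above, so it looks like every such query walks the entire changelog.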
I can find no hint in the documentation or on this forum about how to check in detail for the causes of slow chunk movements.
Hoping that any of you can shed some light on this darkness.
thx in advance!
Rob