Performance Issue and uneven sharding.
Hello Team,
We have recently been experiencing a performance issue on the MongoDB cluster: any query drives CPU usage on the cluster to 100%, and results are returned only after several minutes.
I have also noted the following error on the mongos instance.
e are still 1 deletes from previous migration" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1 deletes from previous migration" }
2020-09-20T06:39:22.315+0000 I SHARDING [Balancer] balancer move failed: { cause: { ok: 0.0, errmsg: "can't accept new chunks because there are still 1 deletes from previous migration" }, ok: 0.0, errmsg: "moveChunk failed to engage TO-shard in the data transfer: can't accept new chunks because there are still 1 deletes from previous migration" } from: shard0001 to: shard0000 chunk: min: { files_id: ObjectId('55fbabfbe4b00ec933a81aa9') } max: { files_id: ObjectId('55fbac2ee4b00ec933a81b4a') }
Environment: MongoDB 3.0.4 with 1 mongos and 3 mongod instances, running on AWS EC2.
MongoDB 3.0.4 is more than 5 years old and the 3.0.x release series reached End of Life (EOL) in February 2018. There have been significant improvements in product architecture and performance since then.
I would start by upgrading your deployment to the final 3.0.15 release, as this includes critical fixes and does not introduce any compatibility issues or backward-breaking behaviour changes (see: Release Notes for MongoDB 3.0). I strongly recommend planning and testing the upgrade path to a supported release series (currently 3.6+).
“moveChunk failed to engage TO-shard in the data transfer: can’t accept new chunks because there are still 1 deletes from previous migration”
This log message just indicates that the requested destination shard is still catching up on deletes from a previous migration and is not yet ready to accept a new migration.
You’ll have to look into more of your system metrics and activity timeline to understand if (or why) your query performance might be impacted by balancing activity. One possibility is that your deployment has insufficient resources to keep up with your current workload.
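One way to start on an activity timeline is to pull the balancer failures out of the mongos log and see when they cluster, then compare those times with your slow periods. The following is only a rough sketch in Python: the regular expression is written against the single log line quoted in this thread, so other log formats or versions would need adjusting, and the function names and the abridged sample line are made up for illustration.

```python
import re
from collections import Counter
from datetime import datetime

# Matches the 3.0-style "balancer move failed" line quoted in this thread.
# This pattern is an assumption based on that one line only.
BALANCER_FAILURE = re.compile(
    r"^(?P<ts>\S+) I SHARDING\s+\[Balancer\] balancer move failed:"
    r".*?still (?P<deletes>\d+) deletes from previous migration"
    r".*\bfrom: (?P<src>\S+) to: (?P<dst>\S+)"
)


def parse_balancer_failures(lines):
    """Return one dict per balancer failure line; other lines are skipped."""
    events = []
    for line in lines:
        match = BALANCER_FAILURE.match(line)
        if match is None:
            continue
        events.append({
            "time": datetime.strptime(match.group("ts"), "%Y-%m-%dT%H:%M:%S.%f%z"),
            "pending_deletes": int(match.group("deletes")),
            "from_shard": match.group("src"),
            "to_shard": match.group("dst"),
        })
    return events


def failures_per_hour(events):
    """Count failures per (hour, destination shard) to show when they cluster."""
    return Counter(
        (event["time"].strftime("%Y-%m-%d %H:00"), event["to_shard"])
        for event in events
    )


# Abridged copy of the log line from the original post, for illustration only.
SAMPLE_LINE = (
    "2020-09-20T06:39:22.315+0000 I SHARDING [Balancer] balancer move failed: "
    "{ cause: { ok: 0.0, errmsg: \"can't accept new chunks because there are "
    "still 1 deletes from previous migration\" }, ok: 0.0 } "
    "from: shard0001 to: shard0000 chunk: min: { files_id: "
    "ObjectId('55fbabfbe4b00ec933a81aa9') }"
)

events = parse_balancer_failures([SAMPLE_LINE, "some unrelated log line"])
hourly = failures_per_hour(events)
```

In practice you would pass the lines of your mongos log file instead of SAMPLE_LINE. If the failures line up with the periods of high CPU, balancing activity is worth a closer look; if they do not, the cause is more likely the queries themselves or available resources.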
Have you tried explaining your slow queries to confirm efficient index usage? Queries that are poorly supported by indexes or doing additional processing (regular expressions, JavaScript) are common offenders.
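As a rough illustration of what to look for in explain output, here is a small Python sketch that walks an explain document and flags collection scans (COLLSCAN stages), which usually mean no index was used. It assumes the 3.0+ explain format with queryPlanner.winningPlan, nested inputStage/inputStages, and a shards array for sharded collections. The helper names and the sample document are hypothetical and not output from your cluster.

```python
def iter_stages(plan):
    """Yield every stage dict in an explain plan tree, depth first."""
    yield plan
    child = plan.get("inputStage")
    if child:
        yield from iter_stages(child)
    for child in plan.get("inputStages", []):
        yield from iter_stages(child)
    # On a sharded cluster, the top-level plan lists one winning plan per shard.
    for shard in plan.get("shards", []):
        yield from iter_stages(shard["winningPlan"])


def collection_scans(explain_output):
    """Return the COLLSCAN stages found in the winning plan, if any."""
    winning = explain_output["queryPlanner"]["winningPlan"]
    return [s for s in iter_stages(winning) if s.get("stage") == "COLLSCAN"]


def docs_examined_per_result(explain_output):
    """Ratio of documents examined to documents returned.

    Returns None when executionStats is absent or nothing was returned.
    A very large ratio suggests the query is poorly supported by indexes.
    """
    stats = explain_output.get("executionStats")
    if not stats or stats.get("nReturned", 0) == 0:
        return None
    return stats["totalDocsExamined"] / stats["nReturned"]


# Hypothetical explain document, hand-written for illustration only.
SAMPLE_EXPLAIN = {
    "queryPlanner": {
        "winningPlan": {
            "stage": "SHARD_MERGE",
            "shards": [
                {"shardName": "shard0000",
                 "winningPlan": {"stage": "COLLSCAN", "direction": "forward"}},
                {"shardName": "shard0001",
                 "winningPlan": {"stage": "FETCH",
                                 "inputStage": {"stage": "IXSCAN"}}},
            ],
        }
    },
    "executionStats": {"nReturned": 10, "totalDocsExamined": 500000},
}

scans = collection_scans(SAMPLE_EXPLAIN)
ratio = docs_examined_per_result(SAMPLE_EXPLAIN)
```

With PyMongo, calling explain() on a find() cursor should give you a document of this general shape to pass in, although whether executionStats is included depends on the explain verbosity used. In the mongo shell, db.collection.find(...).explain("executionStats") shows the same fields.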