Migration failed - ConflictingOperationInProgress - Unable to start new balancer operation because this shard is currently donating chunk

Hello World.

We have a cluster of five replica sets (primary-secondary-arbiter architecture), plus 1 config replica set and 3 router nodes.

After performing a mongorestore of our sharded cluster, the data was transferred entirely to one replica set, so we had to restart the sharding from scratch.

Right at the beginning of the sharding operation, one migration got stuck. According to the output of sh.status(), it had been running for more than 24 hours on the same collection, which is not a huge collection (around 500k documents). We found no errors in the logs of the primary config node.

The changelog reported a chunkMove error several hours after the beginning of the migration, but with no details of the error.
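For anyone wanting to look at the same changelog entries, here is a minimal sketch of a query filter for migration events. It assumes that config.changelog documents carry `ns`, `what` and `time` fields and that migration events have a `what` value starting with `moveChunk.`; the helper name `moveChunkChangelogFilter` and the namespace `mydb.mycoll` are hypothetical.

```javascript
// Hypothetical helper: build a filter for migration-related entries in
// config.changelog. The field names (ns, what, time) and the "moveChunk."
// prefix are assumptions about the changelog document shape.
function moveChunkChangelogFilter(namespace, since) {
  return {
    ns: namespace,              // full namespace, e.g. "mydb.mycoll"
    what: /^moveChunk\./,       // any migration event for that namespace
    time: { $gte: since },      // only entries at or after this Date
  };
}

// Example filter for a made-up namespace and start date.
const changelogFilter = moveChunkChangelogFilter(
  "mydb.mycoll",
  new Date("2024-01-01T00:00:00Z")
);
console.log(changelogFilter);

// In mongosh, connected to a router, it could be used like this:
//   db.getSiblingDB("config").changelog.find(changelogFilter).sort({ time: -1 })
```

Sorting by `time` in descending order puts the most recent events first, which makes it easier to see what was logged last for the stuck migration.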

In an attempt to solve this, we cancelled the migration by finding its opId with db.currentOp(). The operation was terminated “successfully”, since it no longer appears in the output of db.currentOp(), but new migrations now fail with the following error:
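As a rough illustration of that lookup (not necessarily the exact commands we ran), the sketch below picks long-running migration operations out of a db.currentOp() result. It assumes the result has an `inprog` array whose entries expose `opid`, `secs_running` and a `command` document; the command key names it checks are an assumption, and the helper name `findMigrationOps` and the sample data are hypothetical.

```javascript
// Hypothetical helper: pick migration-related operations out of a
// db.currentOp() result. The command key names below are an assumption
// about how a migration appears in the output.
function findMigrationOps(currentOpResult, minSecsRunning = 0) {
  const migrationKeys = ["moveChunk", "_shardsvrMoveChunk", "_recvChunkStart"];
  return (currentOpResult.inprog || [])
    .filter((op) => op.command && migrationKeys.some((k) => k in op.command))
    .filter((op) => (op.secs_running || 0) >= minSecsRunning)
    .map((op) => ({ opid: op.opid, secs_running: op.secs_running || 0 }));
}

// Made-up sample shaped like a db.currentOp() result.
const sampleCurrentOp = {
  inprog: [
    { opid: 101, secs_running: 90000, command: { moveChunk: "mydb.mycoll" } },
    { opid: 102, secs_running: 3, command: { find: "mycoll" } },
  ],
};

// Keep only migration operations running for at least one hour.
console.log(findMigrationOps(sampleCurrentOp, 3600));

// In mongosh on the donor shard primary:
//   findMigrationOps(db.currentOp(), 3600)
```

The returned `opid` values are what db.killOp() expects, although, as this thread shows, killing the operation does not necessarily release the migration state on the shard.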

“Migration failed” - “ConflictingOperationInProgress: Unable to start new balancer operation because this shard is currently donating chunk”

Our assumption, from looking at the MongoDB code, is that the ActiveMigrationsRegistry is holding a lock, so _activeReceiveChunkState is true and _activeReceiveChunkState->constructErrorStatus() is executed when the registerDonateChunk method runs.

Check the following links:
https://github.com/mongodb/mongo/blob/3f849d508692c038afb643b1acb99b8a8cb98d38/src/mongo/db/s/move_chunk_command.cpp#L138
https://github.com/mongodb/mongo/blob/3f849d508692c038afb643b1acb99b8a8cb98d38/src/mongo/db/s/active_migrations_registry.h#L69

To try to fix this, we deleted the lock on the collection that was being transferred from the locks collection in the config db (on the primary config node) and then restarted the balancer. It didn’t work.

We also deleted the document related to this migration from config.migrationCoordinators on the recipient shard and then restarted the balancer. That didn’t work either.

Any idea how to fix this? Our next attempt will be to dump the collection, delete it, restart the balancer to see whether that unblocks the sharding, and then restore the collection.

How could we perform an unlock on the ActiveMigrationsRegistry? Would it solve the problem?

Thanks in advance!

Just to clarify, this error is caused by the migration of the chunk of the collection that got stuck, that is, the migration that we terminated using db.killOp('MigrationOpId').

The problem got solved on its own.

We noticed that db.problematic_collection.count() returned a different value from db.problematic_collection.countDocuments({}). The count() value was higher, and it was decreasing over time.

Today the count() value reached the same value as countDocuments({}), so we restarted the balancer to see whether anything had changed, and this time it didn’t get stuck on the ConflictingOperationInProgress error.

So maybe the problem was that MongoDB had to clear orphaned documents first? But why did the count() value drop so slowly? It took several days (2 or 3) to go from around 700k to 500k.
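To put a number on "slowly", here is a small sketch that turns two samples of the counts into an excess (count() minus countDocuments({})) and an average rate at which that excess shrinks. As far as we understand from the documentation, count() without a filter can include orphaned documents on a sharded cluster while countDocuments({}) does not, so the excess is only a rough proxy for orphans. The helper names, the sample dates and the 2.5-day interval are hypothetical; the counts are the approximate figures from this thread.

```javascript
// Hypothetical helpers: estimate how many extra documents count() reports
// compared with countDocuments({}), and how fast that excess shrinks.
function excessDocs(sample) {
  return sample.count - sample.countDocuments;
}

function cleanupRatePerSecond(first, last) {
  // Subtracting two Date objects yields milliseconds.
  const elapsedSeconds = (last.takenAt - first.takenAt) / 1000;
  if (elapsedSeconds <= 0) throw new Error("samples must be in time order");
  return (excessDocs(first) - excessDocs(last)) / elapsedSeconds;
}

// Illustrative samples: about 700k falling to about 500k over an assumed
// 2.5 days (the dates themselves are made up).
const firstSample = {
  takenAt: new Date("2024-01-01T00:00:00Z"),
  count: 700000,
  countDocuments: 500000,
};
const lastSample = {
  takenAt: new Date("2024-01-03T12:00:00Z"),
  count: 500000,
  countDocuments: 500000,
};

console.log(excessDocs(firstSample));
console.log(cleanupRatePerSecond(firstSample, lastSample).toFixed(2));

// In mongosh, a sample could be taken like this:
//   ({ takenAt: new Date(),
//      count: db.problematic_collection.count(),
//      countDocuments: db.problematic_collection.countDocuments({}) })
```

With these figures the excess is about 200k documents and the average rate works out to roughly one document per second, which is the slowness we are asking about.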

If someone could shed some light on this it would be much appreciated.

Thank you
