Hello World.
We have a sharded cluster with five replica sets (primary-secondary-arbiter architecture), plus one config replica set and three router nodes.
After performing a mongorestore of our sharded cluster, all of the data ended up on a single replica set, so we had to restart the sharding from scratch.
Right at the beginning of the sharding operation, a migration got stuck. The output of sh.status() showed it running for more than 24h on the same collection, and it was not a huge collection (around 500k documents). We found no errors in the logs of the primary config node.
The changelog reported a chunk move error several hours after the beginning of the migration, but with no details about the error.
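For anyone who wants to look at the same data, here is a minimal sketch of how the chunk move entries can be pulled out of the changelog. It assumes the usual `config.changelog` fields (`what`, `ns`, `time`) and that migration events have a `what` value starting with `moveChunk.`; the helper name `moveChunkEvents` and the namespace `mydb.mycoll` are made up for illustration.

```javascript
// Hypothetical helper: keep only the chunk-migration events for one namespace,
// newest first. Assumes each changelog document has `what`, `ns`, and `time`
// (a Date), and that migration events use a "moveChunk." prefix in `what`.
function moveChunkEvents(entries, ns) {
  return entries
    .filter((e) => e.ns === ns)
    .filter((e) => typeof e.what === "string" && e.what.startsWith("moveChunk."))
    .sort((a, b) => b.time - a.time);
}
```

In the shell this could be fed with something like `moveChunkEvents(db.getSiblingDB("config").changelog.find({ ns: "mydb.mycoll" }).toArray(), "mydb.mycoll")`, run against the config database.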
In an attempt to solve this, we cancelled the migration operation after finding its opId with db.currentOp(). The operation was terminated “successfully”, since it no longer appears in the output of db.currentOp(), but new migrations now fail with the following error:
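To make the step concrete, this is roughly how a stuck migration can be picked out of the db.currentOp() output. It is only a sketch: the helper name `stuckMigrationOps` is made up, and the command key names it checks are assumptions that differ between server versions, so adjust them to whatever your own currentOp output shows.

```javascript
// Hypothetical helper: pick long-running chunk-migration operations out of the
// `inprog` array returned by db.currentOp(). The command key names listed here
// are assumptions and may not match every server version.
function stuckMigrationOps(inprog, minSeconds) {
  const migrationKeys = ["moveChunk", "_shardsvrMoveChunk", "_recvChunkStart"];
  return inprog
    .filter((op) => typeof op.command === "object" && op.command !== null)
    .filter((op) => migrationKeys.some((k) => k in op.command))
    .filter((op) => (op.secs_running || 0) >= minSeconds)
    .map((op) => ({ opid: op.opid, ns: op.ns, secs_running: op.secs_running }));
}
```

For example, `stuckMigrationOps(db.currentOp().inprog, 3600)` lists migration operations that have been running for at least an hour, and the returned `opid` is the value that gets passed to `db.killOp()`.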
“Migration failed” - “ConflictingOperationInProgress: Unable to start new balancer operation because this shard is currently donating chunk”
From reading the MongoDB code, our assumption is that the ActiveMigrationsRegistry still holds a lock, so _activeReceiveChunkState is set and _activeReceiveChunkState->constructErrorStatus() gets executed when the registerDonateChunk method runs.
See the following files in the mongodb/mongo repository on GitHub, at commit 3f849d508692c038afb643b1acb99b8a8cb98d38:
mongo/move_chunk_command.cpp
mongo/active_migrations_registry.h
To try to fix this, we deleted the lock held on the collection that was being transferred from the locks collection in the config db (on the primary config node) and then restarted the balancer. That didn’t work.
We also deleted the document related to this migration from config.migrationCoordinators on the recipient shard and then restarted the balancer. That didn’t work either.
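For reference, the documents in question can be listed read-only before anything is deleted. The sketch below assumes that `config.locks` documents use the namespace as `_id` and a non-zero `state` for a held lock, and that `config.migrationCoordinators` documents carry the namespace in an `nss` field; both helper names and the namespace `mydb.mycoll` are hypothetical.

```javascript
// Hypothetical helpers for inspecting leftover migration metadata.
// Assumption: a lock document has the namespace in `_id` and state 0 when free.
function heldLocks(lockDocs, ns) {
  return lockDocs.filter((l) => l._id === ns && l.state !== 0);
}

// Assumption: a migration coordinator document stores the namespace in `nss`.
function coordinatorsFor(coordinatorDocs, ns) {
  return coordinatorDocs.filter((c) => c.nss === ns);
}
```

In the shell, `heldLocks(db.getSiblingDB("config").locks.find().toArray(), "mydb.mycoll")` would be run on the primary config node, and `coordinatorsFor(db.getSiblingDB("config").migrationCoordinators.find().toArray(), "mydb.mycoll")` on the shard primary.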
Any idea how to fix this? Our next attempt will be to dump the collection, delete it, restart the balancer to see whether that unblocks the sharding, and then restore the collection.
How could we perform an unlock on the ActiveMigrationsRegistry? Would it solve the problem?
Thanks in advance!