Mongorestore crashing shard replica set

Hi all,

We have a sharded cluster with 5 mongos/config servers and 4 shard replica sets, all deployed across data centers. Each shard is a replica set with 1 primary and 2 secondary members. This particular database is not sharded; all of its data lives on a single shard. We recently attempted to restore about 40,000 documents to one of the collections in the database using mongorestore, but this caused that shard's replica set to fail.

For some reason, running mongorestore caused replication on the replica set to stop working. We got the error message "Host failed in replica set": the primary stepped down and the replica set failed to elect a new primary, so the cluster was down and no users could access the database during that time. As soon as I stopped the restore job, the cluster returned to normal and was able to elect a primary.

I have looked through the logs but can't find anything to indicate what exactly caused this. We have run similar restore jobs before with no issues.

Any help with this would be greatly appreciated.

Welcome to the MongoDB Community @Brendan_61585 !

Can you please provide some more information about your environment:

  • specific version of mongorestore used
  • specific version of MongoDB server used
  • command-line options used for mongorestore (with any user/password/hostname details anonymised)

Thanks,
Stennie

Hi Stennie,

We are using MongoDB server 4.4.9 and Database Tools version 100.5.1.

The below command was used from the mongos server to restore to the collection:
nohup mongorestore --db xxxxxx -c xxxxxx --port 27017 -u xxxxxx --password xxxxx --authenticationDatabase admin --noIndexRestore /Path/to/the/files.bson 2>&1 | tee restore.txt

We use nohup on our restores in case the job takes a while and we get logged out of the machine. We also write the restore output to a file so that we can check back for any errors.

The primary is stepping down while waiting for replication.

Here is a more detailed error message from the mongos:
{"t":{"$date":"2022-04-15T17:09:29.308-05:00"},"s":"I", "c":"CONTROL", "id":20710, "ctx":"LogicalSessionCacheRefresh","msg":"Failed to refresh session cache, will try again at the next refresh interval","attr":{"error":"WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true, writeConcern: { w: \"majority\", wtimeout: 15000, provenance: \"clientSupplied\" } }"}}
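Since the error is a w: "majority" write timing out, one idea we have not yet tried would be to lower the restore's write concern so inserts don't block on replication to the secondaries. This is only a sketch (the --writeConcern flag is documented for mongorestore, but we haven't verified this works around the issue; db/collection/credential values are placeholders as above):

```shell
# Sketch only: same restore command, but with w:1 so each insert is
# acknowledged by the primary alone instead of waiting on a majority.
nohup mongorestore --db xxxxxx -c xxxxxx --port 27017 \
  -u xxxxxx --password xxxxx --authenticationDatabase admin \
  --noIndexRestore \
  --writeConcern='{w: 1}' \
  /Path/to/the/files.bson 2>&1 | tee restore.txt
```

The trade-off is durability: with w:1 the restore can finish even if secondaries lag, but writes are not guaranteed to have replicated before being acknowledged.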

Please let me know if you need any more information.