Mongod --repair appears to have stalled, what now?

smaderak · September 9, 2022, 2:30pm

We have a MongoDB v4.4 instance with several TB of data in it. The instance experienced a bad shutdown (the cause has yet to be determined), and after the recovery process completed, the instance still would not start. A few days ago we started the repair process. It is still running, but there have been no console/log updates in over a day, and neither top nor iotop shows much activity at all. So, we have a few questions for the experts in the community.

• How long should a repair process take on 6-ish TB of data? Is that even something that can be predicted?
• Should we stop the repair process since it appears to have stalled, or do we just wait (and hope)?
• Other than the console/log messages is there any way to monitor a repair process?
• Are there any other suggestions as to how to troubleshoot this problem?

Thanks for any help.

Satyam · September 14, 2022, 5:29am

Hey @smaderak,

Welcome to the MongoDB Community Forums!

It’s been five days since you posted. Has this been resolved, or are you still facing issues?

Did you use the mongod --repair command to start the repair process? Please make sure sufficient disk space is available to perform the repair: in extreme cases, a repair may require as much free space as the MongoDB database currently occupies. The repair process also rebuilds indexes as required, so for 6 TB of data it can take a long time.
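As a rough sketch of that disk space check, the script below compares the size of the data directory with the free space on its filesystem. The helper name has_repair_headroom and the default DBPATH are assumptions for illustration, so substitute the dbPath your instance actually uses.

```shell
#!/bin/sh
# Hypothetical helper: succeed when free space (KB) is at least
# as large as the space the data files currently use (KB).
has_repair_headroom() {
    used_kb=$1
    avail_kb=$2
    [ "$avail_kb" -ge "$used_kb" ]
}

# DBPATH is an assumption; use the dbPath from your mongod configuration.
DBPATH=${DBPATH:-/var/lib/mongodb}

if [ -d "$DBPATH" ]; then
    # du -sk: total size of the data directory in KB (slow on several TB).
    used_kb=$(du -sk "$DBPATH" | awk '{print $1}')
    # df -Pk: POSIX output format; column 4 is the available space in KB.
    avail_kb=$(df -Pk "$DBPATH" | awk 'NR==2 {print $4}')
    if has_repair_headroom "$used_kb" "$avail_kb"; then
        echo "OK: ${avail_kb} KB free for ${used_kb} KB of data"
    else
        echo "WARNING: only ${avail_kb} KB free for ${used_kb} KB of data"
    fi
fi
```

This only covers the worst case mentioned above. A repair may need far less space in practice.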

Other than the console/log messages is there any way to monitor a repair process?

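You can check your system logs for messages pertaining to MongoDB. For example, for logs located in /var/log/messages, use the following commands (the second one should surface any out-of-memory killer entries, which include a score for the killed process):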

sudo grep mongod /var/log/messages
sudo grep score /var/log/messages

You could also use a tracing tool such as dtrace (or strace and lsof on Linux) to see which files mongod is accessing.
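If a tracing tool is not available, a simpler option on Linux is to sample the kernel's per-process I/O counters in /proc/<pid>/io twice and compare them. This is only a minimal sketch: the function names io_counter and watch_mongod_io are made up for illustration, and reading the counters of a process owned by another user usually requires root.

```shell
#!/bin/sh
# Extract one counter (e.g. write_bytes) from /proc/<pid>/io-style
# text on stdin. Matching the whole first field keeps write_bytes
# from also matching cancelled_write_bytes.
io_counter() {
    awk -v key="$1:" '$1 == key {print $2}'
}

# Sample the counters twice and report how many bytes the process
# read from and wrote to storage in between. Linux only.
watch_mongod_io() {
    pid=$1
    interval=${2:-10}
    r1=$(io_counter read_bytes < "/proc/$pid/io")
    w1=$(io_counter write_bytes < "/proc/$pid/io")
    sleep "$interval"
    r2=$(io_counter read_bytes < "/proc/$pid/io")
    w2=$(io_counter write_bytes < "/proc/$pid/io")
    echo "read: $((r2 - r1)) bytes, written: $((w2 - w1)) bytes in ${interval}s"
}
```

With a single mongod running, something like watch_mongod_io "$(pgrep -x mongod)" 30 prints the deltas for a 30-second window. Deltas that stay at zero over several samples would suggest the repair is not touching the disk at all.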

Should we stop the repair process since it appears to have stalled, or do we just wait (and hope)?

Before deciding to stop it, it would be better to check the last log entries written before the process got stuck. I’m also attaching a documentation link that should help you out:
Recover a Standalone after an unexpected shutdown
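One rough way to look at those last entries, and to see how long the log has been silent, is sketched below. The helper name log_is_stale and the default LOG path are assumptions; if the repair was started from a console without a log file, the output will be in that terminal instead.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the file was last modified more than
# N minutes ago (default 60). Relies on find -mmin (GNU and BSD find).
log_is_stale() {
    log=$1
    minutes=${2:-60}
    [ -n "$(find "$log" -mmin +"$minutes" 2>/dev/null)" ]
}

# LOG is an assumption; use the path from systemLog.path or --logpath.
LOG=${LOG:-/var/log/mongodb/mongod.log}

if [ -f "$LOG" ]; then
    # Show the last entries written before the process went quiet.
    tail -n 20 "$LOG"
    if log_is_stale "$LOG" 60; then
        echo "No log writes in the last hour"
    fi
fi
```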

Hope this helps. For the future, I would also strongly suggest running a replica set instead of a standalone in production environments, so that cases like this can be solved by resyncing the affected node rather than suffering downtime and data loss.

Please feel free to reach out for anything else as well.

Regards,
Satyam

smaderak · October 13, 2022, 3:34pm

Sorry for my delayed response.

In the end we opted to restore from a backup even though that did cause a bit of data loss. We decided that we couldn’t wait any longer.

We did use the --repair flag, and there was plenty of disk space available. In the future we will keep in mind that more information could be in the system messages.

I agree that a replica set would be better, but I personally don’t make all the decisions.

Thanks for the information.