Recovery from data corruption

Hello,

We have been using a MongoDB 4.4 replica set. Long story short, we encountered a very strange data corruption that I could not reproduce separately.

I assume the server crashed or was forcefully shut down, which caused the initial data corruption. Unfortunately, the secondaries were behind, so they were not usable for data recovery in this case.

We of course ran a database repair, and it seemed that everything worked, except we noticed that some data was not updating on the server.

After checking further, we noticed that for some documents where we updated by the _id field, the operation just got stuck and never returned on the driver or DB side (it did not even trigger timeouts).

It seems that the index was pointing to a document that did not exist anymore: a find by that _id would not return anything, the update was just stuck forever, and its secs_running was always almost equal to its lock acquire count.
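
For reference, operations stuck like this can be inspected from the mongo shell with db.currentOp(); just a minimal sketch (the 30-second threshold is an arbitrary choice):

    // List long-running operations together with their lock statistics.
    db.currentOp({ active: true, secs_running: { $gt: 30 } }).inprog.forEach(function (op) {
        print(op.opid + "  " + op.op + "  " + op.ns + "  " + op.secs_running + "s  waitingForLock=" + op.waitingForLock);
        printjson(op.lockStats);   // acquireCount vs acquireWaitCount per lock type
    });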

We found that validating such collections would produce errors, so we reindexed them with reIndex. But since queries were just stuck, it was very hard to identify which collections needed repairing.

We tried to make a script to validate all collections and reIndex them if needed:
mongo db_name --quiet --eval 'db.getCollectionNames().forEach(function(c){print("checking " + c); var res = db.getCollection(c).validate(); if(!res.valid){print("reindexing " + c); db.getCollection(c).reIndex()}})' > reindex.log 2>&1 &

But running it on the production server consumed too many resources and crashed the mongod process again.
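
In hindsight, a gentler variant might have been to validate one collection at a time from the shell and pause between collections; just a sketch, with an arbitrary one-minute sleep:

    // Throttled variant: validate one collection, reindex it if validation fails,
    // then sleep before moving on to the next one.
    db.getCollectionNames().forEach(function (c) {
        print("checking " + c);
        var res = db.getCollection(c).validate();
        if (!res.valid) {
            print("reindexing " + c);
            db.getCollection(c).reIndex();
        }
        sleep(60 * 1000);   // arbitrary one-minute pause between collections
    });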

So instead, we tried to patch collections as we discovered them, while letting the secondary do an initial resync.

My first question would be: do you think an initial resync would be enough to get rid of such data corruption, so that we can rely on the secondary being clean and not getting stuck on some document again?

And any pointers on what exactly happened and how we can prevent it next time? :slight_smile:

Hi @Arturs_Sosins, welcome to the community!

Sorry to hear about your issues. It’s a serious issue as described, but if you don’t mind, I’d like to collect some facts regarding the underlying deployment first.

We have been using a MongoDB 4.4 replica set.

What is the exact version you’re using? MongoDB 4.4.9 was released very recently, and versions 4.4.2 through 4.4.8 are currently not recommended for production use due to issues identified with them. See the MongoDB 4.4 series release notes for details.
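
For reference, the exact server version can be read from the mongo shell of the node you’re connected to:

    // Exact server version of this node.
    db.version()
    // The same information via a server command:
    db.adminCommand({ buildInfo: 1 }).version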

Unfortunately, the secondaries were behind, so they were not usable for data recovery in this case.

Note that in a standard replica set, any secondary node can take over as primary as soon as it detects that the primary is not accessible. In this case, I’m wondering if you have changed the default replica set configuration, e.g. the votes or node priorities (a quick way to check this is sketched after the list below). If possible:

  • Please post the output of rs.conf() and rs.status() of the replica set.
  • What hardware is involved? Is each node on separate hardware?
  • Does this mean that the secondaries are not affected by this issue?
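
As mentioned above, a quick way to check whether priorities or votes differ from the defaults is to print them from rs.conf(), for example:

    // Print each member's priority, votes and hidden flag from the current configuration.
    rs.conf().members.forEach(function (m) {
        print(m.host + "  priority=" + m.priority + "  votes=" + m.votes + "  hidden=" + m.hidden);
    });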

After checking further, we noticed that for some documents where we updated by the _id field, the operation just got stuck and never returned on the driver or DB side (it did not even trigger timeouts).

Could you post some additional details:

  • What is the update operation you’re doing?
  • How are you doing the update? Are you using an app with a specific MongoDB driver?

So instead, we tried to patch collections as we discovered them, while letting the secondary do an initial resync.

Could you elaborate on this process a little? Initial sync is the process used when a secondary is either a new node or has fallen so far behind the oplog that it cannot catch up to the primary again. Are you doing a resync on all secondaries while fixing the data issues on the primary?

My first question would be: do you think an initial resync would be enough to get rid of such data corruption, so that we can rely on the secondary being clean and not getting stuck on some document again?

A secondary is practically an exact copy of the primary, since its main purpose is to step up as primary if there’s any issue with the current primary (this is also the reason why it’s recommended to set up the secondaries with the same hardware as the primary). So if the current primary has been cleaned, the secondaries should be clean as well.
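
Once the primary is clean and the secondaries have resynced, you can also confirm that they are keeping up, for example:

    // On the primary: how far behind is each secondary?
    rs.printSecondaryReplicationInfo();
    // rs.status() gives per-member state and optimes if more detail is needed.
    rs.status().members.forEach(function (m) {
        print(m.name + "  " + m.stateStr + "  " + m.optimeDate);
    });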

Best regards
Kevin

Hello, good to know about the bugs. Yes, version 4.4.6 was used, so we will be upgrading the instances ASAP.

And yes, the other replica set members were set to priority 0; that was done intentionally, so no worries about that :slight_smile:

I guess my main question is: if the primary’s data is corrupted in some way (for example as described in https://jira.mongodb.org/browse/WT-7984), then when doing an initial sync for a secondary, would the corruption propagate to the secondary, or would it be corrected?

But I guess the answer is that it would be corrected, judging from the same link, since an initial sync is the suggested remediation there. We just need to upgrade first and resync again :slight_smile:
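
For anyone finding this thread later, the resync itself follows the documented procedure of shutting the member down, emptying its data directory, and restarting it so it performs a fresh initial sync. A rough sketch (the dbPath, service name, and mongod user below are placeholders and depend on your setup):

    # 1. Cleanly shut down the secondary (from a mongo shell connected to it):
    #      use admin
    #      db.shutdownServer()
    # 2. Move the old data files aside; /var/lib/mongodb stands in for the actual
    #    dbPath from mongod.conf, and the new directory must stay owned by the mongod user:
    mv /var/lib/mongodb /var/lib/mongodb.corrupt
    mkdir /var/lib/mongodb && chown mongod:mongod /var/lib/mongodb
    # 3. Restart mongod with the same configuration; with an empty dbPath the node
    #    performs an initial sync from another member automatically.
    systemctl start mongod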