Hello,
We have been using a MongoDB 4.4 replica set.
Long story short, we encountered very strange data corruption which I could not replicate separately.
I assume the server crashed or was forcefully shut down, which corrupted the initial data state. Unfortunately, the secondaries were behind, so they were not usable for data recovery in this case.
We of course ran a database repair, and it seemed that everything worked, except we noticed that some data was not updating on the server.
After checking further, we noticed that on some documents, updates by the _id field just got stuck and never returned on either the driver or the DB side (they did not even trigger timeouts).
It seems the index was pointing to a document that no longer existed: a find command on that _id would return nothing, the update would hang forever, and its seconds_running was always almost equal to the lock-acquire count.
We found that validating such collections produced errors, so we reIndexed them. But since queries just hung, it was very hard to identify which collections needed repairing.
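For reference, one way we spotted the hung updates was to filter db.currentOp() output for long-running active operations. Here is a small sketch of that filtering logic as plain JS (the inprog array shape and the secs_running field follow the standard currentOp output; findStuckOps and thresholdSecs are names I made up for illustration):

// Given the "inprog" array from db.currentOp(), return operations that have
// been actively running longer than thresholdSecs (likely stuck on our
// corrupted collections, since they never time out on their own).
function findStuckOps(inprog, thresholdSecs) {
  return inprog.filter(function (op) {
    return op.active && (op.secs_running || 0) >= thresholdSecs;
  });
}

The ns field of each returned operation then tells you which collection the stuck query was touching, which is how we narrowed down candidates for validation.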
We tried to write a script to validate all collections and reIndex them if needed:
mongo db_name --quiet --eval '
db.getCollectionNames().forEach(function (c) {
    print("checking " + c);
    var res = db[c].validate();
    // validate() reports a "valid" flag as well as an "errors" array;
    // checking only errors.length can miss collections flagged invalid
    if (!res.valid || (res.errors && res.errors.length)) {
        print("reindexing " + c);
        db[c].reIndex();
    }
});' > reindex.log 2>&1 &
But running it on the production server consumed too many resources and crashed the mongod process again.
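In hindsight, a throttled variant might have spread the load out: validate one collection at a time with a pause between each, collecting the reIndex candidates instead of reindexing inline. A minimal sketch of that idea in plain JS follows (validateThrottled and pauseMs are my own names; validateFn stands in for db[c].validate(), which I have stubbed here so the logic is self-contained):

// Validate collections sequentially with a pause between each, so mongod is
// not hammered with back-to-back validate passes. Returns the list of
// collections whose validation failed (candidates for reIndex).
async function validateThrottled(collections, validateFn, pauseMs) {
  const toReindex = [];
  for (const c of collections) {
    const res = await validateFn(c); // one validate at a time
    if (!res.valid || (res.errors && res.errors.length)) {
      toReindex.push(c); // flag for a later, separate reIndex pass
    }
    // let the server breathe before the next collection
    await new Promise(function (r) { setTimeout(r, pauseMs); });
  }
  return toReindex;
}

Whether a pause alone would have been enough to keep mongod alive on our corrupted data, I cannot say; this is just the shape of what we would try next time.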
So instead we patched collections as we discovered them, while letting the secondary do an initial resync.
And my first question would be: do you think an initial resync is enough to get rid of such data corruption, so we can rely on the secondary being clean and not getting stuck on some document again?
And any pointers on what exactly happened and how we can prevent it next time?