MongoDB backup and restore error using Velero and MinIO (on-premise Kubernetes cluster)

Backing up a MongoDB cluster composed of three replicated MongoDB instances on an on-premise Kubernetes cluster, using Velero and MinIO with Restic, triggers the following fatal error on one of the instances after the backup is restored:

"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:226:WiredTigerHS.wt: potential hardware corruption, read checksum error for 4096B block at offset 172032: block header checksum of 0x63755318 doesn't match expected checksum of 0x22b37ec4"
"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:WiredTigerHS.wt: fatal read error","error_str":"WT_ERROR: non-specific WiredTiger error","error_code":-31802
"ERROR","verbose_level_id":-3,"msg":"__wt_block_read_off:235:the process must exit and restart","error_str":"WT_PANIC: WiredTiger library panic","error_code":-31804
Fatal assertion","attr":{"msgid":50853,"file":"src/mongo/db/storage/wiredtiger/wiredtiger_util.cpp","line":712
\n\n***aborting after fassert() failure\n\n
Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n

Please note that we tested this with MongoDB versions 4.4.11 and 6.0.5.

The restore works well for all our applications (including two of the MongoDB nodes), except for one (sometimes two) MongoDB node, which most of the time ends up in a “Back-off restarting failed container” state (even after triggering a manual “mongod --repair” on it).
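For reference, the manual repair was run roughly like this on the failing member while mongod was stopped; /data/db is an assumption (the default dbPath of the official MongoDB image) and may differ in your deployment:

```
# Attempt to rebuild the damaged WiredTiger files in place.
# /data/db is the image default; adjust to your actual volume mount.
mongod --dbpath /data/db --repair
```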

We suspect that backing up the three replicated MongoDB instances while some MongoDB synchronisation is still ongoing (all services connected to MongoDB are shut down during the backup) causes the backup to be seen as corrupted during the restore. Do you know what could cause this issue and how we could solve it?
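For context, the backups are created roughly as follows; the namespace, pod, and volume names below are illustrative placeholders for our actual setup:

```
# Opt each MongoDB data volume in for Restic backup (illustrative names).
kubectl -n mongodb annotate pod mongodb-0 backup.velero.io/backup-volumes=datadir
kubectl -n mongodb annotate pod mongodb-1 backup.velero.io/backup-volumes=datadir
kubectl -n mongodb annotate pod mongodb-2 backup.velero.io/backup-volumes=datadir

# Create the backup, then restore it on the target cluster.
velero backup create mongodb-backup --include-namespaces mongodb
velero restore create --from-backup mongodb-backup
```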

Hi @Eric_Hemmerlin and welcome to the MongoDB community forums!

The error messages in the log you posted seem to indicate that either the backup or the underlying hardware is corrupt.
Could you share a few details about the hardware on which the cluster is deployed, such as CPU, RAM, number of cores, free disk space, etc.?

However, since the cluster is deployed using several different technologies, any one of them could also be the cause of the failure.
My suggestion would be to debug each layer of the stack and let us know whether the issue is specific to MongoDB.
Note that if the underlying issue is caused by an incomplete backup or corrupt hardware, there is not much the database can do to overcome it.

It seems to me like you are backing up all three nodes separately. Is this correct? Note that for a MongoDB replica set, you typically only need to back up one node, since all members of the set contain identical data. You can then restore this data to the three nodes as per the restoring a replica set documentation.
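As an illustration only (not the exact procedure from that documentation), one way to capture a single consistent copy from one member and seed the rebuilt deployment from it is mongodump/mongorestore with oplog capture; the hostname and archive path below are placeholders:

```
# Dump a single member (ideally a secondary), capturing the oplog so the
# dump is consistent to a single point in time. Hostname/path are placeholders.
mongodump --host mongodb-0.mongodb-svc:27017 --oplog --gzip --archive=/backup/mongodb.archive.gz

# Restore into the first member of the rebuilt deployment and replay the
# captured oplog; the remaining members can then be added and perform an
# initial sync from this node.
mongorestore --host mongodb-0.mongodb-svc:27017 --oplogReplay --gzip --archive=/backup/mongodb.archive.gz
```

A volume-level backup of just one member via Velero/Restic would be another way to achieve the same result.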

Regards
Aasawari

Hello @Aasawari, and thanks for your answer, I appreciate it. Yes, you are right: we were backing up all three nodes separately, so after reading your post we changed our approach to back up only one node. We hope it will fix the issue we had.
Regards
Eric
