Hi @Kadir_USTUN
I agree with @Ramachandra_Tummala 's assessment that restoring from backup is probably the best way forward.
However I’m curious about one thing. Here’s the error message from node 1:
{"t":{"$date":"2022-09-26T08:36:02.439+03:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31809,"message":"[1664170562:439217][94507:0x7f741c10abc0], connection: __wt_turtle_read, 391: WiredTiger.turtle: fatal turtle file read error: WT_TRY_SALVAGE: database corruption detected"}}
{"t":{"$date":"2022-09-26T08:36:02.439+03:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1664170562:439270][94507:0x7f741c10abc0], connection: __wt_turtle_read, 391: the process must exit and restart: WT_PANIC: WiredTiger library panic"}}
and here’s the error message from node 2:
{"t":{"$date":"2022-09-26T08:37:05.674+03:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31809,"message":"[1664170625:674753][122834:0x7f77fd5a8bc0], connection: __wt_turtle_read, 391: WiredTiger.turtle: fatal turtle file read error: WT_TRY_SALVAGE: database corruption detected"}}
{"t":{"$date":"2022-09-26T08:37:05.674+03:00"},"s":"E", "c":"STORAGE", "id":22435, "ctx":"initandlisten","msg":"WiredTiger error","attr":{"error":-31804,"message":"[1664170625:674804][122834:0x7f77fd5a8bc0], connection: __wt_turtle_read, 391: the process must exit and restart: WT_PANIC: WiredTiger library panic"}}
It strikes me as odd that both of them seem to have an identical error:
WiredTiger.turtle: fatal turtle file read error: WT_TRY_SALVAGE: database corruption detected
I understand that this is a PSA setup, but I noticed that both data-bearing nodes show the exact same error. Note that WiredTiger.turtle is a vital file, so WiredTiger is very, very careful in how it handles this file in particular.
Seeing an error of this magnitude on two different nodes at roughly the same time is so unlikely that I suspect there is some other cause behind it. How are you deploying the mongod processes? Are they sharing disk, CPU, or anything else? What are the specs of the deployment hardware/architecture?
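If it helps with the investigation, here's a minimal Python sketch to compare the two WiredTiger.turtle files and see what device each dbPath sits on. It assumes both dbPaths are visible from the machine where you run it (e.g. two mongod processes on the same host or on shared/mounted storage); the /data/node1 and /data/node2 paths are placeholders, so substitute your actual storage.dbPath values:

```python
import hashlib
import os
from pathlib import Path

# Placeholder dbPath locations for the two data-bearing nodes; replace with
# the storage.dbPath values from each mongod.conf.
DB_PATHS = ["/data/node1", "/data/node2"]

for db_path in DB_PATHS:
    turtle = Path(db_path) / "WiredTiger.turtle"
    st = os.stat(turtle)
    digest = hashlib.sha256(turtle.read_bytes()).hexdigest()
    # st_dev identifies the underlying block device; if both dbPaths report
    # the same device ID, the two nodes are sharing a filesystem.
    print(turtle)
    print(f"  device id : {st.st_dev}")
    print(f"  size      : {st.st_size} bytes")
    print(f"  mtime     : {st.st_mtime}")
    print(f"  sha256    : {digest}")
```

If both paths resolve to the same device ID, or the two files turn out to be byte-identical, that would point towards shared storage rather than two independent failures.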
Best regards
Kevin