We recently discovered that, during an incident over two months ago with an internal staging DB (its setup is identical to production), data was lost.
As this was only noticed recently, it is tricky to find more information on the issue, but our current working hypothesis is the following:
Info: We have a three-member replica set.
1. The primary crashed (for reasons currently unknown; memory issues are likely).
2. A secondary took over as primary for multiple hours, and new data was written during that time.
3. That secondary started failing as well, taking the whole replica set down.
4. Both the secondary and the original primary were manually restarted.
5. The replica set started showing regular behaviour again.
However, it seems that the data written between steps 2 and 3 was gone (luckily we had backups from that time).
Now: according to the documentation, a rollback should normally leave a rollback/ directory under the dbPath containing the rolled-back documents, but no such directory seems to exist.
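For completeness, this is roughly how I checked; the connection string and dbPath are placeholders for our actual setup, and I am sketching it with pymongo here rather than our real tooling:

```python
import os
from pymongo import MongoClient

client = MongoClient("mongodb://staging-db:27017")  # placeholder URI

# createRollbackDataFiles (default: true) controls whether rollback
# files are written at all during a rollback
print(client.admin.command("getParameter", 1, createRollbackDataFiles=1))

# If rollback files were written, they end up as BSON files
# under <dbPath>/rollback/
DB_PATH = "/var/lib/mongodb"  # placeholder dbPath
rollback_dir = os.path.join(DB_PATH, "rollback")
if os.path.isdir(rollback_dir):
    for root, _, files in os.walk(rollback_dir):
        for name in files:
            print(os.path.join(root, name))
else:
    print("no rollback directory under", DB_PATH)
```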
We have since implemented even more rigorous monitoring and alerting. I am aware this is a long shot, as the incident was long enough ago that most of our logs are already gone, but do you have any other hypotheses about what could have happened that would explain the data being completely gone?
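(For reference, a minimal sketch of the kind of health check we added, again assuming pymongo and a placeholder URI; our actual alerting is more involved:)

```python
from pymongo import MongoClient

client = MongoClient("mongodb://staging-db:27017")  # placeholder URI

# replSetGetStatus reports the state and health of every member
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    # stateStr is e.g. PRIMARY, SECONDARY, RECOVERING; health is 1 or 0
    print(member["name"], member["stateStr"], "health:", member["health"])

# The condition we alert on: no reachable primary
if not any(m["stateStr"] == "PRIMARY" for m in status["members"]):
    print("ALERT: replica set has no primary")
```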
Thank you for your time,
Ben