We recently discovered that, during an incident over two months ago with an internal staging DB (its setup is identical to production), data was lost.
As this was only noticed recently, it is tricky to find more information on the issue, but our current working hypothesis is the following:
Info: We have a three-member replica set.
1. The primary crashed (for reasons currently unknown; memory issues are likely).
2. A secondary took over as primary for multiple hours, and new data was written during that time.
3. That secondary started failing as well, taking the whole replica set down.
4. Both the secondary and the original primary were manually restarted.
5. The replica set started showing regular behaviour again.
However, it seems that the data written between steps 2 and 3 was gone (luckily we had backups from that time).
Now: according to the documentation, a rollback should normally leave a rollback/ directory under the dbPath containing the rolled-back documents, but no such directory seems to exist.
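For completeness, this is roughly how I checked; the connection string and dbPath are placeholders for our actual setup, and I am sketching it with pymongo here rather than our real tooling:

```python
import os
from pymongo import MongoClient

client = MongoClient("mongodb://staging-db:27017")  # placeholder URI

# createRollbackDataFiles (default: true) controls whether rollback
# files are written at all during a rollback
print(client.admin.command("getParameter", 1, createRollbackDataFiles=1))

# If rollback files were written, they end up as BSON files
# under <dbPath>/rollback/
DB_PATH = "/var/lib/mongodb"  # placeholder dbPath
rollback_dir = os.path.join(DB_PATH, "rollback")
if os.path.isdir(rollback_dir):
    for root, _, files in os.walk(rollback_dir):
        for name in files:
            print(os.path.join(root, name))
else:
    print("no rollback directory under", DB_PATH)
```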
We have since implemented even more rigorous monitoring and alerting. I am aware this is a long shot, as the incident was long enough ago that most of our logs are already gone, but do you have any other hypotheses about what could have happened that would explain the data being completely gone?
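(For reference, a minimal sketch of the kind of health check we added, again assuming pymongo and a placeholder URI; our actual alerting is more involved:)

```python
from pymongo import MongoClient

client = MongoClient("mongodb://staging-db:27017")  # placeholder URI

# replSetGetStatus reports the state and health of every member
status = client.admin.command("replSetGetStatus")
for member in status["members"]:
    # stateStr is e.g. PRIMARY, SECONDARY, RECOVERING; health is 1 or 0
    print(member["name"], member["stateStr"], "health:", member["health"])

# The condition we alert on: no reachable primary
if not any(m["stateStr"] == "PRIMARY" for m in status["members"]):
    print("ALERT: replica set has no primary")
```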
Thank you for your time,
Ben