Chaos testing MongoDB Replica-set and Oplog issues

Doug_Duncan · September 20, 2022, 3:09am

Hi @Vineel_Yalamarthi and welcome to the MongoDB community.

In the first case it sounds like you shut down a primary member that had writes that had not yet been replicated to the secondary member. When this node came back online and rejoined the replica set, MongoDB rolled those writes back to keep its data consistent with the writes that happened while it was not primary. You can learn more about rollbacks in the documentation.
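To make the rollback idea concrete, here is a minimal sketch in plain Python. It is a simplified model, not MongoDB's actual implementation: each oplog entry is represented as a hypothetical `(term, timestamp)` tuple, and `find_rollback` is a made-up helper name. It walks the former primary's oplog backwards to find the last entry it shares with the new primary; everything after that common point is what would be rolled back.

```python
def find_rollback(former_primary_oplog, new_primary_oplog):
    """Return (common_point, rolled_back) for a simplified oplog model.

    Each oplog is a list of (term, timestamp) tuples in the order they
    were written. This is an illustrative assumption, not the real format.
    """
    new_entries = set(new_primary_oplog)
    common_index = -1
    # Walk backwards from the newest entry to find the last shared one.
    for i in range(len(former_primary_oplog) - 1, -1, -1):
        if former_primary_oplog[i] in new_entries:
            common_index = i
            break
    common_point = (
        former_primary_oplog[common_index] if common_index >= 0 else None
    )
    # Entries after the common point exist only on the former primary.
    rolled_back = former_primary_oplog[common_index + 1:]
    return common_point, rolled_back


# The old primary accepted two writes in term 1 that never replicated.
former = [(1, 1), (1, 2), (1, 3), (1, 4)]
# The new primary was elected in term 2 and accepted its own writes.
new = [(1, 1), (1, 2), (2, 3), (2, 4)]

common_point, rolled_back = find_rollback(former, new)
print("common point:", common_point)
print("rolled back:", rolled_back)
```

In this toy run the two unreplicated term-1 writes are the ones that get reverted, which matches what you described seeing.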

In the second case it seems like the secondary member was down longer than the window of operations the other members’ oplogs could hold. When the secondary came back online, it could not reconcile its local oplog with where the other members were and therefore could not catch itself up. If that’s the case, you would need to resync the member.
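The stale-member condition boils down to a single comparison, sketched below in plain Python. The function name `is_stale` and the timestamps are hypothetical and only for illustration; on a real deployment you would read the equivalent values from the replica set status and the oplog on each member.

```python
from datetime import datetime, timedelta


def is_stale(member_last_applied, source_oldest_oplog_entry):
    """A member is stale when the last operation it applied is older than
    the oldest operation its sync source still holds in its oplog."""
    return member_last_applied < source_oldest_oplog_entry


# Illustrative values: the other members' oplogs cover the last 6 hours.
now = datetime(2022, 9, 20, 12, 0, 0)
source_oldest = now - timedelta(hours=6)

# A secondary that was down for 2 hours can still catch up.
print(is_stale(now - timedelta(hours=2), source_oldest))

# A secondary that was down for 10 hours has fallen off the oplog
# and would need a resync.
print(is_stale(now - timedelta(hours=10), source_oldest))
```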

I am not sure how chaotic you’re being in your testing, what resources these machines have, how active they are, etc., but rollbacks can happen and are expected if the primary node goes down before its writes have replicated. As for stale members, if a secondary is down longer than the other members’ oplog window, it will not be able to catch up on its own. You can resize the oplog if you find the default too small. Note that the oplog is a capped collection of a fixed size, so the length of the window it covers depends on how many writes you do on the system. If you have a write-heavy system, you might want to size the oplog larger than the default so it holds operations for a longer period of time.
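As a rough way to reason about sizing, here is a back-of-the-envelope sketch in plain Python. The helper names, the 1.5x headroom factor, and the example numbers are all my own assumptions. The last helper only builds the document for the `replSetResizeOplog` admin command, which as I understand it takes `size` in megabytes; it does not talk to a server, so check the documentation for your MongoDB version before running anything like it.

```python
import math


def estimated_window_hours(oplog_size_mb, oplog_mb_per_hour):
    """Rough oplog window: capacity divided by how fast writes fill it."""
    if oplog_mb_per_hour <= 0:
        raise ValueError("oplog_mb_per_hour must be positive")
    return oplog_size_mb / oplog_mb_per_hour


def required_oplog_mb(oplog_mb_per_hour, target_window_hours, headroom=1.5):
    """Size needed to cover a target window, with some headroom for
    write bursts (the 1.5 default is an arbitrary assumption)."""
    return math.ceil(oplog_mb_per_hour * target_window_hours * headroom)


def resize_command(size_mb):
    """Build the admin command document; the caller decides whether and
    how to run it against each replica set member."""
    return {"replSetResizeOplog": 1, "size": float(size_mb)}


# Example: a 2000 MB oplog with writes generating about 500 MB of oplog
# per hour covers roughly 4 hours of downtime.
print(estimated_window_hours(2000, 500))

# To survive a 24 hour outage at the same write rate, with headroom:
size_mb = required_oplog_mb(500, 24)
print(size_mb)
print(resize_command(size_mb))
```

The point is just that the window is a function of write volume, not wall-clock time, so a chaos test that also generates heavy writes will shrink the window much faster than an idle cluster would.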