Chaos testing MongoDB Replica-set and Oplog issues

We run MongoDB 4.2 as a replica set across three nodes: one primary and two secondaries. We are performing chaos testing (randomly rebooting nodes) to see how well our data pipelines hold up, i.e. whether the remaining two MongoDB nodes can still form a functioning replica set when one node goes down.

On two different MongoDB replica sets we see two different issues, and I am trying to understand the difference between them.

Issue 1

2022-09-19T20:30:54.921+0000 I REPL [rsBackgroundSync] **Starting rollback due to OplogStartMissing**: Our last optime fetched: { ts: Timestamp(1663318803, 4), t: 95 }. source's GTE: { ts: Timestamp(1663318861, 1), t: 96 }

To give more context, here are the log lines immediately before and after that entry.

2022-09-19T20:30:54.807+0000 I  REPL     [rsBackgroundSync] Choosing primary as sync source candidate: mongodb-2.mongodb.amazon.svc.cluster.local:27017
2022-09-19T20:30:54.808+0000 I  CONNPOOL [RS] Connecting to mongodb-2.mongodb.amazon.svc.cluster.local:27017
2022-09-19T20:30:54.827+0000 I  REPL     [rsBackgroundSync] Changed sync source from empty to mongodb-2.mongodb.amazon.svc.cluster.local:27017
2022-09-19T20:30:54.919+0000 I  NETWORK  [conn249] received client metadata from 127.0.0.1:44348 conn249: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "4.2.11" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last optime fetched: { ts: Timestamp(1663318803, 4), t: 95 }. source's GTE: { ts: Timestamp(1663318861, 1), t: 96 }
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] Rollback using 'recoverToStableTimestamp' method.
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] Scheduling rollback (sync source: mongodb-2.mongodb.amazon.svc.cluster.local:27017)
2022-09-19T20:30:54.921+0000 I  ROLLBACK [rsBackgroundSync] transition to ROLLBACK
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] State transition ops metrics: { lastStateTransition: "rollback", userOpsKilled: 0, userOpsRunning: 127 }
2022-09-19T20:30:54.921+0000 I  REPL     [rsBackgroundSync] transition to ROLLBACK from SECONDARY
2022-09-19T20:30:54.924+0000 I  NETWORK  [rsBackgroundSync] Skip closing connection for connection # 30
2022-09-19T20:30:54.924+0000 I  NETWORK  [rsBackgroundSync] Skip closing connection for connection # 15
2022-09-19T20:30:54.924+0000 I  ROLLBACK [rsBackgroundSync] Waiting for all background operations to complete before starting rollback
2022-09-19T20:30:54.924+0000 I  ROLLBACK [rsBackgroundSync] Finished waiting for background operations to complete before rollback
2022-09-19T20:30:54.924+0000 I  ROLLBACK [rsBackgroundSync] finding common point

Issue 2

2022-08-01T02:30:21.915+0000 I  REPL     [replication-0] We are too stale to use mongodb-0.mongodb.amazon.svc.cluster.local:27017 as a sync source. Blacklisting this sync source because our last fetched timestamp: Timestamp(1659314292, 1) is before their earliest timestamp: Timestamp(1659320029, 301) for 1min until: 2022-08-01T02:31:21.915+0000

There are two different classes of errors. I am trying to understand the difference.

Hi @Vineel_Yalamarthi and welcome to the MongoDB community. 👋

In the first case it sounds like you shut down a primary member that had writes which had not yet been replicated to the secondaries. When this node came back online and rejoined the set, MongoDB rolled those writes back to keep its data consistent with the writes that happened while this member was not primary. You can learn more about rollbacks in the documentation.

In the second case it seems like the secondary member was down for longer than the other members' oplogs could retain data. When the secondary came back online it could not reconcile its local oplog with where the other members were, and therefore could not catch itself up. If that's the case, you would need to resync the member.

I am not sure how chaotic you're being in your testing, what type of resources these machines have, how active the machines are, etc., but rollbacks can happen and are expected when the primary node goes down. As for stale members: if a secondary is down for longer than the other members' oplog window, it will not be able to catch up on its own. You can resize your oplog if you find the default too small. Note that the oplog is a capped collection of a given size, so how long a window it stores depends on the volume of writes on your system. If you have a write-heavy system, you might want to size it larger than the default so it holds operations for a longer period of time.
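For example, you can check the current oplog window and resize the oplog at runtime in the mongo shell (a sketch; the 16 GB figure is illustrative, not a recommendation):

// Show the current oplog size and the time window it covers
rs.printReplicationInfo()

// Resize the oplog to 16 GB (replSetResizeOplog takes the size in megabytes)
db.adminCommand({ replSetResizeOplog: 1, size: 16384 })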

@Doug_Duncan I understood your explanation for the second case. Thanks a lot.

Can you elaborate on how and why rollbacks lead to the error we are getting, how we can avoid it, and what we should do to recover from it? I am more interested in understanding the root cause.

Let’s say at time t1 we have three mongod instances X, Y, and Z, and X is primary. Now X is shut down, and let’s say Y takes over the PRIMARY role.

After a long time, let’s say X comes back online and finds that it used to be PRIMARY and that it has writes that were never replicated to the secondaries.

From here, how does that lead to the “OplogStartMissing” error? I am confused. Thanks in advance.

Welcome to the MongoDB Community @Vineel_Yalamarthi!

As @Doug_Duncan mentioned, whatever chaos testing you are doing has made this replica set member too stale to rejoin the replica set.

Issue 1

Starting rollback due to OplogStartMissing: Our last optime fetched: { ts: Timestamp(1663318803, 4), t: 95 }. source's GTE: { ts: Timestamp(1663318861, 1), t: 96 }

2022-09-19T20:30:54.921+0000 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(0, 0), t: -1 }
2022-09-19T20:30:54.921+0000 I REPL [rsBackgroundSync] Rollback using 'recoverToStableTimestamp' method.

This member was restarted before writing a stable replication commit point ({ ts: Timestamp(0, 0), t: -1 } is a sentinel value) and does not have an oplog entry in common with the sync source. There is an attempt to Rollback by recovering to a stable timestamp.
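You can observe this on a member directly. As a sketch (mongo shell against MongoDB 4.2+, where the field is named lastStableRecoveryTimestamp):

// The point that a 'recoverToStableTimestamp' rollback recovers to;
// Timestamp(0, 0) means no stable checkpoint has been persisted yet
db.adminCommand({ replSetGetStatus: 1 }).lastStableRecoveryTimestamp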

Issue 2

We are too stale to use X:27017 as a sync source.

The latest oplog entry on this member is older than the oldest oplog entry for the sync source. Unless you happen to have a sync source with a larger oplog available (perhaps another member down during your chaos testing), you will have to Resync this member.
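A resync boils down to stopping the stale mongod, emptying its dbPath, and restarting it so that it performs an automatic initial sync from the replica set. As a sketch, you can then watch its progress from any other member in the mongo shell (field names as in recent replSetGetStatus output):

// Print each member's state and sync source; the resyncing member
// should sit in STARTUP2 while its initial sync is running
rs.status().members.forEach(function (m) {
  print(m.name, m.stateStr, m.syncSourceHost);
});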

Mitigation approaches

To try to avoid members becoming stale while offline you can:

  • increase your oplog sizes

  • use majority write concern (the server default in MongoDB 5.0+) to ensure commits propagate to a majority of replica set members

  • upgrade to MongoDB 4.4 or newer and configure storage.oplogMinRetentionHours to guarantee an oplog window (with the caveat that the oplog will grow without constraint to meet this configuration setting)
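As a sketch of the last two points in the mongo shell (the collection name and the 2 GB / 24-hour values are illustrative assumptions, not recommendations):

// A majority-acknowledged write cannot be rolled back
db.events.insertOne(
  { ts: new Date(), source: "chaos-test" },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
)

// MongoDB 4.4+: cap the oplog size and guarantee a minimum retention window
// (the runtime equivalent of the storage.oplogMinRetentionHours setting)
db.adminCommand({ replSetResizeOplog: 1, size: 2048, minRetentionHours: 24 })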

MongoDB 4.2 (first released in Aug 2019) is currently the oldest non-EOL server release series. I recommend planning an upgrade to a newer server release series to take advantage of improvements and features that have been added in the last 3 years.

Regards,
Stennie

@Stennie_X The second one is clear. Thanks so much for taking time to answer my question.

  1. What does OplogStartMissing mean? I know what the oplog is, but what does OplogStartMissing have to do with rolling back?
  2. You mentioned that this member was restarted before writing a stable replication commit point. How can you infer that, and what is a stable replication commit point?