Hi,
I have recently had a problem with the MongoDB Source Connector failing with the error message "Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286".
I’m aware of what this means, i.e. the connector can no longer find the offset, so it cannot reliably continue consuming the change stream → fatal error.
To recover the failing connector, the now faulty resume token must be tombstoned / removed from the .offset topic. This is of course time-consuming as well as error-prone.
I’m wondering how the Source Connector ended up in this situation, and what can be done to mitigate it.
Setup
- One MongoDB Atlas cluster that hosts both DatabaseA and DatabaseB
- DatabaseA.user
  - has a very bursty traffic pattern: sometimes weeks go by in silence, sometimes 1000s of users are inserted in a short period of time
  - user-source-connector is configured on this collection
- DatabaseB.events
  - very busy collection with lots of traffic all the time
- The metric Replication Oplog Window usually sits around 5 days with the normal traffic patterns on the cluster
Timeline
- Nothing has been going on in DatabaseA.user for weeks
- The user-source-connector has been running happily for weeks
- Something caused the user-source-connector to reboot
- On restart, the user-source-connector enters a boot-loop failure due to "Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286"
Suspicion
- The oplog is shared by both DatabaseA and DatabaseB
- The activity in DatabaseB causes the resume token for DatabaseA.user to be evicted, so it is not found when it’s required upon restart of the user-source-connector
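To make the suspicion concrete, here is a minimal sketch in Python of the arithmetic I have in mind. It assumes the resume token points at the last change seen on DatabaseA.user and that the oplog only retains roughly the Replication Oplog Window. The function name, the reference date, and the "3 weeks" figure are hypothetical and only there for illustration; this is not how the connector itself checks anything.

```python
from datetime import datetime, timedelta, timezone


def resume_point_in_oplog(token_time, now, oplog_window):
    """Return True if the resume token's timestamp is still covered by the oplog.

    Hypothetical helper: models the oplog as a sliding window ending at `now`.
    """
    oldest_retained_entry = now - oplog_window
    return token_time >= oldest_retained_entry


# Hypothetical reference point; only the differences below matter.
now = datetime(2024, 1, 31, tzinfo=timezone.utc)

# Oplog window of ~5 days, driven mostly by the busy DatabaseB.events traffic.
oplog_window = timedelta(days=5)

# Last change on the quiet DatabaseA.user collection (assumed: 3 weeks ago).
last_user_change = now - timedelta(weeks=3)

# The stored resume token is older than anything left in the shared oplog.
print(resume_point_in_oplog(last_user_change, now, oplog_window))

# A token from a collection that changed yesterday would still be resumable.
print(resume_point_in_oplog(now - timedelta(days=1), now, oplog_window))
```

If this model is right, any quiet period on DatabaseA.user longer than the oplog window leaves the connector with a token it cannot resume from after a restart.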
If this is indeed true, what are some ways of mitigating the risk of this happening?
Best regards,
Marcus Wallin