How to prevent "Failed to resume change stream" on rarely changed collections

Hi,

I have recently had a problem with the MongoDB Source Connector failing with the error message Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286.

I’m aware of what this means, i.e. the connector can no longer find the offset, so it cannot reliably continue consuming the change stream → fatal error.

To recover the failing connector, the now-invalid resume token must be tombstoned / removed from the .offset topic. This is of course time consuming as well as error prone.
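For reference, the recovery ends up looking roughly like the sketch below. Everything in it is illustrative rather than exact: the offsets topic name, the bootstrap address and the key matching are assumptions about a distributed-mode Connect setup, and the real offset key has to be copied from the existing record for user-source-connector.

```python
from kafka import KafkaConsumer, KafkaProducer

# Sketch only: topic name, bootstrap address and key matching are assumptions,
# not values copied from a real deployment.
OFFSETS_TOPIC = "connect-offsets"
BOOTSTRAP = "localhost:9092"

# 1) Find the exact key of the stale offset record for the connector.
consumer = KafkaConsumer(
    OFFSETS_TOPIC,
    bootstrap_servers=BOOTSTRAP,
    auto_offset_reset="earliest",
    consumer_timeout_ms=10_000,
)
offset_key = None
for record in consumer:
    if record.key and b"user-source-connector" in record.key:
        offset_key = record.key  # keep the last matching key
consumer.close()

# 2) Publish a tombstone (same key, null value) so Connect drops the
#    stale resume token the next time the connector starts.
producer = KafkaProducer(bootstrap_servers=BOOTSTRAP)
producer.send(OFFSETS_TOPIC, key=offset_key, value=None)
producer.flush()
```

After that the connector has to be restarted and it begins a brand new change stream, which is exactly the manual intervention I’d like to avoid in the first place.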

I’m wondering how the Source Connector ended up in this situation, and what can be done to mitigate it.

Setup

  • One MongoDB Atlas Cluster that hosts both DatabaseA and DatabaseB
  • DatabaseA.user
    • has a very bursty traffic pattern: sometimes weeks go by in silence, sometimes thousands of users are inserted in a short period of time
    • user-source-connector is configured on this collection (a rough config sketch follows this list)
  • DatabaseB.events: a very busy collection with lots of traffic all the time
  • The Replication Oplog Window metric usually sits around 5 days under the normal traffic patterns on the cluster
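For context, user-source-connector is essentially a plain MongoSourceConnector pointed at that one collection. The sketch below is illustrative only; the URI, names and prefix are placeholders, not my real values:

```json
{
  "name": "user-source-connector",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "mongodb+srv://<user>:<password>@<cluster-host>",
    "database": "DatabaseA",
    "collection": "user",
    "topic.prefix": "mongo"
  }
}
```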

Timeline

  • Nothing has been going on in DatabaseA.user for weeks
  • The user-source-connector has been running happily for weeks
  • Something caused the user-source-connector to reboot
  • On restart, the user-source-connector enters a boot loop, repeatedly failing with Failed to resume change stream: Resume of change stream was not possible, as the resume point may no longer be in the oplog. 286

Suspicion

  • The oplog is shared between DatabaseA and DatabaseB
  • The heavy write activity in DatabaseB causes the oplog entry referenced by the resume token for DatabaseA.user to be rolled off, so it is no longer there when it’s required upon restart of the user-source-connector

If this is indeed the case, what are some ways of mitigating the risk of this happening?

Best regards,
Marcus Wallin

Hi, check out the article https://www.mongodb.com/docs/kafka-connector/current/troubleshooting/recover-from-invalid-resume-token/#invalid-resume-token

To mitigate this, configure a heartbeat interval, as described in the prevention section of the same article: https://www.mongodb.com/docs/kafka-connector/current/troubleshooting/recover-from-invalid-resume-token/#prevention.
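Concretely, that means adding something like the following to the source connector config (a sketch; the 60 second interval is just an example value, and __mongodb_heartbeats is the connector's default heartbeat topic name):

```json
{
  "heartbeat.interval.ms": "60000",
  "heartbeat.topic.name": "__mongodb_heartbeats"
}
```

With a heartbeat interval set, the connector periodically publishes heartbeat messages to that topic whenever there are no changes on the watched collection, which keeps the stored resume token moving forward with the oplog instead of going stale during quiet weeks on DatabaseA.user.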
