Losing events when Spark Streaming application is in maintenance mode

Hello

I’m using Spark Structured Streaming to ingest events from MongoDB. I understand that on the first run it can’t ingest historical events, due to a limitation in the connector (unlike Kafka). But let’s say I stop the Spark streaming application for a couple of hours for maintenance: any new events published to the MongoDB collection during that interval are not fetched by the connector, even with checkpointing enabled. I’m wondering how others handle this scenario, since unplanned outages can also happen and this could lead to missing data.

I’d appreciate any feedback. I’m able to incrementally read events using spark.readStream.format("mongodb"), but I’m not sure how to recover the events that were missed while the Spark streaming app was down.
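For reference, this is roughly how the stream is set up (a minimal PySpark sketch; the connection string, database, collection, schema fields, sink, and paths are placeholders, and the exact option keys should be checked against the 10.3 connector docs):

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType, TimestampType

spark = SparkSession.builder.appName("mongo-stream").getOrCreate()

# Schema of the documents coming off the change stream (placeholder fields).
event_schema = StructType([
    StructField("_id", StringType()),
    StructField("payload", StringType()),
    StructField("createdAt", TimestampType()),
])

events = (
    spark.readStream.format("mongodb")
    .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")   # placeholder URI
    .option("spark.mongodb.database", "mydb")                              # placeholder database
    .option("spark.mongodb.collection", "events")                          # placeholder collection
    .option("change.stream.publish.full.document.only", "true")            # emit full documents, not change events
    .schema(event_schema)
    .load()
)

query = (
    events.writeStream
    .format("parquet")                                    # placeholder sink
    .option("path", "/data/events")                       # placeholder output path
    .option("checkpointLocation", "/checkpoints/events")  # only tracks offsets while the query is running
    .start()
)
query.awaitTermination()
```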

Additional info: the checkpoint only seems to track events that arrive while the Spark streaming application is running. If we need to bring the Spark application down for maintenance, we lose any new MongoDB events that land in the collection during that interval. Any thoughts on how others handle this behaviour with the available connector?
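One workaround that might cover the gap (not something from the connector docs, just a sketch) is a one-off batch read over the downtime window, filtered on a timestamp that the documents themselves carry, since the connector can’t replay change-stream events it never saw. Here createdAt, the window bounds, the connection details, and the sink path are all hypothetical placeholders:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("mongo-backfill").getOrCreate()

# Batch read of the collection, restricted to the maintenance window.
# "createdAt" is a hypothetical insertion-timestamp field; substitute whatever
# timestamp your documents actually contain.
missed = (
    spark.read.format("mongodb")
    .option("spark.mongodb.connection.uri", "mongodb://localhost:27017")  # placeholder
    .option("spark.mongodb.database", "mydb")                             # placeholder
    .option("spark.mongodb.collection", "events")                         # placeholder
    .load()
    .filter(col("createdAt").between("2024-05-01 02:00:00", "2024-05-01 04:00:00"))
)

# Append the recovered rows to the same sink the streaming job writes to.
missed.write.mode("append").parquet("/data/events")  # placeholder sink path
```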

I’m using mongo-spark-connector 10.3.0, which is compatible with Spark 3.4.2.

Hey Kushal, where is this mentioned in the documentation? Can you point me to it?

@Gary_Tang I have noticed the same behaviour when running the Spark streaming app with the 10.3 connector, and found a corresponding JIRA ticket where others reported the same issue. Link below in case it helps; unfortunately this isn’t documented in the MongoDB docs. For existing data I can do the first run in batch mode, so that isn’t a concern, but I’m wondering how anyone performs maintenance on a Spark application that uses this connector while events are still coming into MongoDB, short of never stopping the app.

https://jira.mongodb.org/projects/SPARK/issues/SPARK-303?filter=allopenissues