Add metadata about the operation type to the stream and use a resume token to restart at a specific location


I currently use an old in-house Structured Streaming connector to read the MongoDB change stream and write it to a data warehouse. This connector is quite old and only compatible with Spark 2.x, but it has some nice capabilities that the latest streaming connector does not have yet (I believe). I would like to discuss them to find out whether they are planned, and whether I could be involved in their development as an open-source contribution instead of maintaining a closed-source third-party connector.

The features I am mostly looking for are the following:

  • Provide metadata about each change event (operation type, operation time, and document ID) so that deletions and document deduplication can be performed in the data warehouse
  • Provide access to the resume token and allow restarting from a resume token or a specified date, to ease error recovery in the streaming application
  • Read a full collection to perform an initial data load
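To make the first point more concrete, here is a minimal sketch of how a downstream job could use that metadata for deduplication and deletes. It assumes events shaped like MongoDB change stream documents (`operationType`, `documentKey`, `clusterTime`, `fullDocument`); the values and the `apply_events` helper are purely illustrative, not part of any existing connector API.

```python
def apply_events(events):
    """Replay change events into a warehouse-like table, keeping only the
    latest version of each document and honouring deletes."""
    table = {}  # document _id -> (clusterTime, fullDocument or None for a delete)
    for ev in events:
        key = ev["documentKey"]["_id"]
        ts = ev["clusterTime"]
        # Deduplicate: skip events older than what we already applied for this key.
        if key in table and table[key][0] >= ts:
            continue
        if ev["operationType"] == "delete":
            table[key] = (ts, None)  # tombstone so stale earlier events stay ignored
        else:  # insert / update / replace carry the full document
            table[key] = (ts, ev["fullDocument"])
    # Drop tombstones before returning the materialised view.
    return {k: doc for k, (ts, doc) in table.items() if doc is not None}


events = [
    {"operationType": "insert", "clusterTime": 1,
     "documentKey": {"_id": 1}, "fullDocument": {"_id": 1, "v": "a"}},
    {"operationType": "update", "clusterTime": 2,
     "documentKey": {"_id": 1}, "fullDocument": {"_id": 1, "v": "b"}},
    {"operationType": "insert", "clusterTime": 1,
     "documentKey": {"_id": 2}, "fullDocument": {"_id": 2, "v": "x"}},
    {"operationType": "delete", "clusterTime": 3,
     "documentKey": {"_id": 2}},
]
print(apply_events(events))  # -> {1: {'_id': 1, 'v': 'b'}}
```

Without the operation type, document ID, and operation time exposed per row, this kind of replay logic cannot be implemented on the warehouse side; the resume token would play the analogous role for restarting the reader itself.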

I have some time to work on these subjects in the coming weeks, since my company is actively looking at migrating to Spark 3. I hope we can collaborate soon.

Best regards