Add metadata about the operation type to the stream and use a resume token to restart at a specific location


I currently use an old in-house Structured Streaming connector to read the MongoDB change stream and write it to a data warehouse. This connector is quite old and only compatible with Spark 2.x, but it has some nice capabilities that the latest streaming connector does not have yet (I believe). I would like to discuss them to find out whether they are planned, and whether I could be involved in their development as an open-source contribution instead of maintaining a closed-source third-party connector.

The features I am mostly looking for are the following:

  • Provide metadata about each change event (operation type, operation time, and document ID) so that deletions and document deduplication can be performed in the data warehouse
  • Provide access to the resume token and allow restarting from a resume token or a specified date, to ease error recovery in the streaming application
  • Read a full collection to perform an initial data load
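To make the first point more concrete, here is a minimal sketch of how a downstream job could use that metadata for deduplication and deletes. It assumes events shaped like MongoDB change stream documents (`operationType`, `documentKey`, `clusterTime`, `fullDocument`); the values and the `apply_events` helper are purely illustrative, not part of any existing connector API.

```python
def apply_events(events):
    """Replay change events into a warehouse-like table, keeping only the
    latest version of each document and honouring deletes."""
    table = {}  # document _id -> (clusterTime, fullDocument or None for a delete)
    for ev in events:
        key = ev["documentKey"]["_id"]
        ts = ev["clusterTime"]
        # Deduplicate: skip events older than what we already applied for this key.
        if key in table and table[key][0] >= ts:
            continue
        if ev["operationType"] == "delete":
            table[key] = (ts, None)  # tombstone so stale earlier events stay ignored
        else:  # insert / update / replace carry the full document
            table[key] = (ts, ev["fullDocument"])
    # Drop tombstones before returning the materialised view.
    return {k: doc for k, (ts, doc) in table.items() if doc is not None}


events = [
    {"operationType": "insert", "clusterTime": 1,
     "documentKey": {"_id": 1}, "fullDocument": {"_id": 1, "v": "a"}},
    {"operationType": "update", "clusterTime": 2,
     "documentKey": {"_id": 1}, "fullDocument": {"_id": 1, "v": "b"}},
    {"operationType": "insert", "clusterTime": 1,
     "documentKey": {"_id": 2}, "fullDocument": {"_id": 2, "v": "x"}},
    {"operationType": "delete", "clusterTime": 3,
     "documentKey": {"_id": 2}},
]
print(apply_events(events))  # -> {1: {'_id': 1, 'v': 'b'}}
```

Without the operation type, document ID, and operation time exposed per row, this kind of replay logic cannot be implemented on the warehouse side; the resume token would play the analogous role for restarting the reader itself.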

I have some time to work on these subjects in the coming weeks, since my company is actively looking at migrating to Spark 3. I hope we can collaborate soon.

Best regards