Hello,
I currently use an old in-house structured streaming connector to read the MongoDB change stream and write it to a data warehouse. This connector is quite old and only compatible with Spark 2.x, but it has some nice capabilities that the latest streaming connectors do not have yet (I believe). I would like to discuss them to find out whether they are planned, and whether I could get involved in their development as an open-source contribution instead of maintaining a closed-source third-party connector.
The features I am looking for are mostly the following:
- Provide metadata for each row (operation type, operation time, and document ID) so that deletions and document deduplication can be performed in the data warehouse
- Expose the resume token and allow restarting from a given resume token or a specified date, to ease error recovery for the streaming application
- Read a full collection to perform an initial data load
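To illustrate why the first point matters on the warehouse side, here is a minimal sketch of how per-row change-stream metadata (operation type, operation time, document ID) lets a sink deduplicate and apply deletes. The field names (`op_type`, `op_time`, `doc`) are illustrative assumptions, not the actual schema of any connector:

```python
# Sketch only: field names are hypothetical, not a real connector schema.
def apply_changes(table, events):
    """Apply change events to a dict keyed by document _id.

    Step 1: keep only the latest event per _id (dedup by operation time).
    Step 2: apply inserts/updates/replaces as upserts, deletes as removals.
    """
    latest = {}
    for ev in events:
        doc_id = ev["_id"]
        if doc_id not in latest or ev["op_time"] > latest[doc_id]["op_time"]:
            latest[doc_id] = ev
    for ev in latest.values():
        if ev["op_type"] in ("insert", "update", "replace"):
            table[ev["_id"]] = ev["doc"]
        elif ev["op_type"] == "delete":
            table.pop(ev["_id"], None)
    return table

# Example micro-batch: _id 1 is updated after insert, _id 2 is deleted.
events = [
    {"_id": 1, "op_type": "insert", "op_time": 1, "doc": {"v": "a"}},
    {"_id": 1, "op_type": "update", "op_time": 2, "doc": {"v": "b"}},
    {"_id": 2, "op_type": "insert", "op_time": 1, "doc": {"v": "x"}},
    {"_id": 2, "op_type": "delete", "op_time": 3, "doc": None},
]
table = apply_changes({}, events)
# table is {1: {"v": "b"}}: the stale insert for _id 1 was deduplicated
# away and _id 2 was removed by its delete event.
```

Without the operation type and time exposed by the connector, none of this reconciliation is possible downstream.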
I have some time to work on these topics in the coming weeks, since my company is actively planning a migration to Spark 3. I hope we can collaborate soon.
Best regards