Capture id of deleted document with Spark Structured Streaming

Hi,

I’m trying to replicate a MongoDB collection to Delta Lake using the MongoDB Spark Connector with Structured Streaming, but I’ve run into one problem.
When I use the option change.stream.publish.full.document.only=true I don’t get the deleted documents, which is expected.
But if I omit the option, I only get rows where the _data field is populated and all other fields are null.
I would at least expect to get the _id field so I can delete the corresponding entry in Delta.
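
For reference, here is roughly the streaming read I’m running. This is only a minimal sketch: the connection string, database and collection names are placeholders, and the change-event schema fields are my own assumption based on the raw change-event shape.

```python
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.appName("mongo-to-delta").getOrCreate()

# Rough schema for the raw change event (full.document.only omitted);
# field names are assumptions, adjust to what your events actually contain.
change_event_schema = StructType([
    StructField("_id", StringType()),            # resume token (where I only see _data)
    StructField("operationType", StringType()),  # insert / update / delete
    StructField("documentKey", StringType()),    # should carry the deleted document's _id
    StructField("fullDocument", StringType()),
])

stream = (
    spark.readStream
    .format("mongodb")
    .option("spark.mongodb.connection.uri", "mongodb://<host>:27017/")  # placeholder
    .option("spark.mongodb.database", "<db>")                           # placeholder
    .option("spark.mongodb.collection", "<coll>")                       # placeholder
    # With this set to true I get clean documents but no deletes;
    # omitting it gives me rows where only _data is populated:
    # .option("spark.mongodb.change.stream.publish.full.document.only", "true")
    .schema(change_event_schema)
    .load()
)
```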

Can someone explain how to capture deleted documents with Structured Streaming?

Thanks,
Amer

Can you use something like this?

It would be a SparkConf setting, e.g. spark.mongodb.read.aggregation.pipeline set to [{"$match": {"operationType": "insert"}}] (see the sketch after the reference below).

ref: MongoDB Connector for Spark V10 and Change Stream - #11 by khang_pham
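
Roughly where such a setting could go, assuming the connector actually applies the aggregation pipeline to a change-stream read (that is exactly what’s being tested here, so no guarantees). The connection details are placeholders and the $match stage is the one from the example above.

```python
from pyspark.sql import SparkSession

# Keep only inserts; match on "delete" instead if you want only deletes.
pipeline = '[{"$match": {"operationType": "insert"}}]'

# 1) Session-wide, as a SparkConf setting:
spark = (
    SparkSession.builder
    .appName("mongo-cdc")
    .config("spark.mongodb.read.aggregation.pipeline", pipeline)
    .getOrCreate()
)

# 2) Per-read, directly on the streaming reader:
stream = (
    spark.readStream
    .format("mongodb")
    .option("spark.mongodb.connection.uri", "mongodb://<host>:27017/")  # placeholder
    .option("spark.mongodb.database", "<db>")                           # placeholder
    .option("spark.mongodb.collection", "<coll>")                       # placeholder
    .option("aggregation.pipeline", pipeline)
    .load()
)
```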

I tried this, but the pipeline didn’t trigger. Where exactly do you want to add the pipeline in Structured Streaming?