While experimenting with the Spark connector I noticed that when changes to a single document in MongoDB are made in quick succession, only the latest version of the document is sent to the Spark connector. In such cases the number of events sent to Spark is usually correct, but all of them contain the latest version of the document. Is that normal behavior? How can I avoid it and have the whole change history sent to Spark?
Technical details
I’m using pyspark 3.3.1, connector mongo-spark-connector_2.12:10.1.1, and MongoDB 4.4.13.
I have the spark.mongodb.change.stream.publish.full.document.only property set to True.
Example
Sample code can be found here.
For instance, say I’m modifying a collection in Mongo whose documents consist only of (_id, version).
Running two separate update operations in a row (updating version from 1 to 2, and then from 2 to 3), without any delay, Spark receives in a single batch:
-------------------------------------------
Batch: 10
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|3 |
|63eaa487426f9a7396ba6199|3 |
+------------------------+-------+
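My suspicion (an assumption on my part, not something I have confirmed in the docs) is that with full.document.only enabled each change event is resolved to the document's *current* state when the batch is read, rather than to a per-update snapshot. A self-contained toy model of that suspected behavior, which reproduces the duplicated version 3:

```python
# Toy model (not the real connector): the change stream records one event per
# update, but each event carries only the _id; the full document is looked up
# at read time, after both updates have already been applied.
collection = {"63eaa487426f9a7396ba6199": {"version": 1}}

events = []  # change events hold only the _id, no snapshot of the update

def update(doc_id, new_version):
    collection[doc_id]["version"] = new_version
    events.append({"_id": doc_id})

# Two updates in quick succession, landing in the same micro-batch:
update("63eaa487426f9a7396ba6199", 2)
update("63eaa487426f9a7396ba6199", 3)

# The batch is materialized after both updates, so the lookup sees the
# latest state for every event: two rows, both with version 3.
batch = [{**e, **collection[e["_id"]]} for e in events]
print(batch)
```

This would explain why the event *count* matches the number of updates while the *contents* do not; with a delay between the updates, each batch is resolved before the next update happens.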
If instead I add a small delay between the two updates, Spark is sent separate events:
-------------------------------------------
Batch: 20
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|4 |
+------------------------+-------+
-------------------------------------------
Batch: 22
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|5 |
+------------------------+-------+