While experimenting with the Spark connector I noticed that when changes to a single document in MongoDB are made in quick succession, only the latest version of the document is sent to the Spark connector. In such cases the number of events sent to Spark is usually correct, but all of them contain the latest version of the document. Is that normal behavior? How can I avoid it and have the whole change history sent to Spark?
Technical details
I’m using pyspark 3.3.1, connector mongo-spark-connector_2.12:10.1.1, and MongoDB 4.4.13.
I have the spark.mongodb.change.stream.publish.full.document.only property set to True.
Example
Sample code can be found here.
For instance, say I’m modifying a collection in Mongo whose documents consist only of (_id, version).
Running two separate update operations in a row (updating version from 1 to 2, and then from 2 to 3), without any delay, Spark receives in a single batch:
-------------------------------------------
Batch: 10
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|3 |
|63eaa487426f9a7396ba6199|3 |
+------------------------+-------+
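My suspicion (an assumption on my part, not something I have confirmed in the docs) is that with full.document.only enabled each change event is resolved to the document's *current* state when the batch is read, rather than to a per-update snapshot. A self-contained toy model of that suspected behavior, which reproduces the duplicated version 3:

```python
# Toy model (not the real connector): the change stream records one event per
# update, but each event carries only the _id; the full document is looked up
# at read time, after both updates have already been applied.
collection = {"63eaa487426f9a7396ba6199": {"version": 1}}

events = []  # change events hold only the _id, no snapshot of the update

def update(doc_id, new_version):
    collection[doc_id]["version"] = new_version
    events.append({"_id": doc_id})

# Two updates in quick succession, landing in the same micro-batch:
update("63eaa487426f9a7396ba6199", 2)
update("63eaa487426f9a7396ba6199", 3)

# The batch is materialized after both updates, so the lookup sees the
# latest state for every event: two rows, both with version 3.
batch = [{**e, **collection[e["_id"]]} for e in events]
print(batch)
```

This would explain why the event *count* matches the number of updates while the *contents* do not; with a delay between the updates, each batch is resolved before the next update happens.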
If instead I add a small delay between the two updates, Spark is sent separate events:
-------------------------------------------
Batch: 20
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|4 |
+------------------------+-------+
-------------------------------------------
Batch: 22
-------------------------------------------
+------------------------+-------+
|_id |version|
+------------------------+-------+
|63eaa487426f9a7396ba6199|5 |
+------------------------+-------+