Understanding how Spark partitions are created

I have a collection of 1,000 documents in a standalone MongoDB instance, with an average document size of 1.04 MB. I am using the mongo-spark connector v10.1.1 to read the data, with an aggregation pipeline that uses a $sample stage of size 120. "_id" is the default auto-generated ObjectId.

      // Read from MongoDB using the options in configMap and convert each row to a JSON string
      spark.read.format("mongodb")
        .options(configMap)
        .load()
        .toJSON
The config map in the above code snippet is as follows:

    (partitioner,com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner)
    (aggregation.pipeline,[{"$sample": {"size": 120}}])
    (partitioner.options.partition.field,_id)
    (connection.uri,mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1)
    (partitioner.options.partition.size,64)
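
For reference, this is a minimal sketch of how the configMap could be assembled in Scala (the variable name and the Map literal are just for illustration; the option keys and values are exactly the ones listed above):

    // Sketch: connector options assembled as a Scala Map and passed to .options(...)
    val configMap: Map[String, String] = Map(
      "connection.uri" ->
        "mongodb://x:x@localhost:27017/mydb.mydata?readPreference=primaryPreferred&authSource=admin&authMechanism=SCRAM-SHA-1",
      "partitioner" ->
        "com.mongodb.spark.sql.connector.read.partitioner.PaginateBySizePartitioner",
      "partitioner.options.partition.field" -> "_id",
      "partitioner.options.partition.size" -> "64", // in MB
      "aggregation.pipeline" -> """[{"$sample": {"size": 120}}]"""
    )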

The results I am getting are as follows:
Run #1:
-----------------> Fetch $sample 120 documents …
-----------------> Spark Partitions: 6
-----------------> Partition 0: 120 documents total size: 122.84 MB
-----------------> Partition 1: 120 documents total size: 122.88 MB
-----------------> Partition 2: 120 documents total size: 122.90 MB
-----------------> Partition 3: 73 documents total size: 74.69 MB
-----------------> Partition 4: 61 documents total size: 62.45 MB
-----------------> Partition 5: 11 documents total size: 11.27 MB
-----------------> Actual Documents fetched: 505
Run #2:
-----------------> Fetch $sample 120 documents …
-----------------> Spark Partitions: 6
-----------------> Partition 0: 120 documents total size: 122.82 MB
-----------------> Partition 1: 120 documents total size: 122.84 MB
-----------------> Partition 2: 120 documents total size: 122.89 MB
-----------------> Partition 3: 64 documents total size: 65.47 MB
-----------------> Partition 4: 61 documents total size: 62.46 MB
-----------------> Partition 5: 1 documents total size: 1.02 MB
-----------------> Actual Documents fetched: 486
Run #3:
-----------------> Fetch $sample 120 documents …
-----------------> Spark Partitions: 5
-----------------> Partition 0: 120 documents total size: 122.86 MB
-----------------> Partition 1: 120 documents total size: 122.82 MB
-----------------> Partition 2: 101 documents total size: 103.40 MB
-----------------> Partition 3: 61 documents total size: 62.42 MB
-----------------> Partition 4: 46 documents total size: 47.11 MB
-----------------> Actual Documents fetched: 448

Please help me understand this behavior. Why does the connector create 5-6 partitions every time, and how does it arrive at that number? I was expecting two partitions for the 120 sampled documents, since the partition size is set to 64 MB.
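
My expectation of two partitions was based on this rough calculation (assuming the 1.04 MB average document size also holds for the sampled documents):

    // Back-of-the-envelope calculation behind the "2 partitions" expectation
    val avgDocSizeMB    = 1.04    // average document size
    val sampleSize      = 120     // $sample size
    val partitionSizeMB = 64      // partitioner.options.partition.size
    val sampledDataMB   = avgDocSizeMB * sampleSize                          // ≈ 125 MB
    val expectedParts   = math.ceil(sampledDataMB / partitionSizeMB).toInt   // ≈ 2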

I wanted to read 120 random documents here. Unfortunately, the $sample stage appears to be applied per partition, so the result count gets multiplied. Please suggest a solution!
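
One idea I am considering (not verified, and I am not sure it is the right approach) is to force a single partition so the aggregation pipeline, and therefore $sample, runs only once, at the cost of losing read parallelism. Based on my reading of the connector docs, that would look something like this:

    // Sketch of a possible workaround: a single partition means the pipeline executes once.
    // SinglePartitionPartitioner is my reading of the v10.x connector docs; unverified.
    val singlePartitionConfig = configMap ++ Map(
      "partitioner" ->
        "com.mongodb.spark.sql.connector.read.partitioner.SinglePartitionPartitioner"
    )

    val sampledJson = spark.read.format("mongodb")
      .options(singlePartitionConfig)
      .load()
      .toJSON

Is that a reasonable way to get exactly 120 random documents, or is there a better approach with this connector?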