Can we get an RDD of documents with mongo-spark-connector 10.1.1?

I am working on a Spark application that reads data from MongoDB, where the collections may contain millions of documents with differing structures.
Recently, I have migrated from using mongo-spark-connector 2.4.4 to 10.1.1.
Connector 10.1.1 samples 1,000 documents (the default) to infer the schema used to create the DataFrame.
Connector 2.4.4 provided MongoSpark.load(), which returns a MongoRDD[Document].
When dealing with collections of millions of documents with differing structures, 10.1.1 could lead to data loss because the schema is inferred from only 1,000 documents, and raising the sample size to a million is not practical.
Is there a way to overcome this problem, or is there still a way to get an RDD of documents as the earlier connector did?
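For reference, the 2.x-style read being described looks roughly like this (a sketch; it assumes the spark.mongodb.input.uri, database, and collection settings are already present in the Spark configuration):

```scala
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.rdd.MongoRDD
import org.apache.spark.SparkContext
import org.bson.Document

// 2.4.4-era read: an RDD of raw BSON Documents, with no schema inference involved.
def loadAsRdd(sc: SparkContext): MongoRDD[Document] =
  MongoSpark.load(sc) // connection details come from the spark.mongodb.input.* settings
```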

RDD support is not exposed in v10.x of the Spark Connector; instead you get a DataFrame whose schema is inferred as described in https://www.mongodb.com/docs/spark-connector/current/batch-mode/batch-read/#schema-inference
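For comparison, a minimal v10.x batch read returns a DataFrame and triggers that schema inference (a sketch; the connection URI, database, and collection names are placeholders):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("mongo-batch-read")
  .getOrCreate()

// v10.x batch read: the connector samples documents to infer the DataFrame schema.
val df = spark.read
  .format("mongodb")
  .option("connection.uri", "mongodb://localhost:27017")
  .option("database", "mydb")
  .option("collection", "mycoll")
  .load()
```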

You have two options: (a) explicitly specify the schema instead of relying on auto-inference, or (b) increase the number of documents that are sampled for auto-inferring the schema (though this will lead to a proportional increase in start-up latency). See the sketch after the sampleSize description below.

sampleSize: The number of documents to sample from the collection when inferring the schema.
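A sketch of both options, reusing the SparkSession from the snippet above (the field names, database, collection, and sampleSize value are illustrative placeholders):

```scala
import org.apache.spark.sql.types._

// Option (a): provide an explicit schema so nothing depends on sampling.
val explicitSchema = StructType(Seq(
  StructField("_id", StringType),
  StructField("name", StringType),
  StructField("details", StringType) // hypothetical fields for illustration
))

val dfWithSchema = spark.read
  .format("mongodb")
  .schema(explicitSchema)
  .option("database", "mydb")
  .option("collection", "mycoll")
  .load()

// Option (b): sample more documents for inference (expect a longer start-up time).
val dfLargerSample = spark.read
  .format("mongodb")
  .option("database", "mydb")
  .option("collection", "mycoll")
  .option("sampleSize", "100000")
  .load()
```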


Thanks @Prakul_Agarwal for the response!
