spark.mongodb.read.sampleSize is not working in MongoDB Spark Connector 10.1.1

I have a collection of 1000 documents with an average document size of 1 MB. I want to fetch 200 random docs, so I am setting the "sampleSize" property as shown below, but the read still fetches the entire collection. Why is the "sampleSize" configuration not working? Is there an issue with the code?

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Spark-MongoDB-Connector-Tests-001")
      .config("spark.mongodb.read.connection.uri", "mongodb://x:x@localhost:27017/")
      .config("spark.mongodb.read.database", "mydb")
      .config("spark.mongodb.read.collection", "data_1000_docs_1mb_each")
      .config("spark.mongodb.read.sampleSize", "200")
      .getOrCreate()

    spark.read.format("mongodb")
      .load()
      .toJSON.count()

The sampleSize option only controls how many documents the connector samples when inferring the schema. It does not limit how many documents are read. If you don't want the connector to sample at all, you have to set up the schema of the source documents (the ones you are reading) explicitly.
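For illustration, here is a minimal sketch of supplying an explicit schema so no inference sampling happens. The field names and types are hypothetical; replace them with the actual shape of your documents:

    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Hypothetical schema for the source documents; adjust fields to match your collection.
    val explicitSchema = StructType(Seq(
      StructField("_id", StringType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("value", IntegerType, nullable = true)
    ))

    // With an explicit schema the connector skips inference, so sampleSize is not consulted.
    val df = spark.read.format("mongodb")
      .schema(explicitSchema)
      .load()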

For fetching 200 random docs, you will have to perform that logic in the application layer. Let me know if that helps answer your question.

Thanks, @Prakul_Agarwal, for your response. Now I understand the purpose of sampleSize. To fetch 200 random documents, I am now using a $sample aggregation pipeline with the Spark connector. It works like a charm! Thanks again for your time. 🙂
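For anyone landing here later, a minimal sketch of that approach, assuming the connector's `aggregation.pipeline` read option and the same session and collection configured above:

    // Push a $sample stage down to MongoDB so only ~200 random docs are returned to Spark.
    val sampled = spark.read.format("mongodb")
      .option("aggregation.pipeline", """[{ "$sample": { "size": 200 } }]""")
      .load()

    sampled.count()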

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.