spark.mongodb.read.sampleSize is not working in MongoDB Spark Connector 10.1.1

I have a collection of 1000 documents with an average document size of 1 MB. I want to fetch 200 random docs, so I am setting the "sampleSize" property as shown below, but the read still fetches the entire collection. Why is the "sampleSize" configuration not working? Is there an issue with the code?

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("Spark-MongoDB-Connector-Tests-001")
      .config("spark.mongodb.read.connection.uri", "mongodb://x:x@localhost:27017/")
      .config("spark.mongodb.read.database", "mydb")
      .config("spark.mongodb.read.collection", "data_1000_docs_1mb_each")
      .config("spark.mongodb.read.sampleSize", "200")
      .getOrCreate()

    spark.read.format("mongodb")
      .load()
      .toJSON.count()

The sampleSize option only controls how many documents the connector samples when inferring the schema. It does not limit how many documents are read. If you don't want the connector to sample at all, you have to set up the schema of the source documents (the ones you are reading) explicitly.
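For illustration, here is a minimal sketch of supplying an explicit schema so no inference sampling happens. The field names and types are hypothetical; replace them with the actual shape of your documents:

    import org.apache.spark.sql.types.{StructType, StructField, StringType, IntegerType}

    // Hypothetical schema for the source documents; adjust fields to match your collection.
    val explicitSchema = StructType(Seq(
      StructField("_id", StringType, nullable = true),
      StructField("name", StringType, nullable = true),
      StructField("value", IntegerType, nullable = true)
    ))

    // With an explicit schema the connector skips inference, so sampleSize is not consulted.
    val df = spark.read.format("mongodb")
      .schema(explicitSchema)
      .load()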

For fetching 200 random docs, you will have to perform that logic in the application layer. Let me know if that helps answer your question.

Thanks, @Prakul_Agarwal, for your response. Now I understand the purpose of sampleSize. To fetch 200 random documents, I am now using a $sample aggregation pipeline with the Spark connector. It works like a charm! Thanks again for your time. 🙂
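For anyone landing here later, a minimal sketch of that approach, assuming the connector's `aggregation.pipeline` read option and the same session and collection configured above:

    // Push a $sample stage down to MongoDB so only ~200 random docs are returned to Spark.
    val sampled = spark.read.format("mongodb")
      .option("aggregation.pipeline", """[{ "$sample": { "size": 200 } }]""")
      .load()

    sampled.count()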

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.