I would like to copy data belonging to a particular date from Mongo to HDFS via PySpark. I am getting nearly 100,000 records per day, and this results in an Out of Memory error for me. I noticed that Mongo offers a "batchSize" input parameter, which, as I understand it, should read the data in batches and let me write to the sink in batches as well. Applying this option has not worked for me, even with a very small batchSize.
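
For reference, here is a minimal sketch of what I am attempting. The connection URI, database/collection names, date column, and output path are placeholders, and the way I am passing batchSize here is simply how I experimented with it; the format string may also differ depending on the connector version.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder connection URI, database, and collection.
spark = (
    SparkSession.builder
    .appName("mongo-to-hdfs")
    .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.events")
    .getOrCreate()
)

# Read the collection and filter it down to a single day.
# "mongo" is the format alias for the 2.x/3.x connector (the 10.x
# connector uses "mongodb"); "created_at" stands in for my actual
# date field.
df = (
    spark.read
    .format("mongo")
    .option("batchSize", "1000")  # the option I tried lowering, without success
    .load()
    .filter(F.col("created_at").between("2023-01-01", "2023-01-02"))
)

# Write the day's data to HDFS as Parquet (placeholder output path).
df.write.mode("overwrite").parquet("hdfs:///data/events/dt=2023-01-01")
```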
Could you please share some tips on:
- How can I read a query result into a DataFrame when it is larger than the available RAM?
- Are there any references for best practices and examples, apart from the official documentation?