I would like to copy data belonging to a particular date from Mongo to HDFS via PySpark. I am getting nearly 100,000 records per day, and this results in an Out of Memory error for me. I noticed that Mongo offers a "batchSize" input parameter, which, as I understand it, should read the data in batches and let me write to the sink in batches as well. Applying this option has not worked for me, even with a very small batchSize.
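
For reference, here is a minimal sketch of what I am attempting. The connection URI, database/collection names, date column, and output path are placeholders, and the way I am passing batchSize here is simply how I experimented with it; the format string may also differ depending on the connector version.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Placeholder connection URI, database, and collection.
spark = (
    SparkSession.builder
    .appName("mongo-to-hdfs")
    .config("spark.mongodb.input.uri", "mongodb://host:27017/mydb.events")
    .getOrCreate()
)

# Read the collection and filter it down to a single day.
# "mongo" is the format alias for the 2.x/3.x connector (the 10.x
# connector uses "mongodb"); "created_at" stands in for my actual
# date field.
df = (
    spark.read
    .format("mongo")
    .option("batchSize", "1000")  # the option I tried lowering, without success
    .load()
    .filter(F.col("created_at").between("2023-01-01", "2023-01-02"))
)

# Write the day's data to HDFS as Parquet (placeholder output path).
df.write.mode("overwrite").parquet("hdfs:///data/events/dt=2023-01-01")
```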
Could you please share some tips on:
- How can I read a query result into a DataFrame when it is larger than the available RAM?
- Are there any references for best practices and examples, apart from the official documentation?