Facing Java Heap Space OOM issue when large data is read on limited memory

Basant_Gurung · October 19, 2023, 9:20am

Thanks for your response, Ross. That clarifies the root cause. In my case, the schema is unknown, and it may or may not be the same across all the documents in the target collection.
Since I will have to process terabytes of this kind of data in the future, I am trying to read the documents in batches so that the job doesn’t run out of memory. Physical memory is limited: even if I allocate, say, 4 GB to the Spark driver, it can still throw an OOM error at some point. A batch of 300 documents could be 300 KB or 3 GB. Also, reading 100 MB of documents won’t necessarily consume exactly 100 MB of heap space. Small batches could leave resources underutilized, while large batches can lead to OOM.
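
To make the trade-off concrete, here is a minimal Python sketch of batching by an estimated byte budget instead of a fixed document count. It is only an illustration under assumptions: `estimated_size` and `batches_by_bytes` are hypothetical helpers, not part of Spark or the MongoDB connector, and the JSON-encoded length is a rough proxy that will not match actual heap usage.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List


def estimated_size(doc: Dict[str, Any]) -> int:
    """Rough size proxy (assumption): byte length of the document's JSON encoding.

    This is NOT the heap footprint; in-memory objects are usually larger.
    """
    return len(json.dumps(doc, default=str).encode("utf-8"))


def batches_by_bytes(
    docs: Iterable[Dict[str, Any]], max_bytes: int
) -> Iterator[List[Dict[str, Any]]]:
    """Group documents so each batch stays under max_bytes where possible.

    A single document larger than max_bytes is emitted alone in its own batch.
    """
    batch: List[Dict[str, Any]] = []
    batch_bytes = 0
    for doc in docs:
        size = estimated_size(doc)
        # Close the current batch before it would exceed the budget.
        if batch and batch_bytes + size > max_bytes:
            yield batch
            batch, batch_bytes = [], 0
        batch.append(doc)
        batch_bytes += size
    if batch:
        yield batch
```

With a budget like this, 300 tiny documents and 3 huge ones both produce batches of similar estimated size, but the budget would still need a safety margin because the estimate understates real heap usage.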