Facing Java Heap Space OOM issue when large data is read on limited memory

Ross_Lawley · October 19, 2023, 3:54pm

So sampleSize is important because it directly determines the number of documents used to infer the schema. If you chose 1, then only a single document’s schema would be used. If your data is mixed and many documents have different shapes, then a larger sample will be required. As the sample is randomly selected from the collection, you need a relatively large size to ensure a representative sample.
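To make the sampling point concrete, here is a small toy model in plain Python. It is not the connector's inference code, just a sketch of the idea: the collection, the `discount` field and the helper names `infer_schema` and `sample_and_infer` are all made up for illustration.

```python
import random


def infer_schema(docs):
    """Merge the field names and Python type names seen across docs."""
    schema = {}
    for doc in docs:
        for field, value in doc.items():
            schema.setdefault(field, set()).add(type(value).__name__)
    return schema


def sample_and_infer(collection, sample_size, seed=0):
    """Infer a schema from a random sample of at most sample_size docs."""
    rng = random.Random(seed)
    sample = rng.sample(collection, min(sample_size, len(collection)))
    return infer_schema(sample)


# Hypothetical mixed collection: 1 in 100 documents has an extra "discount" field.
collection = [
    {"_id": i, "price": 9.99, **({"discount": 0.1} if i % 100 == 0 else {})}
    for i in range(10_000)
]

full = infer_schema(collection)              # schema from every document
small = sample_and_infer(collection, 1)      # schema from a single document
large = sample_and_infer(collection, 5_000)  # schema from a large sample

print(sorted(full), sorted(small), sorted(large))
```

With a sample of 1 the inferred schema can only ever contain the fields of that one document, so a field that appears in 1% of documents will usually be absent. The larger sample is far more likely to see it.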

When reading from the collection, the documents are then shaped into that schema. So a schema that is not representative of the data is problematic: fields could be missed, or type errors can occur when converting values into the corresponding Spark types.
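The two failure modes can be sketched in the same toy style. Again, this is only an illustration under assumptions: `shape` is a hypothetical helper, the schema is a plain dict of Python types, and how the connector actually reacts to a mismatch may differ from raising an exception.

```python
def shape(doc, schema):
    """Project doc onto schema ({field: python_type}). Toy model only."""
    row = {}
    for field, expected in schema.items():
        value = doc.get(field)  # fields absent from the document become None
        if value is not None and not isinstance(value, expected):
            raise TypeError(
                f"field {field!r}: expected {expected.__name__}, "
                f"got {type(value).__name__}"
            )
        row[field] = value
    return row


# Schema inferred from a sample that never saw "discount".
schema = {"_id": int, "price": float}

# Missed data: the extra field is not in the schema, so it is dropped.
row = shape({"_id": 1, "price": 9.99, "discount": 0.1}, schema)

# Type error: this document stores price as a string.
try:
    shape({"_id": 2, "price": "9.99"}, schema)
    error = None
except TypeError as exc:
    error = str(exc)

print(row, error)
```

If you already know the shape of your data, Spark's `DataFrameReader.schema(...)` lets you supply a schema up front rather than relying on inference. Whether that suits your case depends on how varied the documents are.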

Ross