I am new to MongoDB and I have a Spark app that writes about 115 M documents to my MongoDB instance, which runs on a t3.2xlarge machine (8 cores, 32 GB memory, gp3 EBS volume with a 3000 IOPS baseline).
My Spark app runs on an EMR cluster with 4 workers (r6g.16xlarge: 64 cores, 488 GB memory each). It reads the data from S3, does some minor transformations, and then writes to my MongoDB. The storage size of the collection in MongoDB is about 15 GB, i.e. roughly 1 KB per document; in raw JSON format it’s about 80 GB, I think.
The data insertion takes about 9 to 10 minutes, and the CPU usage of my EMR cluster stays below 40% on each node. I also did some test runs with just 2 workers; the CPU usage was a bit higher, but the job still took about 10 minutes. So I am pretty sure the bottleneck is MongoDB.
MongoDB’s CPU usage is about 60–70% during insertion, and the IOPS peak at 500. But in MongoDB Compass, the slowest-operations section regularly shows operations with a waiting time of about 20,000 ms. That looks like a problem, but I don’t see what is limiting my MongoDB instance.
The average rate is about 400.0k operations per second.
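A quick back-of-the-envelope calculation on the numbers above (assuming the 9–10 minute window covers all 115 M inserts, and the ~1 KB/doc estimate from the collection stats) shows the sustained rate the single node has to absorb:

```python
# Rough insert throughput for 115 M documents in ~10 minutes.
docs = 115_000_000
seconds = 10 * 60

docs_per_sec = docs / seconds              # ~191,667 documents/second
bytes_per_sec = docs_per_sec * 1024        # at ~1 KB per raw document
mb_per_sec = bytes_per_sec / 1e6           # ~196 MB/s of document data

print(f"{docs_per_sec:,.0f} docs/s, ~{mb_per_sec:.0f} MB/s")
```

Even with WiredTiger compressing that down to the 15 GB on disk, roughly 190k single-document inserts per second is a very high rate for one t3.2xlarge, which would be consistent with the long queue wait times Compass is showing.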
Yes, a single node. The schema is the same for every document. The data is basically an activity log, so each data point is one document representing an action.
This is where the issue is. MongoDB documents are not really a one-to-one match with rows in a relational database. Here is a good YouTube video describing the differences: MongoDB Schema Design Best Practices - YouTube
Also note that MongoDB scales horizontally through sharding. If you’ve bucketed your data and are still hitting insertion limits, consider sharding: https://www.mongodb.com/docs/manual/sharding/
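To illustrate the bucketing idea for an activity log like this: instead of one document per action, events can be grouped into one document per (user, time window), which cuts the document count, and therefore the per-insert overhead, dramatically. A minimal sketch in plain Python; the field names `user_id`, `ts`, `action` and the one-hour bucket size are just assumptions for illustration, not taken from the original post:

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_events(events, bucket_seconds=3600):
    """Group per-action events into one document per (user_id, time bucket)."""
    buckets = defaultdict(list)
    for ev in events:
        # Floor the epoch timestamp to the start of its bucket window.
        bucket_start = ev["ts"] - (ev["ts"] % bucket_seconds)
        buckets[(ev["user_id"], bucket_start)].append(ev)
    # One MongoDB document per bucket instead of one per action.
    return [
        {
            "user_id": user_id,
            "bucket_start": datetime.fromtimestamp(start, tz=timezone.utc),
            "count": len(evs),
            "events": [{"ts": e["ts"], "action": e["action"]} for e in evs],
        }
        for (user_id, start), evs in buckets.items()
    ]

events = [
    {"user_id": "u1", "ts": 1_700_000_000, "action": "click"},
    {"user_id": "u1", "ts": 1_700_000_100, "action": "view"},
    {"user_id": "u2", "ts": 1_700_000_050, "action": "click"},
]
docs = bucket_events(events)
print(len(docs))  # 2 documents instead of 3
```

With hundreds of actions per bucket, 115 M events could collapse into a few hundred thousand documents, and the bucket boundary (user + time) is also a natural candidate shard key if you do end up sharding.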