I am new to MongoDB and I have a Spark app that writes about 115 M documents to my MongoDB instance, which runs on a t3.2xlarge machine (8 cores, 32 GB memory, gp3 EBS volume with a 3000 IOPS baseline).
My Spark app runs on an EMR cluster with 4 workers (r6g.16xlarge: 64 cores, 488 GB memory each). It reads the data from S3, does some minor transformations, and then writes to my MongoDB. The storage size of the collection in MongoDB is about 15 GB, i.e. roughly 1 KB per document; in raw JSON format it’s about 80 GB, I think.
The data insertion takes about 9 to 10 minutes, and the CPU usage of my EMR cluster stays below 40% on each node. I also did some test runs with just 2 workers; the CPU usage was a bit higher, but the job still took about 10 minutes. So I am pretty sure the bottleneck is MongoDB.
MongoDB’s CPU usage is about 60–70% during insertion, and the IOPS peak at 500. But in MongoDB Compass, the slowest-operations section regularly shows operations with a waiting time of about 20,000 ms. That looks like a problem, but I don’t see what is limiting my MongoDB instance.
The average rate is about 400.0k operations per second.
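A quick back-of-the-envelope calculation on the numbers above (assuming the 9–10 minute window covers all 115 M inserts, and the ~1 KB/doc estimate from the collection stats) shows the sustained rate the single node has to absorb:

```python
# Rough insert throughput for 115 M documents in ~10 minutes.
docs = 115_000_000
seconds = 10 * 60

docs_per_sec = docs / seconds              # ~191,667 documents/second
bytes_per_sec = docs_per_sec * 1024        # at ~1 KB per raw document
mb_per_sec = bytes_per_sec / 1e6           # ~196 MB/s of document data

print(f"{docs_per_sec:,.0f} docs/s, ~{mb_per_sec:.0f} MB/s")
```

Even with WiredTiger compressing that down to the 15 GB on disk, roughly 190k single-document inserts per second is a very high rate for one t3.2xlarge, which would be consistent with the long queue wait times Compass is showing.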
Yes, a single node. The schema is the same for every document. The data is basically an activity log, so each data point is one document representing an action.
This is where the issue is. MongoDB documents are not really a one-to-one match with rows in a relational database. Here is a good YouTube video describing the differences: MongoDB Schema Design Best Practices - YouTube
Also note that MongoDB scales horizontally through sharding. If you’ve bucketed your data and are still hitting insertion limits, consider sharding: https://www.mongodb.com/docs/manual/sharding/
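To illustrate the bucketing idea for an activity log like this: instead of one document per action, events can be grouped into one document per (user, time window), which cuts the document count, and therefore the per-insert overhead, dramatically. A minimal sketch in plain Python; the field names `user_id`, `ts`, `action` and the one-hour bucket size are just assumptions for illustration, not taken from the original post:

```python
from collections import defaultdict
from datetime import datetime, timezone

def bucket_events(events, bucket_seconds=3600):
    """Group per-action events into one document per (user_id, time bucket)."""
    buckets = defaultdict(list)
    for ev in events:
        # Floor the epoch timestamp to the start of its bucket window.
        bucket_start = ev["ts"] - (ev["ts"] % bucket_seconds)
        buckets[(ev["user_id"], bucket_start)].append(ev)
    # One MongoDB document per bucket instead of one per action.
    return [
        {
            "user_id": user_id,
            "bucket_start": datetime.fromtimestamp(start, tz=timezone.utc),
            "count": len(evs),
            "events": [{"ts": e["ts"], "action": e["action"]} for e in evs],
        }
        for (user_id, start), evs in buckets.items()
    ]

events = [
    {"user_id": "u1", "ts": 1_700_000_000, "action": "click"},
    {"user_id": "u1", "ts": 1_700_000_100, "action": "view"},
    {"user_id": "u2", "ts": 1_700_000_050, "action": "click"},
]
docs = bucket_events(events)
print(len(docs))  # 2 documents instead of 3
```

With hundreds of actions per bucket, 115 M events could collapse into a few hundred thousand documents, and the bucket boundary (user + time) is also a natural candidate shard key if you do end up sharding.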