We are using the MongoDB Node driver to invoke bulkWrite.
We have very minimal indexes.
Using an M20 cluster.
Each bulk write consists of a batch of 10,000 records, and the batches execute in series.
Is there any way to enhance the performance of this bulk write? Would switching to another MongoDB driver help here, given that the driver only invokes bulkWrite?
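For context, here is a minimal sketch of the kind of upsert batch being discussed, using the Node driver's bulkWrite. The filter field and document shape are illustrative assumptions; `ordered: false` lets the server continue past individual errors and avoid serializing on operation order, which often helps pure upsert workloads:

```javascript
// Sketch: upsert one batch of documents with an unordered bulkWrite.
// `collection` is an already-connected driver collection;
// matching on _id is an illustrative assumption.
async function upsertBatch(collection, docs) {
  const ops = docs.map((doc) => ({
    updateOne: {
      filter: { _id: doc._id }, // match on the unique key
      update: { $set: doc },
      upsert: true,
    },
  }));
  // ordered: false allows the server to keep going after an
  // individual failure instead of aborting the remaining operations.
  return collection.bulkWrite(ops, { ordered: false });
}
```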
It looks like you are trying to throttle the whole operation. Why only 10,000 documents? Why in series? From where (hardware and network) are you calling bulkWrite? 3.6 million documents of 1 KB each is a lot different from 3.6 million documents of 1 MB each. Have you tried different work-distribution strategies: fewer or more documents per bulkWrite, in parallel rather than in series? Knowing everything that you tried would save us a lot of time, because proposing something you already tried and already know does not work is a waste of time. And what is your use-case? Is it frequent that you must insert 3.6M documents? Or are you trying to stress-test your system by simulating 3.6M people doing 1 insert each?
Why only 10000 documents?
It was mentioned here that bulkWrite has a maximum of 100,000 operations per batch. And we noticed that a batch size of 100,000 performed poorly compared to 10,000.
Why in series?
My bad, I missed one major point: we are processing 10 batches in parallel.
I’ve shared below different configurations of parallel ops and batch size, along with the corresponding time taken to upsert 100,000 records (we reduced the number of records just to see how it behaves with varying configurations).
| MongoDB Cluster Tier | No. of Parallel Ops | Batch Size | Time to upsert 100,000 records (ms) |
| --- | --- | --- | --- |
| M20 | 10 | 2000 | 23049 |
| M20 | 10 | 1000 | 23725 |
| M20 | 10 | 2500 | 24443 |
| M20 | 5 | 1000 | 27469 |
| M20 | 5 | 2000 | 27578 |
| M20 | 10 | 5000 | 27611 |
| M20 | 5 | 2500 | 28427 |
| M20 | 5 | 10000 | 29561 |
| M20 | 10 | 10000 | 30674 |
| M20 | 5 | 5000 | 37241 |
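A sweep like the one above can be driven by a small harness along these lines (a sketch; `writeBatch` is a stand-in for the real bulkWrite-based writer, and the worker-pool shape is one of several reasonable ways to cap in-flight batches):

```javascript
// Split `docs` into fixed-size batches.
function chunk(docs, size) {
  const batches = [];
  for (let i = 0; i < docs.length; i += size) {
    batches.push(docs.slice(i, i + size));
  }
  return batches;
}

// Run batches with at most `parallelism` writes in flight at once.
// `writeBatch` stands in for the real bulkWrite-based writer.
async function runBatches(docs, { batchSize, parallelism }, writeBatch) {
  const batches = chunk(docs, batchSize);
  let next = 0;
  async function worker() {
    // Safe without locks: the index is claimed synchronously
    // before each await, and Node is single-threaded.
    while (next < batches.length) {
      const batch = batches[next++];
      await writeBatch(batch);
    }
  }
  const workers = Array.from(
    { length: Math.min(parallelism, batches.length) },
    () => worker()
  );
  await Promise.all(workers);
  return batches.length;
}
```

Timing each `runBatches` call (e.g. with `Date.now()` before and after) for each batch-size/parallelism pair reproduces the table.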
From where (hardware and network) are you calling bulkWrite?
I’m using an M1 MacBook Pro, although this bulk write is expected to be initiated from an AWS Lambda (Amazon Linux). We have tested this action on the Lambda and it behaves the same way there as well.
Here are the specifications for the MacBook:
- Model Name: MacBook Pro
- Model Identifier: MacBookPro17,1
- Total Number of Cores: 8 (4 performance and 4 efficiency)
- Memory: 16 GB
- Network: 200 Mbps
- Average size of each document: 428 B
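As a rough sanity check on the network side (back-of-envelope arithmetic, assuming the 428 B average document size and the 200 Mbps link stated above, and ignoring BSON/wire-protocol overhead and compression):

```javascript
// Raw payload size and minimum transfer time for the full
// 3.6M-document load over a 200 Mbps link.
const docCount = 3_600_000;
const avgDocBytes = 428;
const linkMbps = 200;

const totalBytes = docCount * avgDocBytes;    // 1,540,800,000 B
const totalGB = totalBytes / 1e9;             // ~1.54 GB
const linkBytesPerSec = (linkMbps * 1e6) / 8; // 25 MB/s
const minSeconds = totalBytes / linkBytesPerSec; // ~62 s

console.log(totalGB.toFixed(2), "GB,", minSeconds.toFixed(0), "s minimum");
```

So even with an ideal server, the full 3.6M-document load cannot finish in much under a minute from a 200 Mbps client.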
Have you tried different work-distribution strategies?
I’ve shared the different configs we have tried above.
And what is your use-case? Is it frequent that you must insert 3.6M documents?
This is not a very frequent action; it’s expected to happen maybe once or twice a day. We are building an LMS, and our clients would have 10K students and 300 contents (in each course). We need to insert 3 million records into the DB for an architectural use case (when the contents are created and published all at once). Although this is the worst-case scenario, we need our system to support this number.
This is most likely caused by the fact that you are doing performance testing on a cluster tier that is wrong for this purpose. See the following note about M20.
Do you have any metrics on how the cluster performs? CPU usage, disk I/O, RAM, etc.?
So basically, you are expecting good performance out of something configured for low traffic.
So before trying to optimize your code you have to make sure you are using the proper infrastructure.
We checked the metrics, and it did seem like CPU usage was touching 100% at times. We will upgrade to a higher-tier cluster and monitor the performance.
Also, we had some questions related to code-level optimisations that we could do once we opt for a higher tier.
Do you recommend processing in batches ?
How many batches should we ideally process in parallel ?
Would the MongoDB driver play any significant role with regard to a bulk write (I had read about the driver performance comparison here)? If yes, which driver client do you recommend?
Thanks for the input @steevej. This has been helpful.
I do not think that the driver would make a big difference during bulkWrite, as the bulk of the work is done on the server. About the comparison link you shared, one issue I see reading diagonally is:
The benchmark was ran in my personal laptop (i7–6th generation, 24 GB RAM), against a dockerized MongoDB 4.2 also running locally, so network times are negligible but on the other hand we have to take in account that the same machine has to cope with running the DB server and the client concurrently.
Yes, network times are negligible, but context switches increase. You have to test with an architecture that is close to what you want to implement. If running the client and the server on the same machine is not your use-case, then your results may differ.
It all depends on your use-cases. I think your approach of testing different numbers of batches vs. batch sizes is appropriate and should provide you with the correct configuration once you figure out and eliminate your bottlenecks, until you reach the performance you want. In your table there is no order-of-magnitude difference between the configurations, and the fact that you were hitting 100% CPU indicates that the bottleneck was indeed the M20. Only running the same tests on a bigger machine can give you insight into the best config.
There is no easy answer, except the easiest of them all: it depends.