Hi,
Problem: Loading roughly 100 million documents into MongoDB with the mongoimport command takes approximately 10 hours. I need to make this insertion operation as fast as possible.
Environment and other details:
I have around 40,000 JSON files, each containing 2,500-3,000 documents in a JSON array. The average document size is 22KB (taken from db.stats()), and every document already has its _id field set before insertion starts. In total there are roughly 100 million documents that I want to insert into MongoDB. I currently have 7 DB VMs, and MongoDB runs in containers, i.e. mongo_router, mongo_config and mongo_shard. Sharding is enabled. I am using the --numInsertionWorkers option with mongoimport, currently set to 5.
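For reference, this is roughly how I invoke mongoimport today; the host, database, collection and file names below are placeholders, not my real ones:

```sh
# Placeholder host/db/collection/file names; the flags match what I actually use
mongoimport --host mongo_router:27017 \
  --db mydb --collection mycoll \
  --jsonArray --numInsertionWorkers 5 \
  --file /data/part-00001.json
```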
Questions:
- How can I make this insertion faster?
- Increasing --numInsertionWorkers beyond 5 for the insertion of ~3,000 documents actually makes the process slower. Why is that? How do I find an optimal --numInsertionWorkers value?
- Would running the mongoimport command on the 7 DB VMs in parallel work? I would be importing data from 7 JSON files, one on each of the 7 DB VMs, at the same time (see the first sketch after this list). Would this really make a difference in the insertion time? Also, would inserting data in parallel from different endpoints affect or modify the data in any way? I wouldn't want that to happen.
- Is having one document per line in the JSON file better than having the documents in a JSON array? I saw an improvement of about 10 seconds when I dropped the --jsonArray option, but after some 100 runs of mongoimport it showed the same timing as with --jsonArray (see the second sketch after this list for how I produce the per-line files).
- Does the insertion time depend on the size of the documents being inserted, on the number of documents, or on both?
- Should I use fewer JSON files with more documents per file, or more JSON files with fewer than the current ~3,000 documents each? Would either help reduce the time?
- Are there any other options I should try? Please help.
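For the parallel-import question above, this is the kind of thing I had in mind on each of the 7 DB VMs, with each VM working through its own subset of the files; the host, database, collection and directory names are placeholders:

```sh
# Hypothetical sketch: run on each DB VM against that VM's own subset of files,
# all writing through the shared mongos router. Host/db/collection/paths are placeholders.
for f in /data/subset-for-this-vm/*.json; do
  mongoimport --host mongo_router:27017 \
    --db mydb --collection mycoll \
    --jsonArray --numInsertionWorkers 5 \
    --file "$f"
done
```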
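And for the one-document-per-line question, this is how I convert a JSON-array file for testing without --jsonArray, assuming jq is available; the file names are placeholders:

```sh
# Convert a JSON-array file into one document per line (NDJSON), assuming jq is installed
jq -c '.[]' input-array.json > input-ndjson.json

# then import without --jsonArray (host/db/collection are placeholders)
mongoimport --host mongo_router:27017 --db mydb --collection mycoll \
  --numInsertionWorkers 5 --file input-ndjson.json
```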