Hi,
Problem: Loading roughly 100 million documents into MongoDB with the mongoimport command takes approximately 10 hours. I need to make this insertion operation as fast as possible.
Environment and other details:
I have around 40,000 JSON files, each containing 2,500-3,000 documents in a JSON array. The average document size is 22KB (taken from db.stats()), and every document already has its _id field set before insertion starts. In total there are roughly 100 million documents that I want to insert into MongoDB. I currently have 7 DB VMs, and MongoDB runs in containers, i.e. mongo_router, mongo_config and mongo_shard. Sharding is enabled. I am using the --numInsertionWorkers option with mongoimport, currently set to 5.
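For reference, this is roughly how I invoke mongoimport today; the host, database, collection and file names below are placeholders, not my real ones:

```sh
# Placeholder host/db/collection/file names; the flags match what I actually use
mongoimport --host mongo_router:27017 \
  --db mydb --collection mycoll \
  --jsonArray --numInsertionWorkers 5 \
  --file /data/part-00001.json
```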
Questions:
- How can I make this insertion faster?
- Increasing --numInsertionWorkers beyond 5 for the insertion of ~3,000 documents actually makes the process slower. Why is that? How do I find an optimal --numInsertionWorkers value?
- Would running the mongoimport command on the 7 DB VMs in parallel work? I would be importing data from 7 JSON files, one on each of the 7 DB VMs, at the same time (see the first sketch after this list). Would this really make a difference in the insertion time? Also, would inserting data in parallel from different endpoints affect or modify the data in any way? I wouldn't want that to happen.
- Is having one document per line in the JSON file better than having the documents in a JSON array? I saw an improvement of about 10 seconds when I dropped the --jsonArray option, but after some 100 runs of mongoimport it showed the same timing as with --jsonArray (see the second sketch after this list for how I produce the per-line files).
- Does the insertion time depend on the size of the documents being inserted, on the number of documents, or on both?
- Should I use fewer JSON files with more documents per file, or more JSON files with fewer than the current ~3,000 documents each? Would either help reduce the time?
- Are there any other options I should try? Please help.
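For the parallel-import question above, this is the kind of thing I had in mind on each of the 7 DB VMs, with each VM working through its own subset of the files; the host, database, collection and directory names are placeholders:

```sh
# Hypothetical sketch: run on each DB VM against that VM's own subset of files,
# all writing through the shared mongos router. Host/db/collection/paths are placeholders.
for f in /data/subset-for-this-vm/*.json; do
  mongoimport --host mongo_router:27017 \
    --db mydb --collection mycoll \
    --jsonArray --numInsertionWorkers 5 \
    --file "$f"
done
```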
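And for the one-document-per-line question, this is how I convert a JSON-array file for testing without --jsonArray, assuming jq is available; the file names are placeholders:

```sh
# Convert a JSON-array file into one document per line (NDJSON), assuming jq is installed
jq -c '.[]' input-array.json > input-ndjson.json

# then import without --jsonArray (host/db/collection are placeholders)
mongoimport --host mongo_router:27017 --db mydb --collection mycoll \
  --numInsertionWorkers 5 --file input-ndjson.json
```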