I belong to a group working on an open data voting project. We have settled on Mongo for the database after looking at a multitude of others for the project, mainly because each of the formats for voting data we get from different states and/or counties is in a different format. So my question is this- we currently have 29 data sources (state voting files) each one with an average of 6.7 million entries. What is the fastest and most accurate way to import this data into Mongo? We have one team member that has some experience with mongo and it is taking forever to import the FL voting record (7.7 million entries). It has been importing for 11 days now and we only have 4.5 million records. The server is a 16 core Xeon with 128GB of RAM and two 2TB SSD drives and is only running at about 18% capacity. The server is running Ubuntu 20 server.
Sure could use some input on if we are doing this right and again, if this is the wrong place please excuse my mistake and point me to the right one?
Some inputs required in-order to help
What is the current load mechanism you are using.? Is the load script just reading row by row in sequence of a voting file, transforming to json document and just inserting to your Mongo DB Collection?
I assume that is the case (or multi-threading) give you mentioned it is just consuming 18% CPU.
To load the data efficiently, you would need to split the file and process them in parallel.
There are two options of parallelism. You can use multi-threading or multi-processing.
You might be using sequential processing or multi-threading and hence the delay. Since the processing is done in single core by spawning multiple threads.
But in reality you need multi-processing, that utilizes all of the 16 cores in your server.
Not sure which language you are using to load to mongo db but in python you have separate commands for doing something multi-threading vs multi-processing. So similar packages would exist in Java or other such interfaces as well.
To confirm my assumption, when your script is running, just run the “top” command to check what CPU # is being used by your load process. If it is using just one number, this confirms your code is not utilizing the 16-core to the fullest.
Disclaimer: I am just a mongo db beginner but have experience with data loading in general. Hope my input helps
Thank you for taking time to answer my question. We are using a multi-threaded NodeJS app where we go in and define the headerfile or, in the absence of one, create one and then import pointing to correct fields. I do not think we are converting to JSON, just loading as a CSV. When I issue “Top” I see the CPU running at 358%, so divide by 16 CPU’s and we get about 22%. I was not going to learn Mongo myself, but now I am getting drawn into this world!
Did some research and seems mongoimport utility is the ideal one for your use case. (Seems to have throughput of loading millions of records in just a few mins)
official doc link:
Stackoverflow example for how to use mongoimport syntax.
Thanks again Anil! I will check these out and see if I can find the bottleneck.
After looking at the import process we discovered the issue was with the “custom built” nodejs app that was the bottleneck. I was able to create an instance of Mongo locally and import 4 million records from a CVS file with a delineated header in about 11 minutes. Thank you for the help!
That is great!!! Glad to be of help