Insert billions of documents into MongoDB

Hi everyone,
I tried to use Go to insert about 700 billion documents into one collection on my computer (2.2 GHz 4-core Intel Core i7, 16 GB RAM).

my program:
I use a concurrent approach with 50 goroutines; each goroutine receives one document from an unbuffered channel and inserts it into the database with the InsertOne() function.
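Roughly, it looks like this (a simplified sketch with the official mongo-go-driver; the URI, database name, and sample documents are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"log"
	"sync"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)
	coll := client.Database("test").Collection("docs")

	docs := make(chan bson.D) // unbuffered: each send waits for a free worker
	var wg sync.WaitGroup

	// 50 workers, each inserting one document at a time with InsertOne
	for i := 0; i < 50; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for doc := range docs {
				if _, err := coll.InsertOne(ctx, doc); err != nil {
					log.Println("insert failed:", err)
				}
			}
		}()
	}

	// producer (a tiny sample instead of 700 million documents)
	for n := 0; n < 1000; n++ {
		docs <- bson.D{{Key: "n", Value: n}, {Key: "msg", Value: fmt.Sprintf("doc %d", n)}}
	}
	close(docs)
	wg.Wait()
}
```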

my result:
After leaving my computer running for a whole night, I woke up and found that about 400 million documents had been inserted into MongoDB; when I woke up in the middle of the night, the count was about 200 million.
I am a newbie and want to share more things I discovered and am confused about.

my confusion:
I think it's because the count is so big: the number of documents shown in Compass is N/A, so I went to mongosh and used db.collection.countDocuments() to get it. However, it usually takes minutes to respond, so I cannot directly see what's going on in the database.

my questions:
Is there a better way to insert millions of documents quickly? Should I use another function like Bulk.insert() or InsertMany()? (See the sketch below for what I mean by batching.)
Why does performance seem to drop as the number of documents grows?
MongoDB is slow when counting millions of documents; is that normal?
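For reference, this is the kind of batching I have in mind (a rough sketch; the batch size is arbitrary):

```go
import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

// insertInBatches drains the channel and writes documents 1,000 at a
// time with InsertMany; unordered, so one bad document doesn't abort
// the whole batch.
func insertInBatches(ctx context.Context, coll *mongo.Collection, docs <-chan bson.D) error {
	const batchSize = 1000
	batch := make([]interface{}, 0, batchSize)
	opts := options.InsertMany().SetOrdered(false)
	for doc := range docs {
		batch = append(batch, doc)
		if len(batch) == batchSize {
			if _, err := coll.InsertMany(ctx, batch, opts); err != nil {
				return err
			}
			batch = batch[:0]
		}
	}
	if len(batch) > 0 { // flush the final partial batch
		if _, err := coll.InsertMany(ctx, batch, opts); err != nil {
			return err
		}
	}
	return nil
}
```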

Sorry, it's 700 million, not billion.

Hi @inokiyo welcome to the community!

I would say that the insert speed depends primarily on the hardware. Although you can create a highly parallel script to do the insertion, there are only so many cores to process the work. A larger server with more processing power will definitely be able to do this faster.

If applicable, using an official tool such as mongoimport may be beneficial, since it was written with the best insert performance in mind (given the hardware).
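For example, something along these lines (the database, collection, and file names are placeholders; --numInsertionWorkers controls how many insert workers run in parallel):

```sh
mongoimport --uri="mongodb://localhost:27017" \
  --db=test --collection=docs \
  --file=docs.json --numInsertionWorkers=8
```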

MongoDB is slow when counting millions of documents; is that normal?

If the server is still doing heavy insertion work when you execute this command, then you're asking it to do even more work, so yes, it will be slow 🙂

If you don't need high precision, you might be able to use db.collection.estimatedDocumentCount(), which returns the number from the collection's metadata instead. It is less precise but does not impose extra work on the server.
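In the Go driver, the two counts look like this (a quick sketch, assuming coll is your *mongo.Collection):

```go
import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// countFast reads the count from the collection's metadata: cheap even
// while a heavy insert workload is running, but only an estimate.
func countFast(ctx context.Context, coll *mongo.Collection) (int64, error) {
	return coll.EstimatedDocumentCount(ctx)
}

// countExact really counts matching documents, which is why it can take
// minutes on hundreds of millions of documents.
func countExact(ctx context.Context, coll *mongo.Collection) (int64, error) {
	return coll.CountDocuments(ctx, bson.D{}) // empty filter = count all
}
```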

Best regards
Kevin


That's odd that counting takes so long. Is it possible that you're inserting documents into many collections instead of into one collection? Creating more than 10,000 collections is known to cause serious performance problems.

As for performance when there are many documents in a collection, you may need to add indexes so that queries don't have to scan many documents. Which indexes to add depends on what operations you're trying to do.
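For example, a single-field index with the Go driver would look something like this (a sketch; "userId" is a hypothetical field):

```go
import (
	"context"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// ensureIndex builds an ascending index on the hypothetical "userId"
// field so queries filtering on it don't scan the whole collection.
func ensureIndex(ctx context.Context, coll *mongo.Collection) (string, error) {
	return coll.Indexes().CreateOne(ctx, mongo.IndexModel{
		Keys: bson.D{{Key: "userId", Value: 1}}, // 1 = ascending order
	})
}
```

Keep in mind that each index also adds work on every insert, so for a write-heavy workload, create only the indexes your queries actually need.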