Fast and reliable way to export a collection with approximately ten million documents?

Hello, I am working with a collection that has more than ten million documents and counting. Each record has a few fields with short values as well as a field containing an array of 768 floating point numbers. I am trying to figure out the fastest way to export these records so that I can read them and use them to update entries in a separate SaaS product.

I was originally thinking of using mongoexport to export this collection in chunks of 100,000 documents. I thought that perhaps I could run one mongoexport command per CPU core to speed up the operation, and I wrote a retry mechanism that re-runs a mongoexport command if it times out or hits some other kind of intermittent failure.
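Roughly, the approach looks like the sketch below. This is a simplified illustration, not my exact script: the connection string, collection name, and skip/limit-based chunking are placeholders.

```python
import subprocess
from concurrent.futures import ProcessPoolExecutor

MONGO_URI = "mongodb+srv://user:pass@cluster.example.mongodb.net/mydb"  # placeholder
COLLECTION = "embeddings"   # placeholder collection name
CHUNK_SIZE = 100_000
TOTAL_DOCS = 10_000_000
MAX_RETRIES = 3

def export_chunk(chunk_index: int) -> None:
    """Export one 100,000-document chunk with mongoexport, retrying on failure."""
    out_file = f"chunk_{chunk_index:05d}.json"
    cmd = [
        "mongoexport",
        f"--uri={MONGO_URI}",
        f"--collection={COLLECTION}",
        "--sort={_id: 1}",                    # stable order so skip/limit chunks don't overlap
        f"--skip={chunk_index * CHUNK_SIZE}",
        f"--limit={CHUNK_SIZE}",
        f"--out={out_file}",
    ]
    for attempt in range(1, MAX_RETRIES + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        print(f"chunk {chunk_index}: attempt {attempt} failed with code {result.returncode}")
    raise RuntimeError(f"chunk {chunk_index} failed after {MAX_RETRIES} attempts")

if __name__ == "__main__":
    chunk_count = (TOTAL_DOCS + CHUNK_SIZE - 1) // CHUNK_SIZE
    # ProcessPoolExecutor defaults to one worker per CPU core
    with ProcessPoolExecutor() as pool:
        list(pool.map(export_chunk, range(chunk_count)))
```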

After doing some testing with smaller collections, I deployed this export system on a 64-core EC2 instance in us-east-2 and scaled my MongoDB cluster to an M40. The first two dozen chunks downloaded fine, but soon I started seeing failures. Checking the M40 metrics, I can see that for one of the shards disk utilization is at 100%, kernel CPU utilization is about 300%, and user CPU utilization is at around 60%.

Eventually, even with the retry mechanism in place, mongoexport operations were failing just as they approached the end of a 100,000-document chunk.

Questions:

  1. Can multiple concurrent mongoexport operations be run against an M40 cluster? I tried 64 concurrent operations. Should I try a smaller number? If so, what is the maximum supported number for an M40?
  2. What causes mongoexport operations to fail with status code 256?
  3. How would you approach exporting such a large collection?

Thanks so much for your help!

Sorry for hopping in. This is not my area, but the answers might help others too: are the documents under constant change, like user data, or are they accumulated data added daily or so?

Also, knowing the server version along with the error message would help, as would knowing whether all shards have the same CPU/disk/network capabilities (twins) or some have lower capacity.

By the way, lowering the chunk size will mean more round trips for you, but it may also solve the problem.

One missing detail in your case was bugging me: a poorly chosen shard key, combined with time-related daily entries, can cause a huge amount of data to accumulate on a single shard.

Time-based data is such an important case that MongoDB added the ability to create “Time Series Collections” to store it in v5.0. See Time Series — MongoDB Manual.

If this is the case for your data, you need to go easier on that range of data while exporting. You may also try re-sharding and/or converting the collection to a Time Series Collection.
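For reference, creating a time series collection from Python looks roughly like this; the collection name and the timestamp/metadata field names below are just examples, so adjust them to your schema:

```python
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster.example.mongodb.net")  # placeholder URI
db = client["mydb"]  # placeholder database name

# Example only: "ts" and "source" stand in for whatever timestamp and
# metadata fields your documents actually have.
db.create_collection(
    "embeddings_ts",
    timeseries={
        "timeField": "ts",       # required: the field holding each document's timestamp
        "metaField": "source",   # optional: metadata used to group measurements
        "granularity": "hours",  # roughly how often new documents arrive
    },
)
```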

Hi @Yilmaz_Durmaz, thank you so much for your reply.

are the documents under constant change, like user data, or are they accumulated data added daily or so?

These documents are effectively read-only. They are only written to occasionally as part of seldom-run batch data migration operations. Every night additional documents are added.

Also, knowing the server version along with the error message would help, as would knowing whether all shards have the same CPU/disk/network capabilities (twins) or some have lower capacity.

Server version is 5.0.13. I don’t have the error message from mongoexport, but I can potentially rewrite my code to expose it. I am using MongoDB Atlas M40. There is a primary node and two secondary nodes. I assume that all have the same capacity. Monitoring shows that really only the primary node is exhibiting heavy disk and CPU utilization.

By the way, lowering the chunk size will mean more round trips for you, but it may also solve the problem.

Thank you, I will try that.

Thanks for letting me know about time series collections and the danger of poorly chosen shard keys. Looking at our cluster, it doesn’t appear that we are using horizontal scaling or sharding. By “sharding” I mean the techniques described here: https://www.mongodb.com/docs/manual/sharding/

For 10 million+ documents, would you recommend horizontal scaling?

No sharded clusters. Check.

Sharding is needed when the total “size” matters. 768 floating point numbers plus the other fields should not come to more than about 10 KB per document. But you are saying “ten million documents”, which would correspond to roughly 100 GB in total. So whether to shard is a tough call here.

You said you try exporting 100,000 documents at a time, which would correspond to about 1 GB; that is not much, until you add that you run 64 of these operations concurrently. Things get a little blurry here. An M40 should deal with these sizes, and maybe also with the load from 64 operations.
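The back-of-the-envelope math I am using (rough numbers; the ~10 KB per document figure is an assumption, not a measurement):

```python
# Rough size estimate, ignoring BSON/JSON overhead and the other small fields.
vector_bytes = 768 * 8              # 768 doubles ≈ 6 KB of raw numbers
per_doc = 10 * 1024                 # call it ~10 KB per document with field names and overhead
total = 10_000_000 * per_doc        # ≈ 100 GB for the whole collection
chunk = 100_000 * per_doc           # ≈ 1 GB per 100,000-document chunk
in_flight = 64 * chunk              # ≈ 64 GB if all 64 exports run at once
```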

But in return, you need enough bandwidth to get the responses from each of those operations before timeouts start to happen. Depending on the speed (plus other settings), anything can happen.

It would help if you could share the full line of the error that carries the “256” code. I could not find a direct reference, and traversing the source code is not easy with just that number, because there are lots of other keywords containing 256; “SCRAM-SHA-256” in particular takes up a lot of space.
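One guess, assuming your retry wrapper captured a raw wait status (as returned by Python’s os.system or os.waitpid on Linux) rather than the decoded exit code: 256 would then just be exit code 1 shifted into the high byte, and a plain exit code of 1 only tells you the operation failed, so the useful detail is in the error message mongoexport printed.

```python
import os

# Hypothetical: if the "256" in your logs is a raw wait status rather than
# an exit code, decoding it recovers the real exit code (256 == 1 << 8).
raw_status = 256
print(os.waitstatus_to_exitcode(raw_status))  # prints 1
```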

Assuming your data includes timestamps, converting to a time series collection is still a valid suggestion for your documents for future purposes.

Also, I would like to hear the result if you try smaller chunk sizes when fetching the data.