Hello, I am working with a collection that has more than ten million documents and counting. Each record has a few fields with short values, as well as a field containing an array of 768 floating-point numbers. I am trying to figure out the fastest way to export these records so I can read them and use them to update entries in a separate SaaS product.
I was originally planning to use mongoexport to export this collection in chunks of 100,000 documents. I thought that perhaps I could run one mongoexport command per CPU core to speed up the operation, and I wrote a retry mechanism that re-runs a mongoexport command if it times out or hits some other kind of intermittent failure. A simplified sketch of my driver script is below (the URI, collection name, and document count are placeholders; the real script also does logging):
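```python
import subprocess
import time
from concurrent.futures import ThreadPoolExecutor

CHUNK_SIZE = 100_000
TOTAL_DOCS = 10_000_000  # rough collection size, placeholder
WORKERS = 64             # one worker per CPU core
MAX_RETRIES = 3
URI = "mongodb+srv://user:pass@cluster.example.net/mydb"  # placeholder

def export_chunk(chunk_index: int) -> None:
    """Run mongoexport for one 100k-document slice, retrying on failure."""
    cmd = [
        "mongoexport",
        "--uri", URI,
        "--collection", "embeddings",  # placeholder collection name
        "--skip", str(chunk_index * CHUNK_SIZE),
        "--limit", str(CHUNK_SIZE),
        "--out", f"chunk_{chunk_index:05d}.json",
    ]
    for attempt in range(1, MAX_RETRIES + 1):
        result = subprocess.run(cmd)
        if result.returncode == 0:
            return
        print(f"chunk {chunk_index} failed (exit {result.returncode}), attempt {attempt}")
        time.sleep(5 * attempt)  # back off before retrying

    raise RuntimeError(f"chunk {chunk_index} failed after {MAX_RETRIES} attempts")

with ThreadPoolExecutor(max_workers=WORKERS) as pool:
    futures = [pool.submit(export_chunk, i)
               for i in range(TOTAL_DOCS // CHUNK_SIZE + 1)]
    for f in futures:
        f.result()  # surface the first failure, if any
```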
After some testing with smaller collections, I deployed this export system on a 64-core EC2 instance in us-east-2 and scaled my MongoDB cluster to an M40. The first two dozen chunks downloaded fine, but soon I started seeing failures. Checking the M40 metrics, I can see that on one of the shards disk utilization is at 100%, kernel CPU utilization is around 300%, and user CPU utilization is around 60%.
Eventually, even with the retry mechanism in place, mongoexport operations were failing just as they approached a complete download of a 100,000-document chunk.
- Can multiple concurrent mongoexport operations be run against an M40 cluster? I tried 64 concurrent operations. Should I try a smaller number? If so, what is the maximum supported number for an M40?
- What causes mongoexport operations to fail with status code 256?
- How would you approach exporting such a large collection?
Thanks so much for your help!