Split documents equally across multiple processes

What is the best approach for processing a large collection (millions of documents) in batches across multiple processes? How can I divide the documents into approximately equal chunks without adding extra fields? How the documents are divided does not matter for the processing itself; what matters is that the chunks are of similar size (roughly equal processing time).

Is it possible to run a find using a hash function with modulo on a complex _id field?
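
Something along these lines is what I have in mind (just a sketch in Python/pymongo; it takes the modulo of the ObjectId's embedded timestamp rather than a real hash, and the collection/worker names are placeholders):

```python
# Sketch only: partitions on the ObjectId's embedded timestamp (second
# resolution), not a true hash of the whole _id. Collection name,
# NUM_WORKERS and WORKER_INDEX are placeholders. Note this $expr match
# cannot use an index, so every worker still triggers a full scan.
from pymongo import MongoClient

NUM_WORKERS = 4    # total number of worker processes
WORKER_INDEX = 0   # this worker's slot: 0 .. NUM_WORKERS - 1

coll = MongoClient()["mydb"]["mycoll"]

# _id -> timestamp (Date) -> milliseconds -> seconds, then modulo
partition_filter = {
    "$expr": {
        "$eq": [
            {"$mod": [
                {"$divide": [{"$toLong": {"$toDate": "$_id"}}, 1000]},
                NUM_WORKERS,
            ]},
            WORKER_INDEX,
        ]
    }
}

for doc in coll.find(partition_filter):
    pass  # process this worker's share of the documents
```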

What are you using for your _id field? If it’s just a regular ObjectId then you can break the collection down based on that and do a straight range comparison to pull batches.

Maybe you can explain your use case so that we understand better what your goal is.

In a large batch-processing case, the bottleneck is usually network bandwidth and/or MongoDB server load, so I’m not sure how using multiple client processes helps here.

As an example of what we do: when running massive updates (close to 1B documents) we split the work into year or month-of-year groups and run over them one group at a time, instead of trying to update all documents at once and putting a large strain on the server.
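
A rough sketch of that pattern (assuming a createdAt date field; the field name, date range and the update itself are placeholders):

```python
# Sketch of the month-by-month approach; "createdAt", the date range,
# and the update itself are hypothetical placeholders.
from datetime import datetime
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]

def month_starts(first_year, last_year):
    """Yield the first day of every month in the given year range."""
    for year in range(first_year, last_year + 1):
        for month in range(1, 13):
            yield datetime(year, month, 1)

starts = list(month_starts(2020, 2023)) + [datetime(2024, 1, 1)]
for lo, hi in zip(starts, starts[1:]):
    # One bounded update per month instead of one huge update.
    result = coll.update_many(
        {"createdAt": {"$gte": lo, "$lt": hi}},
        {"$set": {"reprocessed": True}},  # placeholder update
    )
    print(f"{lo:%Y-%m}: matched {result.matched_count}")
```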

As Kabe_W said, what’s your use case? Are you just using MongoDB updates to modify documents, or do you have a process that does a bunch of work first (say calling third-party APIs or running a complex financial calculation on a document, e.g. pricing) and then writes the results back to MongoDB?

If your _id field is an ObjectId, then it has a date/time embedded in it that can be used for filtering and for establishing processing boundaries.
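
For example (a sketch with pymongo; the dates and collection name are just placeholders):

```python
# Sketch: build _id range boundaries from dates using the timestamp that
# is embedded in every ObjectId. Dates and collection name are placeholders.
from datetime import datetime
from bson import ObjectId
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]

lo = ObjectId.from_datetime(datetime(2023, 1, 1))
hi = ObjectId.from_datetime(datetime(2023, 2, 1))

# All documents whose _id was generated in January 2023.
for doc in coll.find({"_id": {"$gte": lo, "$lt": hi}}):
    pass  # process this slice
```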

You could also just get 100,000 documents sorted by _id, then get the next set based on the last _id received…and repeat. That way you always have a known number of documents being passed to each process.
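
In pymongo that loop might look something like this (sketch only; the collection name and batch size are placeholders):

```python
# Sketch of the "next batch after the last _id" loop; batch size and
# collection name are placeholders.
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]
BATCH_SIZE = 100_000

last_id = None
while True:
    query = {"_id": {"$gt": last_id}} if last_id else {}
    batch = list(coll.find(query).sort("_id", 1).limit(BATCH_SIZE))
    if not batch:
        break
    # hand `batch` off to a worker process here
    last_id = batch[-1]["_id"]
```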

OK, I thought by “process” the OP meant using multiple client application processes to do the update. Seems I was wrong on that.

Could be either! Both could be valid scenarios.