One important thing to know is that when a document is updated, it is eventually rewritten in whole to disk. See Will NaN entries consume much space? - #4 by Stennie_X.
I wanted to bring that up because a big migration (one that touches every document) like this is most likely disk I/O intensive.
So updating each document with a random number and then using that number to assign a worker will result in writing the document twice: once for the random-number update and once for the worker's update.
As for leveraging _id with your find() from your 2nd post, you will need to add a projection, because find() returns the whole document by default, so without one the whole document will be read.
If this migration is a one-time thing, you might consider leveraging the flexible-schema nature of MongoDB and only migrate each document when it is needed by another single-document update use-case. The following is useful:
Building with Patterns: The Schema Versioning Pattern | MongoDB Blog.
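As a sketch of that lazy-migration idea (the field name `schema_version` comes from the pattern; the helper `migrateIfNeeded` and the example transformation are mine, not from the blog post): whenever the application reads a document, it checks the version field and upgrades the document in memory before using it, then writes it back once.

```javascript
// Hypothetical lazy-migration helper: upgrade a document to the current
// schema version only when the application actually touches it.
const CURRENT_VERSION = 2;

function migrateIfNeeded(doc) {
  // Documents written before the migration carry no schema_version field.
  const version = doc.schema_version || 1;
  if (version >= CURRENT_VERSION) {
    return doc; // already migrated, nothing to write back
  }
  // Example transformation: split a single "name" field into two fields.
  const [first, ...rest] = doc.name.split(" ");
  return {
    ...doc,
    first_name: first,
    last_name: rest.join(" "),
    schema_version: CURRENT_VERSION,
  };
}

// In the real application you would then persist the migrated document,
// e.g. with a single replaceOne() on its _id.
const migrated = migrateIfNeeded({ _id: 1, name: "Ada Lovelace" });
console.log(migrated.first_name, migrated.last_name, migrated.schema_version);
// Ada Lovelace 2
```

This way each document is rewritten to disk only once, as part of an update you were going to do anyway, instead of once for the migration plus once for the normal update.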
To create batches you may always use sort().skip().limit(), more or less like paging. Example, with 4 workers:
batch_number = n
batch_size = 1000
worker_number = m
number_of_workers = 4
skipped_documents = ( number_of_workers * batch_number + worker_number ) * batch_size;
// skipped_documents = (4 * n + m) * 1000
// c is the cursor returned by find()
batched_documents =
    c.sort( { _id: 1 } ).
      projection( { _id: 1 } ).
      skip( skipped_documents ).
      limit( batch_size )
Documents will be batched like this:
batch 0 worker 0 updates document 0 to batch_size - 1
batch 0 worker 1 updates document 1 * batch_size to 2 * batch_size - 1
batch 0 worker 2 updates document 2 * batch_size to 3 * batch_size - 1
batch 0 worker 3 updates document 3 * batch_size to 4 * batch_size - 1
batch 1 worker 0 updates document 4 * batch_size to 5 * batch_size - 1
batch n worker m updates document ( 4 * n + m ) * batch_size to ( 4 * n + m + 1 ) * batch_size - 1
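Putting the arithmetic above into a small self-contained function (purely illustrative; the function and variable names are mine) makes it easy to check which _id-ordered positions each worker touches:

```javascript
// Compute the range of document positions (in _id sort order) that a given
// worker processes in a given batch, following the formula
// skipped_documents = (number_of_workers * batch_number + worker_number) * batch_size
function batchRange(batchNumber, workerNumber, numberOfWorkers, batchSize) {
  const skipped = (numberOfWorkers * batchNumber + workerNumber) * batchSize;
  return { skip: skipped, first: skipped, last: skipped + batchSize - 1 };
}

// With 4 workers and batch_size = 1000:
console.log(batchRange(0, 0, 4, 1000)); // { skip: 0, first: 0, last: 999 }
console.log(batchRange(0, 1, 4, 1000)); // { skip: 1000, first: 1000, last: 1999 }
console.log(batchRange(1, 0, 4, 1000)); // { skip: 4000, first: 4000, last: 4999 }
```

Each worker feeds its own `skip` value into the cursor shown earlier, so the four workers cover disjoint slices of the collection without coordinating with each other.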