I would like to do a batch operation that does something to every document in a large collection. The batch will be run using multiple workers, each with an independent connection to mongodb.
Each batch worker knows the total number of workers (N) and its index (I). So, N might be 4, and then there will be worker 0, 1, 2 and 3.
I’d like to query a subset of the documents in the collection from each worker.
I have a few ideas about how this might be accomplished:
1 populate every document in the collection with a random positive integer, and then do something like randomInt: {$mod: [N, I]}. The downside here is that we have to perform a migration to assign a randomInt to every doc, and modify our application logic to make sure they are inserted at creation time and maintained during updates.
2 if N is 16 (or 8, or 4) I could do something like db.activityMeta.count({$where: “hex_md5(this._id.toString())[23] == ‘g’“}) (use md5 to hash the _id and then use $where to partition by the last character). The downside here is that this is quite awkward for different values of N, and the $where clause cannot use the index, and so has to evaluate the condition on every doc for every worker (thus, on the db side, doing N simultaneous collection scans).
3 do a pre-emptive $bucketAuto aggregation, then pass an _id range to each individual worker. I think this adds a considerable amount of complexity in running the initial query, passing the ranges, and also due to the timing wrt. what happens to documents that are inserted after the initial query.
I am curious if I am missing any other ways to accomplish this? Ideally I’d be able to do something like create a cursor for “hash(_id) % N == I”, but in a way that could leverage the _id index, without having to modify documents in the collection.