Hi everyone,
I have a large collection (200 million documents) to which I must add a new field, hashedLabelId: as the name suggests, it is the hash of another field of each document.
This processing will have to be done by a script that will run for days on a VM.
My idea is the following (a rough sketch in code follows the list):
1. build an index on hashedLabelId (at first it will contain no entries, I guess);
2. query for a page of 50–100 documents, filtering for those where hashedLabelId does not exist (the corresponding query would be something like {'hashedLabelId': {'$exists': False}});
3. compute the hashes locally and execute a bulk update operation;
4. loop back to step 2.
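
In code, the loop would look roughly like this. This is only a sketch assuming PyMongo; the connection string, the items collection, the labelId source field, and SHA-256 as the hash are all placeholders for my actual setup:

```python
import hashlib

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["items"]                     # placeholder db/collection

# Step 1: index the new field (initially no document has it).
coll.create_index("hashedLabelId")

PAGE_SIZE = 100

while True:
    # Step 2: fetch a page of documents still missing the field.
    page = list(
        coll.find(
            {"hashedLabelId": {"$exists": False}},
            {"_id": 1, "labelId": 1},  # project only what is needed
        ).limit(PAGE_SIZE)
    )
    if not page:
        break  # nothing left to backfill

    # Step 3: compute the hashes locally and bulk-update the page.
    ops = [
        UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {
                "hashedLabelId": hashlib.sha256(
                    str(doc["labelId"]).encode("utf-8")
                ).hexdigest()
            }},
        )
        for doc in page
    ]
    coll.bulk_write(ops, ordered=False)
    # Step 4: loop back to step 2.
```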
Can this approach work? I fear that the more documents I update, the slower the query {'hashedLabelId': {'$exists': False}} becomes, because the documents must be scanned on each iteration.
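
To check whether that query actually keeps using the index instead of degrading into a collection scan, I suppose I could inspect the query plan with explain(), e.g. (reusing coll from the sketch above):

```python
# IXSCAN in the winning plan means the index is used; COLLSCAN means a full scan.
plan = coll.find({"hashedLabelId": {"$exists": False}}).limit(100).explain()
print(plan["queryPlanner"]["winningPlan"])
```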
Thank you in advance.