Hi everyone,
I have a large collection (200 million documents) to which I must add a new field, hashedLabelId: as the name suggests, it is the hash of another field of each document.
This processing will have to be done by a script that will run for days on a VM.
My idea is the following (a rough sketch in code follows the list):
1. build an index on hashedLabelId (at first it will contain no entries, I guess);
2. query for a page of 50–100 documents, filtering for those where hashedLabelId does not exist (the corresponding query would be something like {'hashedLabelId': {'$exists': False}});
3. compute the hashes locally and execute a bulk update operation;
4. loop back to step 2.
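
In code, the loop would look roughly like this. This is only a sketch assuming PyMongo; the connection string, the items collection, the labelId source field, and SHA-256 as the hash are all placeholders for my actual setup:

```python
import hashlib

from pymongo import MongoClient, UpdateOne

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["items"]                     # placeholder db/collection

# Step 1: index the new field (initially no document has it).
coll.create_index("hashedLabelId")

PAGE_SIZE = 100

while True:
    # Step 2: fetch a page of documents still missing the field.
    page = list(
        coll.find(
            {"hashedLabelId": {"$exists": False}},
            {"_id": 1, "labelId": 1},  # project only what is needed
        ).limit(PAGE_SIZE)
    )
    if not page:
        break  # nothing left to backfill

    # Step 3: compute the hashes locally and bulk-update the page.
    ops = [
        UpdateOne(
            {"_id": doc["_id"]},
            {"$set": {
                "hashedLabelId": hashlib.sha256(
                    str(doc["labelId"]).encode("utf-8")
                ).hexdigest()
            }},
        )
        for doc in page
    ]
    coll.bulk_write(ops, ordered=False)
    # Step 4: loop back to step 2.
```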
Can this approach work? I fear that the more documents I update, the slower the query {'hashedLabelId': {'$exists': False}} becomes, because the documents must be scanned on each iteration.
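
To check whether that query actually keeps using the index instead of degrading into a collection scan, I suppose I could inspect the query plan with explain(), e.g. (reusing coll from the sketch above):

```python
# IXSCAN in the winning plan means the index is used; COLLSCAN means a full scan.
plan = coll.find({"hashedLabelId": {"$exists": False}}).limit(100).explain()
print(plan["queryPlanner"]["winningPlan"])
```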
Thank you in advance.