Replace hundreds of thousands of documents with few changes

Hi,

I need to replace hundreds of thousands of documents in a collection every few minutes with a new list of documents. The actual changes between the old documents and the new documents are very small - might be just a few inserts, a few deletes and a few updates. Each document contains a number field, a string field, and an object field.

Simply deleting all old documents and inserting all new documents would be slow.

I also tried loading all old documents into the client application, and compare the two lists of documents in-app, which is putting a lot of pressure on the client application.

Any advice on how to solve this problem? Thanks!

Assume you don’t know which doc has been modified. The only thing you know is that some are changed, some are deleted, and some are inserted.

In that case you can try comparing the two lists and just ask MongoDB to do the “delta” work. But as you said, this requires some computing resources and memory on a host. If that is not feasible, I would simply suggest removing all the old documents and inserting all the new ones.

That being said, if you can change the source data generation side so that you know the “delta”, the work becomes much easier.

Thanks for your reply! Assume I can’t get the diff from the source data generation. Which of these three would require the least CPU, network bandwidth, and memory?

  1. Comparing the two lists in mongodb
  2. Comparing the two lists in client
  3. Removing all old ones and inserting all new ones

Essentially, no. 1 and no. 3 are the same: before you can compare, you have to insert all the new documents anyway.

If the client host machine can store all the old and new docs in memory, I would suggest no. 2. Using the database for the comparison work is more challenging, since it involves not only memory but also disk, locking, concurrency control, etc.
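For option no. 2, the client-side comparison can be sketched roughly as below. This is a minimal sketch, assuming documents are keyed by `_id` and that a whole-document string comparison is good enough; the names `diffDocs`, `oldDocs`, and `newDocs` are illustrative, not from the original post.

```javascript
// Sketch: compute the delta between the old and new document lists on the
// client, keyed by _id, so only the changed documents are sent to MongoDB.
function diffDocs(oldDocs, newDocs) {
  const oldById = new Map(oldDocs.map((d) => [d._id, d]));
  const newById = new Map(newDocs.map((d) => [d._id, d]));

  const inserts = [];
  const updates = [];
  const deletes = [];

  for (const [id, doc] of newById) {
    const old = oldById.get(id);
    if (old === undefined) {
      inserts.push(doc); // present in new list only
    } else if (JSON.stringify(old) !== JSON.stringify(doc)) {
      // Cheap structural comparison; fine when field order is stable.
      updates.push(doc);
    }
  }
  for (const id of oldById.keys()) {
    if (!newById.has(id)) deletes.push(id); // present in old list only
  }
  return { inserts, updates, deletes };
}
```

The resulting delta could then be applied in one round trip with the driver’s `bulkWrite`, mapping inserts to `insertOne`, updates to `replaceOne`, and deletes to `deleteOne` operations.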

Keeping everything in RAM and relying purely on the CPU to do the work is generally faster.

No. 3 will be really bad in terms of indexes compared to the others if you have a lot of indexes and you do not update indexed fields often: every index entry gets rebuilt even though most indexed values are unchanged.

No. 2 is bad in terms of I/O, since one full list is downloaded and another is uploaded.

No. 1 can be achieved fairly easily by uploading the new list into a temporary collection and then using a $merge stage to perform the updates and inserts.
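As a sketch of what that $merge stage could look like: assume the new list has been bulk-inserted into a staging collection named `items_staging` and the live collection is `items` (both names are illustrative). The pipeline is built as plain data here; in mongosh you would run it with `db.items_staging.aggregate(mergePipeline)`.

```javascript
// Sketch: merge the staged (new) list into the live collection.
// "items" and "items_staging" are hypothetical collection names.
const mergePipeline = [
  {
    $merge: {
      into: "items",             // the live collection
      on: "_id",                 // match staged docs to live docs by _id
      whenMatched: "replace",    // overwrite documents that changed
      whenNotMatched: "insert",  // insert documents that are new
    },
  },
];
// mongosh usage: db.items_staging.aggregate(mergePipeline)
```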

Handling the deletes is a different story in all cases. How do you know which documents need to be deleted? Unless, of course, you delete all documents in the original collection that are not present in the updated list.

The “use a $merge stage to perform the updates and inserts” approach sounds interesting. Would that be resource-intensive on MongoDB, though, if it involves comparing the documents between the two collections?

No matter where it is done, comparing the documents from the two lists has the same complexity. Usually, your server is better equipped to handle the load.

Anyway, downloading the original list from the server to compare on the client also involves the server, once for the download and again for the update upload. Bandwidth might be an issue.

What is not clear is how you determine the documents to delete. Unless, of course, you assume that documents not present in the updated list are to be deleted.
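If that assumption holds, the deletes can be handled by collecting the staged `_id`s and removing everything else from the live collection. A minimal sketch, again using the hypothetical `items` / `items_staging` names; the `deleteFilter` helper only builds the filter document, since the actual `deleteMany` needs a running server.

```javascript
// Sketch: build a filter matching live documents whose _id is absent
// from the new (staged) list. stagedIds would come from
// db.items_staging.distinct("_id") in mongosh.
function deleteFilter(stagedIds) {
  // Matches every document whose _id is NOT in the staged list.
  return { _id: { $nin: stagedIds } };
}
// mongosh usage: db.items.deleteMany(deleteFilter(stagedIds))
```

Note that for very large lists, `$nin` with hundreds of thousands of ids makes for a big filter document; benchmarking against alternatives (e.g. a $merge-based flagging pass) would be worthwhile.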

No matter what, if you are not sure what provides the best performance for your use case, the best solution is to set up benchmarks, test, and then choose the best approach for your use case. If you are not willing to write benchmarks to compare your different approaches, simply start by implementing the simplest one and fix it only if it proves problematic.