What is the best way to update a large collection with data from another collection?

Hi everyone,

I’m fairly new to MongoDB and struggling to understand the best way to update a large collection with data from another collection.

We have a collection with ~6m items, and a field within each document needs to be updated to match a value from an item in a related collection.

Initially I wrote an aggregation pipeline which built up the required data via $lookup and used $out to update the collection, but it took over 90 minutes to run locally, which isn’t ideal. I suspect this is because the documents also contain a lot of other data.
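For reference, the pipeline was along these lines (the collection and field names here are just placeholders, not our real schema):

```typescript
import { MongoClient } from "mongodb";

// Placeholder names: "items" is the big (~6m) collection, "related" holds the
// values we want to copy across, joined on items.relatedId -> related._id.
const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const items = client.db("mydb").collection("items");

await items
  .aggregate(
    [
      {
        $lookup: {
          from: "related",
          localField: "relatedId",
          foreignField: "_id",
          as: "relatedDoc",
        },
      },
      { $unwind: "$relatedDoc" },
      // Copy the field we need from the joined document.
      { $set: { someField: "$relatedDoc.someField" } },
      { $project: { relatedDoc: 0 } },
      // $out replaces the output collection wholesale with the pipeline result.
      { $out: "items" },
    ],
    { allowDiskUse: true }
  )
  .toArray();

await client.close();
```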

I also started to look at using a cursor with forEach, but it still seemed very slow and getting useful debug output was difficult.

Can anyone advise how they would handle a large update such as this? I’m thinking the best approach would be to prepare a JSON payload of update operations for use with bulkWrite?
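Roughly what I have in mind is to stream the source values with a cursor and flush batched updateOne operations via bulkWrite, something like the sketch below (again, "items", "related", "relatedId" and "someField" are placeholders):

```typescript
import { MongoClient, AnyBulkWriteOperation, Document } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("mydb");
const related = db.collection("related"); // source of the new values
const items = db.collection("items");     // ~6m documents to update

const BATCH_SIZE = 1000;
let ops: AnyBulkWriteOperation<Document>[] = [];

// Only pull the fields we actually need from the related collection.
for await (const doc of related.find({}, { projection: { _id: 1, someField: 1 } })) {
  ops.push({
    updateOne: {
      filter: { relatedId: doc._id }, // assumes an index on items.relatedId
      update: { $set: { someField: doc.someField } },
    },
  });
  if (ops.length === BATCH_SIZE) {
    await items.bulkWrite(ops, { ordered: false }); // unordered: server can apply out of order
    ops = [];
  }
}
if (ops.length > 0) {
  await items.bulkWrite(ops, { ordered: false });
}

await client.close();
```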

Hi @Phunky,

I think you are right. For the initial build I would suggest doing a range query on an indexed field, or a collection scan depending on the best read logic, and dividing the data into bulk chunks keyed on a unique field.

Then those bulk chunks can be passed in parallel to multiple write threads that run the inserts/updates simultaneously, each filtering on the unique key (make sure it is indexed on the target collection). Please make sure to use w: majority to keep the replica set members in sync and avoid cache pressure on the primary.
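As a rough sketch of the idea (the collection names, the range field and the chunk boundaries below are only examples, not something to copy verbatim):

```typescript
import { MongoClient, AnyBulkWriteOperation, Document } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("mydb");
const related = db.collection("related");
// w: "majority" keeps replica set members in sync while the bulk load runs.
const items = db.collection("items", { writeConcern: { w: "majority" } });

// Example ranges over an indexed field; in practice derive these from the
// data itself (e.g. min/max queries or $bucketAuto).
const ranges = [
  { min: 0, max: 1_500_000 },
  { min: 1_500_000, max: 3_000_000 },
  { min: 3_000_000, max: 4_500_000 },
  { min: 4_500_000, max: 6_000_000 },
];

async function processRange(min: number, max: number): Promise<void> {
  let ops: AnyBulkWriteOperation<Document>[] = [];
  const cursor = related.find(
    { seq: { $gte: min, $lt: max } },       // range query on an indexed field
    { projection: { _id: 1, someField: 1 } }
  );
  for await (const doc of cursor) {
    ops.push({
      updateOne: {
        filter: { relatedId: doc._id },     // unique key, indexed on the target
        update: { $set: { someField: doc.someField } },
        upsert: true,                       // insert or update as needed
      },
    });
    if (ops.length === 1000) {
      await items.bulkWrite(ops, { ordered: false });
      ops = [];
    }
  }
  if (ops.length) await items.bulkWrite(ops, { ordered: false });
}

// Run the chunks in parallel.
await Promise.all(ranges.map((r) => processRange(r.min, r.max)));
await client.close();
```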

To keep this collection up to date afterwards, I would suggest using Atlas Triggers if you are on Atlas, or a change stream, so that you stream changes from the source collection as they happen.
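For the change stream option, a minimal sketch would be something like this (field and collection names are again just examples; change streams require a replica set):

```typescript
import { MongoClient } from "mongodb";

const client = new MongoClient("mongodb://localhost:27017");
await client.connect();
const db = client.db("mydb");
const related = db.collection("related");
const items = db.collection("items");

// Watch inserts/updates/replaces on the source collection and mirror the
// relevant field onto matching target documents as changes arrive.
const stream = related.watch(
  [{ $match: { operationType: { $in: ["insert", "update", "replace"] } } }],
  { fullDocument: "updateLookup" } // deliver the full document for updates
);

for await (const change of stream) {
  if (
    change.operationType === "insert" ||
    change.operationType === "update" ||
    change.operationType === "replace"
  ) {
    const doc = change.fullDocument;
    if (!doc) continue;
    await items.updateMany(
      { relatedId: doc._id },
      { $set: { someField: doc.someField } }
    );
  }
}
```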

Best
Pavel

Thanks for the response @Pavel_Duchovny, I’ll take those points into consideration.

Thankfully this is just a one-off task we need to run to clean up some problematic data and restructure our existing documents.