$out vs mongodump then mongorestore benchmarking

If I have 90 million records in a collection and need to copy 50 million of them to another collection based on a condition, is $out a good solution, or would I kill my live server because the 50 million records are held temporarily in RAM before being inserted?
I know the best way is to stream the data and insert it using the MongoDB driver, but I’m looking for a direct query.
Finally, if I use mongodump with -q to add the condition and then mongorestore into the new collection, that should be a good solution, right?

If you use $out, then yes, you might overload the server, because the server will try to do the work as fast as it can.

If you use mongodump, then your network to the server might be overloaded.

If your server/cluster is not in production, then the extra load does not really matter.

If it is, I suggest that you split your 50 million documents into something like 50 batches of 1 million documents each and spread them over time, so that you do not overload the server/cluster or the network. You could then use mongodump, but I would prefer $merge so that network bandwidth is preserved.
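A rough sketch of the batching idea, in case it helps. It assumes pymongo, a target collection in the same database that starts empty and only receives these copies, and a condition that does not itself filter on `_id`. The function names, the pause length, and the batch size are hypothetical, not something prescribed in this thread.

```python
import time


def build_batch_pipeline(condition, last_id, batch_size, target):
    """Build a pipeline that copies one batch of matching documents into `target`."""
    match = dict(condition)
    if last_id is not None:
        # Resume after the highest _id copied so far (assumes `condition` has no _id key).
        match["_id"] = {"$gt": last_id}
    return [
        {"$match": match},
        {"$sort": {"_id": 1}},  # order by _id so each batch is a contiguous _id range
        {"$limit": batch_size},
        {"$merge": {
            "into": target,
            "on": "_id",
            "whenMatched": "keepExisting",  # safe to re-run a batch after a failure
            "whenNotMatched": "insert",
        }},
    ]


def copy_in_batches(source, target, condition, batch_size=1_000_000, pause_seconds=60):
    """Copy matching documents batch by batch, pausing between batches.

    `source` is assumed to be a pymongo Collection and `target` the name of a
    collection in the same database that holds only documents copied by this loop.
    """
    last_id = None
    while True:
        source.aggregate(build_batch_pipeline(condition, last_id, batch_size, target))
        newest = source.database[target].find_one(sort=[("_id", -1)])
        if newest is None or newest["_id"] == last_id:
            break  # the last batch copied nothing new, so we are done
        last_id = newest["_id"]
        time.sleep(pause_seconds)  # give the cluster room for its normal workload
```

The whole copy runs server-side, so only the small aggregate commands cross the network. The pause between batches is the knob to tune against your production load.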

Thanks for your reply. So $out inside an aggregate query will store the 50M records in RAM and then do the insert, correct?
Is $merge better than $out in terms of RAM and network?
In conclusion, streaming from MongoDB and then inserting is the optimal solution, correct?

More or less, but it is not as straightforward as that. The details are not really important for this discussion.

That is not what I wrote. Both $merge and $out are better than mongodump network-wise. The advantage of $merge over $out is that it allows you to batch the work and distribute it over time.

Distributing the work over time in batches of documents is the best way to preserve the cluster's performance.

A pipeline with $out will NOT attempt to hold all the intermediate data in RAM (unless you use a $group or $sort stage in your pipeline with certain caveats). The data will be streamed into the target collection for $out. See the “Pipeline Performance Considerations” chapter of the Practical MongoDB Aggregations book for more information about streaming vs blocking.
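To make the streaming vs blocking distinction concrete, here is a small sketch in Python. The collection and field names are made up, and the helper only flags the two stages named above; it does not account for the caveats (for example, a $sort that can be satisfied from an index does not need to buffer everything).

```python
# Stages named in this answer as potentially blocking (caveats apply).
POTENTIALLY_BLOCKING = {"$group", "$sort"}


def potentially_blocking_stages(pipeline):
    """Return the names of stages in `pipeline` that may need to buffer their input."""
    return [name for stage in pipeline for name in stage if name in POTENTIALLY_BLOCKING]


# Documents flow through $match straight into the target collection.
streaming_pipeline = [
    {"$match": {"status": "active"}},
    {"$out": "active_records"},
]

# $group has to see every input document before it can emit its results.
blocking_pipeline = [
    {"$match": {"status": "active"}},
    {"$group": {"_id": "$customer_id", "total": {"$sum": "$amount"}}},
    {"$out": "customer_totals"},
]

print(potentially_blocking_stages(streaming_pipeline))  # []
print(potentially_blocking_stages(blocking_pipeline))   # ['$group']
```

A filtered copy like the one in the original question has the shape of the first pipeline, so it should not need to hold the 50 million documents in memory.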

If you are populating a new collection from scratch, $out will be marginally faster than $merge as $merge has to perform more checks for every record added. However, $out only works for unsharded collections, so if you are writing to a sharded collection, you will need to use $merge.
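As a minimal illustration of that choice, the sketch below picks the output stage based on whether the target is sharded. The function names and the `target_is_sharded` flag are hypothetical; you would have to determine the sharding status of your own target yourself.

```python
def output_stage(target, target_is_sharded):
    """Pick the final pipeline stage for copying into `target`."""
    if target_is_sharded:
        # $out cannot write to a sharded collection, so fall back to $merge.
        return {"$merge": {"into": target}}
    # For a fresh, unsharded target, $out avoids $merge's per-document checks.
    return {"$out": target}


def copy_pipeline(condition, target, target_is_sharded=False):
    """Build a filter-and-copy pipeline ending in $out or $merge."""
    return [{"$match": condition}, output_stage(target, target_is_sharded)]
```

Keep in mind that $out replaces the target collection if it already exists, so it only fits the "populate from scratch" case.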

Thanks @Paul_Done, that was really helpful. It’s exactly what I was looking for.