Hello everyone! I’m about to migrate the documents from my existing collections into new time series collections I want to create. Among these documents (hundreds of millions) there is a certain number of duplicates that I want to remove, and I’m trying to decide which operation to do first: restoring the data into the new time series collections, or cleaning the data. From the testing I’ve done so far, the restore operation is really slow, so it would be better to run it with less data; on the other hand, the aggregation pipeline I use to remove duplicates is much faster on a time series collection than on a normal collection. Any suggestions?
Welcome to the MongoDB Community!
There are a few approaches for removing duplicate documents from MongoDB collections:
- You can use the `$group` stage to group documents by the fields that make them unique, and the `$out` stage to output only one document per unique group into a new collection.
- If the collection is large, you can shard it and run the de-duplication on each shard to parallelize it. After that, merge the results back into a single collection.
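The `$group`/`$out` approach above can be sketched as an aggregation pipeline. This is only an illustration: the field names (`metadata.sensorId`, `timestamp`) and the collection names (`readings`, `readings_deduped`) are assumptions, so adjust them to whatever fields actually identify a unique document in your schema.

```python
# Sketch of a $group/$out de-duplication pipeline, assuming the pair
# (metadata.sensorId, timestamp) identifies a unique measurement.
dedup_pipeline = [
    {
        "$group": {
            # Group by the fields that define a duplicate.
            "_id": {"sensorId": "$metadata.sensorId", "timestamp": "$timestamp"},
            # Keep the first full document seen in each group.
            "doc": {"$first": "$$ROOT"},
        }
    },
    # Promote the kept document back to the top level.
    {"$replaceRoot": {"newRoot": "$doc"}},
    # Write the de-duplicated result into a new (hypothetical) collection.
    {"$out": "readings_deduped"},
]

# With pymongo this would be run against a live deployment, roughly:
# from pymongo import MongoClient
# client = MongoClient("mongodb://localhost:27017")
# client["mydb"]["readings"].aggregate(dedup_pipeline, allowDiskUse=True)
```

Note that `$group` is a blocking stage, so for hundreds of millions of documents you would want `allowDiskUse=True` to avoid hitting the in-memory limit.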
Here are some pointers for optimizing the de-duplication process:
- Index the grouping fields for faster processing.
- De-duplicate in batches if the collection is too large to process at once.
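One way to do the batching mentioned above is to split the data into time windows, run the de-duplication pipeline on one window at a time with a `$match` stage, and write each batch into the target collection with `$merge` (keeping already-written documents on `_id` collisions across batches). Again a sketch, not a definitive implementation — the field and collection names are assumptions from the earlier example:

```python
from datetime import datetime, timedelta

def batch_windows(start, end, step):
    """Yield half-open (lo, hi) time windows covering [start, end)."""
    lo = start
    while lo < end:
        hi = min(lo + step, end)
        yield lo, hi
        lo = hi

def batched_dedup_pipeline(lo, hi):
    """De-duplication pipeline restricted to one time window."""
    return [
        # Only process documents inside this window.
        {"$match": {"timestamp": {"$gte": lo, "$lt": hi}}},
        {
            "$group": {
                "_id": {"sensorId": "$metadata.sensorId", "timestamp": "$timestamp"},
                "doc": {"$first": "$$ROOT"},
            }
        },
        {"$replaceRoot": {"newRoot": "$doc"}},
        # Merge into the target collection; if a document with the same _id
        # was already written by an earlier batch, keep the existing one.
        {"$merge": {"into": "readings_deduped", "whenMatched": "keepExisting"}},
    ]

# One pipeline per daily window; each would be run via collection.aggregate(...).
windows = list(
    batch_windows(datetime(2023, 1, 1), datetime(2023, 1, 8), timedelta(days=1))
)
```

Batching this way also keeps each `$group` small enough to be friendly to memory limits, and lets you resume from the last completed window if the job is interrupted.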
May I ask which specific version of MongoDB you used for testing?