Data migration to timeseries

Hello everyone! I’m about to migrate documents from my existing collections into new time-series collections. These collections hold hundreds of millions of documents, a number of which are duplicates that I want to remove, and I’m trying to decide which operation to do first: restoring the data into the new time-series collections, or cleaning the data. From my testing so far, the restore operation is really slow, so it would be better to run it on less data; on the other hand, the aggregation pipeline I use to remove duplicates runs much faster on a time-series collection than on a normal collection. Any suggestions?

Hi @Umberto_Casaburi,

Welcome to the MongoDB Community!

There are a few approaches for removing duplicate documents from MongoDB collections:

  • You can use the $group stage to group documents by the fields that make them unique, keep one document per group, and use the $out stage to write the result into a new collection (see the sketch after this list).

    In the current version of MongoDB, you can’t use $out to write to a time-series collection, so the output has to be a regular collection for now. However, this capability is planned for a future version.

  • If the collection is large, you can shard it and run the de-duplication on each shard in parallel, then merge the results back into a single collection.
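
As a rough illustration of the first approach, here’s a minimal mongosh sketch. It assumes documents count as duplicates when they share the same timestamp and meta.sensorId values; the collection and field names are placeholders, so substitute the fields that make your documents unique:

```javascript
// Minimal de-duplication sketch: keep one document per
// (timestamp, meta.sensorId) pair. All names here are assumptions.
db.measurements.aggregate([
  { $group: {
      _id: { ts: "$timestamp", sensor: "$meta.sensorId" },
      doc: { $first: "$$ROOT" }          // keep one document per group
  } },
  { $replaceRoot: { newRoot: "$doc" } }, // restore the original shape
  { $out: "measurements_deduped" }       // must be a regular collection
], { allowDiskUse: true })               // let large groups spill to disk
```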

Here are some pointers for optimizing the de-duplication process:

  • Index the fields you group on for faster processing.
  • De-duplicate in batches if the collection is too large to process at once (a sketch of both points follows this list).
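
Here’s one way both pointers could look in mongosh, run against the source (regular) collection before the restore. It again assumes the placeholder collection and field names from the earlier sketch:

```javascript
// A compound index on the grouping fields speeds up the duplicate scan
// (field names are assumptions; use your own).
db.measurements.createIndex({ timestamp: 1, "meta.sensorId": 1 })

// Find groups that contain more than one document, keep the first _id
// in each group, and delete the rest in batches of 1000.
const cursor = db.measurements.aggregate([
  { $group: {
      _id: { ts: "$timestamp", sensor: "$meta.sensorId" },
      ids: { $push: "$_id" },
      count: { $sum: 1 }
  } },
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true });

let batch = [];
while (cursor.hasNext()) {
  const group = cursor.next();
  batch.push(...group.ids.slice(1)); // keep ids[0], drop the duplicates
  if (batch.length >= 1000) {
    db.measurements.deleteMany({ _id: { $in: batch } });
    batch = [];
  }
}
if (batch.length > 0) {
  db.measurements.deleteMany({ _id: { $in: batch } });
}
```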

May I ask which specific version of MongoDB you used for testing?

Regards,
Kushagra