MongoDB remove duplicate with millions of records

Paul_N_A1 · August 2, 2022, 6:29am

I’m trying to remove duplicates of documents, in the collection the collection have millions of records and duplicates data I assume have 100k records. I did use aggregation to delete those duplicates, but it is slow and is not ideal when we deploy it to prod. Is there a better and fast way to remove duplicates in a collection?

What I have tried so far:

db.mycollection.aggregate([
{"$match":{
  "create_date_audit":{
      $gte: ISODate('2022-07-25T18:27:56.084+00:00'),
      $lte: ISODate('2022-07-26T20:15:50.561+00:00')
  }
}},
{"$sort":{
  _id: -1
}},
{"$group":{
  _id: {
    notification_id: '$notifId',
    empId: '$empId',
    date: '$date'
  },
  dups: {
    $push: '$_id'
  },
  creationTimestamp: {
    $push: '$create_date'
  },
  count: {
    $sum: 1
  }
}},
{"$match":{
  _id: {
    $ne: null
  },
  count: {
    $gt: 1
  }
}},
{"$sort":{
  create_date: -1
}},
], { allowDiskUse: true }).forEach(function(doc) {    
    db.mycollection.deleteMany({_id : {doc.dups[0]});  
})```

turivishal · August 2, 2022, 2:44pm

I don’t know better than your aggregation query, if this is a one-time process and you want to do this faster then I would suggest you can do this operation on your local machine in the MongoDB server after taking dump of the collection and you can restore it after the successful operation.

steevej · August 2, 2022, 3:32pm

As mentioned by @turivishal, if this is a one-time process vs a regular use-case will most likely be different. So it would be nice to know.

For continuously preventing duplicates, you might want to create a unique index on notifId,empId and date.

I am not sure about your deleteMany. It looks like you might delete only one of the duplicates, because you are deleting _id:doc.dups[0]. It might not be an issue if you know you only have 1 duplicate. But the use of $push makes me think that you might have many.

The final $sort seems useless since you do not have a field named create_date after the $group stage.

To speedup a little bit, you might remove the count in your $group and $match on doc.dups[1] : { $exists : true }.

You do not seem to use the $group field creationTimestamp so you could remove the $push. Unless this is what you want to $sort as the last stage.