Best way to remove duplicates for millions of records

Hi Team,

We’re trying to find and remove duplicate documents in a collection of about 2–3 million documents. We have tried the approach below, but it seems slow and doesn’t look ideal for production. What is the best and fastest way to delete duplicate records at this scale?

Sample code we have:

```javascript
db.mycollection.aggregate([
  { $match: {
      create_date_audit: {
        $gte: ISODate('2022-07-25T18:27:56.084+00:00'),
        $lte: ISODate('2022-07-26T20:15:50.561+00:00')
      }
  }},
  { $sort: { _id: -1 } },
  { $group: {
      _id: {
        notification_id: '$notifId',
        empId: '$empId',
        date: '$date'
      },
      dups: { $push: '$_id' },
      creationTimestamp: { $push: '$create_date' },
      count: { $sum: 1 }
  }},
  // Only keep groups that actually contain duplicates.
  { $match: {
      _id: { $ne: null },
      count: { $gt: 1 }
  }}
  // Note: a $sort on create_date here would be a no-op, since that field
  // no longer exists after the $group stage.
], { allowDiskUse: true }).forEach(function(doc) {
  // Keep the first _id in each group and delete the rest.
  doc.dups.shift();
  db.mycollection.deleteMany({ _id: { $in: doc.dups } });
});
```
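One way to speed this up is to collect every removable `_id` first and issue a single batched delete, instead of one `deleteMany` per group. Below is a minimal pure-JavaScript sketch of just the id-selection logic, assuming group documents shaped like the aggregation output above (the `groups` sample values are hypothetical stand-ins for ObjectIds):

```javascript
// Given the $group results, keep the first _id in each duplicate group
// and collect the remaining _ids for deletion.
function idsToDelete(groups) {
  const toDelete = [];
  for (const g of groups) {
    // g.dups holds every _id in the group; dups[0] is the one we keep.
    toDelete.push(...g.dups.slice(1));
  }
  return toDelete;
}

// Hypothetical sample groups (plain numbers standing in for ObjectIds):
const groups = [
  { _id: { notification_id: 'n1' }, dups: [5, 4, 3], count: 3 },
  { _id: { notification_id: 'n2' }, dups: [9, 8], count: 2 },
];
console.log(idsToDelete(groups)); // [ 4, 3, 8 ]
```

In the shell you would then delete everything in one call, e.g. `db.mycollection.deleteMany({ _id: { $in: toDelete } })`, which avoids a round trip per duplicate group.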

Dealing with duplicates is always a problem. So please do not post duplicate posts about the same issue.

Hi, I am facing the same issue. How do you manage it?

Removing duplicates from millions of records can be a challenging task. Here are some common approaches:

  1. Sorting and comparing: Sort the data based on one or more columns, and compare adjacent rows to identify duplicates. This method can work well if the data set is not too large and the fields to be compared are not too complex.
  2. Hashing: Create a hash value for each record, based on the fields that you want to use to identify duplicates. Then compare the hash values to identify duplicates. This method is efficient for large data sets and can be used to identify duplicates across multiple fields.
  3. Database Management System: Most database management systems offer a built-in mechanism for identifying and removing duplicates. For example, SQL provides the DISTINCT keyword to remove duplicates from a SELECT statement.
  4. Third-party software: Several third-party software solutions are available to identify and remove duplicates from large data sets. These solutions use algorithms and machine learning to detect duplicates.