Find duplicate records based on a key, insert into new collection and delete duplicates from original collection except 1 record

Hi,

I have duplicate records in collection invoices. I want to perform following steps on the collection:

  1. Identify duplicate records based on a key which I am able to find out:

db.invoices.aggregate([{ $group: { _id: "$request.doc_number", count: { $sum: 1 } } }, { $match: { count: { $gt: 50 } } }] )

  1. Insert duplicate records in new collection:

I’m facing issues with this as they query is inserting only ObjectIds and not full documents.

// Define the name of the new collection
const newCollectionName = "DuplicateInvoices";

// Step 1: Find duplicate doc_number groups
const duplicates = db.Invoices.aggregate([
    {
        $group: {
            _id: "$request.doc_number",
            count: { $sum: 1 },
            docs: { $push: "$_id" } // Collect all document IDs for each doc_number
        }
    },
    {
        $match: {
            count: { $gt: 1 } // Include only groups with more than 1 occurrence
        }
    }
]).toArray();

// Step 2: Insert duplicates into the new collection
duplicates.forEach(group => {
    // Remove one document ID from the list of duplicates to keep in the original collection
    const [idToKeep, ...idsToMove] = group.docs;
    
    // Insert the duplicates (excluding one record) into the new collection
    db[newCollectionName].insertMany(
        idsToMove.map(id => ({ _id: id })) // Insert documents by their IDs
    );
});
  1. I want to delete the duplicate invoices from the original collection except 1 original record.

I am yet to try this.

Appreciate any tips.

Best Regards,

Hi @Mohammad_Yusuf_Ghazi , It looks like you are facing a problem because you are doing a group and after that you need to access your full document.

So, let’s break down your requirements.

  • Identify duplicate records based on a key.
  • Insert duplicate records into a new collection, ensuring full documents are copied.
  • Delete duplicate records from the original collection, keeping one original record.
const duplicates = db.invoices.aggregate([
    {
        $group: {
            _id: "$request.doc_number",
            count: { $sum: 1 },
            docs: { $push: "$$ROOT" } // Collect all documents for each doc_number
        }
    },
    {
        $match: {
            count: { $gt: 1 } // Include only groups with more than 1 occurrence
        }
    }
]).toArray();

At this point we need to push for docs on group stage the full document, you are pushing just the id.

On step 2 we have

// Step 2: Insert duplicates into the new collection
duplicates.forEach(group => {
    // Remove one document from the list of duplicates to keep in the original collection
    const [docToKeep, ...docsToMove] = group.docs;
    
    // Insert the duplicates (excluding one record) into the new collection
    db[newCollectionName].insertMany(docsToMove);
});

At this way, the docsToMove is an array with the full documents, not just the ids.

And finally you can perform the delete query.

I hope that it help you.

Best Regards,

1 Like