How do I delete duplicate documents when creating a unique index?

I have a MongoDB instance (about 120 GB RAM, 250 GB SSD) running 4.0.10. I have a large collection of about 300M documents and I want to add a unique index on it. The index creation is quite slow, takes about 2 hours despite using most of the RAM, and it errors out at 99% saying there is a duplicate key.
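
For reference, the build is a plain unique index creation, roughly like this (a sketch; the actual field names are shown further down in the thread, and background: true is an assumption since this is a 4.0 instance):

db.msgs.createIndex(
  { client_id: 1, message_id: 1 },    // the fields that should be unique together
  { unique: true, background: true }  // background build so the db stays available
)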

If I manually try to delete the duplicates, it takes hours. And it seems the dropDups option was removed in recent versions of MongoDB.

So, what do I do?

Is there any way to make it log the _id of the duplicate key found?

And what would be the fastest way to delete all the dupes?


Hi @V_N_A,

Wow, honestly there is no easy way to find it in such a huge collection.

I would say run a $group aggregation on that field and locate the values with more than one occurrence.

Try to run it on a secondary, or even spin it up on a temp port, to isolate the environment while locating the duplicates.
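
For example, if you have a copy of the data files, a throwaway instance can be started on a temporary port like this (the paths and port number are just examples):

mongod --dbpath /data/msgs_copy --port 27018 --logpath /data/msgs_copy/mongod.log --fork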

Once you find the value you can locate the _id, or even collect the _ids in the $group stage:

$group : { _id : "$FIELD" , count : { $sum : 1 } , id : { $push : "$$ROOT._id" }}
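
For example, the full pipeline would look something like this ($FIELD and the collection name msgs are placeholders; allowDiskUse is likely needed at this scale):

db.msgs.aggregate([
  { $group : { _id : "$FIELD", count : { $sum : 1 }, id : { $push : "$$ROOT._id" } } },
  { $match : { count : { $gt : 1 } } }
], { allowDiskUse: true })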

Thanks
Pavel


Hey @Pavel_Duchovny, thank you for responding. I will give you some more context. A document in my collection looks like this:

{
        "_id" : ObjectId("6055ec15bfcb7198dffa042c"),
        "message_id" : "TeiaBrCXgGyaf6aTk",
        "client_id" : "april",
        // bunch of other fields
} 

_id is the primary key, and I want to create the unique index on the (client_id, message_id) fields. I tried to find the dupes using the aggregation framework, but it was very slow; the console wouldn’t respond for hours.

So, I created two indexes:

{
	"v" : 2,
	"key" : {
		"client_id" : 1,
		"message_id" : 1
	},
	"name" : "client_id_1_message_id_1",
	"ns" : "my_db.msgs"
},
{
	"v" : 2,
	"key" : {
		"_id" : 1,
		"client_id" : 1,
		"message_id" : 1
	},
	"name" : "_id_1_client_id_1_message_id_1",
	"ns" : "my_db.msgs"
},

This is my aggregate query; do you find anything odd in it?

db.msgs.aggregate([
  { $group: {
    _id: { "message_id": "$message_id", "client_id": "$client_id" },
    count: { $sum: 1 },
    id : { $push : "$$ROOT._id" } 
  } },
  { $sort : { count : -1} }, 
  { $match: { 
    count: { $gte: 2 } 
  } },
], {
  allowDiskUse: true
});

Why is this not using any of the indexes? Here is what it says when I run explain on it:

{
	"stages" : [
		{
			"$cursor" : {
				"query" : {

				},
				"fields" : {
					"client_id" : 1,
					"message_id" : 1,
					"_id" : 1
				},
				"queryPlanner" : {
					"plannerVersion" : 1,
					"namespace" : "my_db.msgs",
					"indexFilterSet" : false,
					"parsedQuery" : {

					},
					"winningPlan" : {
						"stage" : "COLLSCAN",
						"direction" : "forward"
					},
					"rejectedPlans" : [ ]
				}
			}
		},
		{
			"$group" : {
				"_id" : {
					"message_id" : "$message_id",
					"client_id" : "$client_id"
				},
				"count" : {
					"$sum" : {
						"$const" : 1
					}
				},
				"id" : {
					"$push" : "$$ROOT._id"
				}
			}
		},
		{
			"$match" : {
				"count" : {
					"$gte" : 2
				}
			}
		},
		{
			"$sort" : {
				"sortKey" : {
					"count" : -1
				}
			}
		}
	],
	"serverInfo" : {
		"host" : "my_db",
		"port" : 27017,
		"version" : "4.0.23",
		"gitVersion" : "07c6611b38d2aacbdb1846b688db70b3273170fb"
	},
	"ok" : 1
}

Hi @V_N_A,

It needs to do a full scan of the 300M documents for this grouping; this is heavy and may take many hours…

As suggested, consider isolating it on a secondary started on a temp port, or take one of your backups and spin it up on a dedicated host.

There is no way to avoid the scanning here…

As suggested, consider isolating it on a secondary started on a temp port, or take one of your backups and spin it up on a dedicated host.

Yup, I’m doing this on a machine into which I had loaded the data from a dump. It’s not part of the replica set.

Thanks for confirming, @Pavel_Duchovny. I am running the aggregate query and will wait till it is finished.

I would really appreciate it if you could have a look at this: What happens in case of a conflict during replication?

Is there any way to speed up the background index creation? I am on MongoDB 4.0.10; does the maxIndexBuildMemoryUsageMegabytes option apply? The parameter’s page says it applies to foreground builds, but on the Index Build Operations on a Populated Collection — MongoDB Manual page it is listed under the background index creation section, which does seem to imply that it applies to background index builds as well.
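
For reference, the parameter can be checked and raised at runtime like this (whether it actually affects a background build on 4.0 is exactly what I am unsure about):

db.adminCommand({ getParameter: 1, maxIndexBuildMemoryUsageMegabytes: 1 })     // check the current value
db.adminCommand({ setParameter: 1, maxIndexBuildMemoryUsageMegabytes: 2048 })  // e.g. raise it to 2 GB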

@V_N_A,

Not really. Another option is to upgrade to 4.2+ with the improved index builds…

Thanks.
Pavel

Use the following code in the mongo shell (set col_name to your collection first):

var col_name = "msgs";   // your collection name

db.getCollection(col_name).aggregate([
  {
    "$group": {
      _id: { docId: "$docId" },       // group by the field that should be unique
      dups: { $addToSet: "$_id" },    // collect the _ids sharing that value
      count: { $sum: 1 }
    }
  },
  {
    "$match": {
      count: { "$gt": 1 }             // keep only groups with duplicates
    }
  }
], { allowDiskUse: true }).forEach(function(doc) {
  doc.dups.shift();                   // keep the first _id
  db.getCollection(col_name).remove({
    _id: { $in: doc.dups }            // delete the remaining duplicates
  });
});

Replace docId with the key which you want to be unique.
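
On a collection this size it may be faster to batch the deletes instead of issuing one remove() per duplicate group, e.g. with an unordered bulk operation. A sketch against the msgs collection and the (client_id, message_id) key from this thread:

var bulk = db.msgs.initializeUnorderedBulkOp();
var n = 0;

db.msgs.aggregate([
  { $group: { _id: { client_id: "$client_id", message_id: "$message_id" },
              dups: { $addToSet: "$_id" },
              count: { $sum: 1 } } },
  { $match: { count: { $gt: 1 } } }
], { allowDiskUse: true }).forEach(function(doc) {
  doc.dups.shift();                          // keep one document per key
  doc.dups.forEach(function(dupId) {
    bulk.find({ _id: dupId }).removeOne();   // queue the delete
    n++;
    if (n % 1000 === 0) {                    // flush every 1000 deletes
      bulk.execute();
      bulk = db.msgs.initializeUnorderedBulkOp();
    }
  });
});

if (n % 1000 !== 0) {                        // flush whatever is left
  bulk.execute();
}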