Finding duplicates in nested subdocuments

Hi all,

Here is a sample document

{
	_id: ObjectId("5f4e18f4f90762fe549a1eee"),
	name: "Midhuna",
	age: 83,
	places: "New York",
	hobbies: ["Singing", "Reading Books"],
	child: [
		{
			_id: ObjectId("5fde18f4f90762fe549a1eee"),
			uniqueid: 1234,
			name: "Akash",
			age: 25
		},
		{
			_id: ObjectId("5fde18f4f90762fe549a1eee"),
			uniqueid: 1234,
			name: "Ratan",
			age: 35
		}
	],
	jobs: [
		{
			_id: ObjectId("5fde18f4f90762fe549a554e"),
			uniqueid: 3234,
			title: "Engineer",
			Salary: 325
		},
		{
			_id: ObjectId("5fde18f4f90762fe549a554e"),
			uniqueid: 1234,
			title: "Sales Director",
			Salary: 354
		},
		{
			_id: ObjectId("5fde18f4f90762fe549a554e"),
			uniqueid: 12345,
			title: "Customer Success",
			Salary: 445
		}
	]
}

I have a collection with several arrays/subdocuments. I want to find out if there are duplicates in the ‘_id’ field. This has happened in all documents of my database: some nested documents have the same _id despite having different content/values. For example, in the sample doc I shared above, both the child array and the jobs array have duplicates in the _id field even though all other fields are completely different. I have over 100 different subdocs/arrays, and I want to find out how many subdocs have duplicates at the _id level and which ones they are.

I used this to find duplicates from a single subdoc:

db.customer.aggregate([
  { $unwind: "$child" },
  {
    $group: {
      _id: {
        _id: "$_id",
        child_uniqueid: "$child._id"
      },
      sum: { $sum: 1 }
    }
  },
  { $match: { sum: { $gt: 1 } } }
])

This works fine. However, I have over 100 subdocs/arrays and I don’t want to manually change the array name here and run it 100 times. Is there a way I can loop over the document and run this code for all arrays automatically?

You must have the names of those 100 subdocs/arrays somewhere, in a list of some sort or in a schema.

It should then be easy to write a simple for-loop that runs your aggregation for each of those names. Something along the lines of:

field_names = [ /* from your schema or data dictionary */ ] ;
for ( const field_name of field_names ) {
  const unwind = { "$unwind" : "$" + field_name } ;
  const group = { "$group" : {
      "_id" : { "_id" : "$_id" , [ field_name + "_uniqueid" ] : "$" + field_name + "._id" } ,
      "sum" : { "$sum" : 1 }
  } } ;
  const match = { "$match" : { "sum" : { "$gt" : 1 } } } ;
  // print the duplicates found for this array field
  printjson( db.customer.aggregate( [ unwind , group , match ] ).toArray() ) ;
}
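
If you don’t have that list written down anywhere, one possible way is to derive it from a sample document in the shell. This is only a sketch and assumes a single document is representative of the whole collection:

// Sketch: build field_names from one sample document by keeping only the
// keys whose values are non-empty arrays of subdocuments.
sample = db.customer.findOne() ;
field_names = Object.keys( sample ).filter( function ( k ) {
  return Array.isArray( sample[ k ] ) &&
         sample[ k ].length > 0 &&
         typeof sample[ k ][ 0 ] === "object" ;
} ) ;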

For one-time validation or migration, it is sometimes easier and faster to do it in a little script rather than trying to do everything in a single aggregation. Personally, I would $merge the results into a temporary collection until everything is sanitized, along the lines of the sketch below.
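
Here is a hedged sketch of what that could look like, reusing the same per-field pipeline and writing each result into a scratch collection. The dup_report collection name and the extra field key are my own choices for illustration:

// Sketch only: collect every duplicate report in one temporary collection.
for ( const field_name of field_names ) {
  db.customer.aggregate( [
    { "$unwind" : "$" + field_name } ,
    { "$group" : {
        "_id" : { "_id" : "$_id" , [ field_name + "_uniqueid" ] : "$" + field_name + "._id" } ,
        "sum" : { "$sum" : 1 }
    } } ,
    { "$match" : { "sum" : { "$gt" : 1 } } } ,
    { "$addFields" : { "field" : field_name } } ,   // remember which array the duplicate came from
    { "$merge" : { "into" : "dup_report" } }
  ] ) ;
}

You can then inspect db.dup_report.find() once, instead of reading 100 separate cursor outputs.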

You can do this without having to know all the array fields. Here’s how:

[
{
	"$project" : {
		"arrays" : {
			"$filter" : {
				"input" : {
					"$objectToArray" : "$$ROOT"
				},
				"cond" : {
					"$and" : [
						{
							"$eq" : [
								"array",
								{
									"$type" : "$$this.v"
								}
							]
						},
						{
							"$ne" : [
								[ ],
								"$$this.v._id"
							]
						}
					]
				}
			}
		}
	}
}, /* this stage finds all arrays whose elements have an `_id` field */
{
	"$project" : {
		"a" : {
			"$map" : {
				"input" : "$arrays",
				"in" : {
					"arrayName" : "$$this.k",
					"size" : {
						"$size" : "$$this.v"
					},
					"uniqueIds" : {
						"$size" : {
							"$setUnion" : "$$this.v._id"
						}
					}
				}
			}
		}
	}
}  /* this stage calculates the number of array elements and the number of unique `_id` values */
]

The output this gives on the sample document you provided is this:

{
	"_id" : ObjectId("5f4e18f4f90762fe549a1eee"),
	"a" : [
		{
			"arrayName" : "child",
			"size" : 2,
			"uniqueIds" : 1
		},
		{
			"arrayName" : "jobs",
			"size" : 3,
			"uniqueIds" : 1
		}
	]
}

Now it’s simple to filter out the entries where size is equal to uniqueIds, and then match only documents where at least one member is left in the result array…

You can even do that in a single $match:

{$match: {$expr: {$ne: [ 
            [ ], 
            {$filter: {
                  input:"$a", 
                  cond: {$ne: [ "$$this.size", "$$this.uniqueIds"]} 
            }} 
]}}}

Asya
P.S. note that you do NOT need either $unwind or $group since each question can be answered within each individual document. If you wanted to know if there are duplicates across documents, only then would you need to unwind and group.
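
A minimal sketch of that cross-document version, using the child array as an example and assuming you want every embedded _id that appears in more than one top-level document:

// Sketch: find child._id values shared by more than one customer document.
db.customer.aggregate( [
  { "$unwind" : "$child" } ,
  { "$group" : {
      "_id" : "$child._id" ,                // group by the embedded _id alone
      "docs" : { "$addToSet" : "$_id" } ,   // top-level documents that contain it
      "count" : { "$sum" : 1 }
  } } ,
  { "$match" : { "$expr" : { "$gt" : [ { "$size" : "$docs" } , 1 ] } } }
] )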
