I have a collection with several arrays/sub documents. I want to find out if there are duplicates in ‘_id’ field. This has happened in all documents of my database - some nested documents have the same _id despite having different content/values. For example; in the below sample doc I shared the child array and the jobs array both have duplicates in _id field despite all other fields being completely different. I have over 100 different subdocs/arrays and I want to find out how many subdocs have duplicates at _id level and what are those
I used this to find duplicates from a single subdoc:
This works fine. However, I have over 100 subdocs/arrays and I don’t want to manually change the array name here and run it a 100 times. Is there a way, I can loop over the document and run this code for all arrays automatically?
You must have those 100 subdocs/arrays name somewhere in a list of some sort or in a schema.
It should then be easy to write a simple for-loop that runs your aggregation for each of those names. Something along the lines
field_names = [ /* from your schema or data dictionary */ ] ;
for ( field_name of field_names ) {
unwind = { "$unwind" : "$" + field_name } ;
group = { "$group" :
{ "_id" : { "_id" : "$_id" , [ field_name + "_uniqueid" ] : "$" + field_name + "._id" } ,
{ "sum" : { "$sum" : 1 } }
}
match = { "$match" : { "sum" : { "$gt" : 1 } } }
db.customer.aggregate( [ unwind , group , match ] ) ;
}
For one time validation or migration, it is sometimes easier and faster to do it in a little script rather than trying to do everything in a single aggregation. Personally, I would $merge the results into a temporary collection until everything is sanitized.
Now it’s simple to filter out subdocuments where size is equal to uniqueIds and then match only documents where there’s at least one member of the result array…
Asya
P.S. note that you do NOT need either $unwind or $group since each question can be answered within each individual document. If you wanted to know if there are duplicates across documents, only then would you need to unwind and group.