Using setDifference with mutiple collections

I am working with backups. At the moment I have two collections: A and A2.
documents in both locations look like this
{_id:ObjectId(“6456456456”), “name”:“a_unique_name”}

The documents in A don’t share the same _id, they only share the “name” attribute.

So my question is how can I use an aggregation pipeline to compare the two collections and find the difference between the sets of “name” attributes.

Someone gave me a solution that required the creation of an additional attribute. I don’t think that sounds like an efficient idea. I image MongoDB is very powerful and should be able to not resort to setting and unsetting attributes per document when I have almost a billion of them.

My approach would use $lookup.

I would aggregate first on A to $lookup in A2, then a second aggregation on A2 to $lookup in A.

pipeline_A = [ ] ;

lookup = { "$lookup" : {
    "from" : "A2" ,
    "localField" : "name" ,
    "foreignField" : "name" ,
    "as" : "A2" ,
    "pipeline" : [
        { "$project" : { "_id" : 1 } ,
        { "$limit" : 1 }
    ]
} }

pipeline_A.push( lookup ) ;

match = { "$match" : {
    "A2.0" : { "$exists" : false }
} } ;

pipeline_A.push( match ) ;

project = { "$project" : {
    "name" : 1
} }

/* The following should produce names in A that are not in A2 */

db.A.aggregate( pipeline_A ) ;

You then do the same with pipeline_A2.

Yes, there are 2 database accesses but with $unionWith you can do it in 1.

Yes, matching names are processed twice.

Yes, you need an index with name:1 prefix to have some kind of performance.

When doing that sort of things, I like to $out the result in a temporary collection to explore and process the results later.

2 Likes

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.