$group overriding $project MOngoDB

I m working with MongoDB but i m facing problem with $project not displaying the data after $group is executed .

My database look like this

# In DATABASE
{
    "_id": 8144,
    "merchant_id": 8144,
    "name": "Google",
    "email": "google-world@gmail.com",
}
{
    "_id": 2,
    "merchant_id": 2,
    "name": "LENOVO",
    "email": "lenovo-world@gmail.com",
    "submerchant_id": 8144,
}
{
    "_id": 3,
    "merchant_id": 3,
    "name": "HP",
    "email": "hp-world@gmail.com",
    "submerchant_id": 8144,
}
{
    "_id": 4,
    "merchant_id": 4,
    "name": "DELL",
    "email": "dell-world@gmail.com",
    "submerchant_id": 8144,
}

The output I want is to display the details of the merchant along with count of submerchants

#Output
{
    "_id": 8144,
    "merchant_id": 8144,
    "name": "Google",
    "email": "google-world@gmail.com",
    "count": 3,  
    # merchant_id : 8144 has 3 submerchants (Lenovo,hp,dell)
}

MY QUERY is below

db.merchant_info.aggregate([
 {  '$group': {'_id': '$submerchant_id', 'count': {'$sum': 1}}},

{ '$project': {'merchant_id': 1, 'name': 1, 'email': 1,'count' :1} }])
           

BUT I m getting the output has this

{ _id: 8144, count: 3 }

Why is it not showing the merchant_id,name,email data ? Where am i going wrong please let me know ?

output of $group stage has only two fields: _id and count, so regardless of $projection, you won’t get them. You need to somehow fetch merchant_id/name/email… based on the _id field of $group output.

I’m not familiar with aggregation so not sure how to fix it.

Ignore what I wrote in my first edit. Which I keep here for prosperity.

I think that the _id of your $group is wrong. From the output you shared, it should be $merchant_id.

To get merchant_id, name and email in the output, use $first accumulator.

It looks like the monkey felt from the tree. I did not analyzed the sample data enough. Asya’s following post made me realized that.

This is a cute problem. $project operates on the documents that come out of the stage before it, and your $group stage only outputs two fields, so of course there is no way to manufacture the needed details. Here is how I would recommend doing the aggregation you need:

db.merchant_info.aggregate([
    {$set:{
        submerchant_id:{$ifNull:["$submerchant_id","$merchant_id"]},
        submerchant:{$cond:{if:{$eq:["missing",{$type:"$submerchant_id"}]},then:false, else:true}}}
    }, 
    {$sort:{submerchant_id:1, submerchant:1}},
    {$group:{
        _id:"$submerchant_id", 
        name:{$first:"$name"}, 
        email:{$first:"$email"}, 
        count:{$sum:1}, 
        submerchants:{$push:{_id:"$_id",name:"$name",email:"$email"}}
    }},
    {$project:{
        _id:0, 
        merchant_id:"$_id", 
        name:1, 
        email:1, 
        count:{$subtract:["$count",1]}, 
        submerchants:{$slice:["$submerchants",1,"$count"]}}
    }
] )

Output would be:

{
"name" : "Google",
"email" : "google-world@gmail.com",
"merchant_id" : 8144,
"count" : 3,
"submerchants" : [
	{
		"_id" : 2,
		"name" : "LENOVO",
		"email" : "lenovo-world@gmail.com"
	},
	{
		"_id" : 3,
		"name" : "HP",
		"email" : "hp-world@gmail.com"
	},
	{
		"_id" : 4,
		"name" : "DELL",
		"email" : "dell-world@gmail.com"
	}
]
}

The trick I used is taking advantage of the fact that the “parent” merchant does not have the “submerchant_id” field. I added it so that it would be there to group by, I added a field to preserve which document was the parent merchant and I sorted it by that field (taking advantage of false sorting before true) so that parent would be first and I could use $first to preserve its name, etc. Then in the last $project stage I reshape the document and correct count (to remove the parent from the count of its submerchants) and remove it from the array of submerchants as well.

There are probably many different ways to achieve this, hope this example helps you understand and decide whether this is the right approach for you or not. The way you did it can actually work as well, but after the $group stage you would need to perform $lookup on the same collection to get the fields you are trying to add in $project. That would also work, though you’d need to filter out the parent documents before (or after) the $group otherwise you’ll have an extra document with _id: null and you’d still need to do some reshaping of documents in the last stage.

Asya

5 Likes

Hi Asya ,
Your Solution is Great !
Just a small difference if the database as
merchant 1 → submerchant 1 of merchant 1 (merchant 2 ) → submerchant 1 of merchant 2

{
    "_id": 70,
    "merchant_id": 70,
    "name": "Apple",
    "email": "apple-world@gmail.com",
    "submerchant_id": 10,
}
{
    "_id": 8144,
    "merchant_id": 8144,
    "name": "Google",
    "email": "google-world@gmail.com",
    "submerchant_id": 70,
}
{
    "_id": 2,
    "merchant_id": 2,
    "name": "LENOVO",
    "email": "lenovo-world@gmail.com",
    "submerchant_id": 8144,
}
{
    "_id": 3,
    "merchant_id": 3,
    "name": "HP",
    "email": "hp-world@gmail.com",
    "submerchant_id": 8144,
}
{
    "_id": 4,
    "merchant_id": 4,
    "name": "DELL",
    "email": "dell-world@gmail.com",
    "submerchant_id": 8144,
}

How does the query change in this case?

Now that I understand the problem better, thanks to Asya, I would like to propose an alternative solution which uses $lookup rather than $group/$push to get the submerchants.

The first pipeline stage matches the parent merchant using:

match = { "$match" : {
    "submerchant_id" : { "$exists" : false }
} }

Documents out from this stage will documents with _id:8144, the _id of the $group.

Then the $lookup stage (corrected thanks to @Joseph_Anjilimoottil) is using the simplest form:

lookup = { "$lookup" : {
    "from" : "merchant_info" ,
    "localField" : "merchant_id" ,  
    "foreignField" : "submerchant_id" ,
    "as" : "submerchants"
    /* An optional pipeline could be added here in order to $project only the fields 
       that we want to keep (_id, name, email).   Sometimes it is better to do more work 
       on the server to download less data and sometimes it is better to do less work 
       to download more data.  It depends on the use-case and the data.  If you $project
       to get rid of a few simple fields, useless, if you $project to remove sensitive information
       or to reduce the size of bloated documents please do it.
    */
} }

After, this stage the document _id:8144 will be:

{ _id: 8144,
  merchant_id: 8144,
  name: 'Google',
  email: 'google-world@gmail.com',
  submerchants: 
   [ { _id: 2,
       merchant_id: 2,
       name: 'LENOVO',
       email: 'lenovo-world@gmail.com',
       submerchant_id: 8144 },
     { _id: 3,
       merchant_id: 3,
       name: 'HP',
       email: 'hp-world@gmail.com',
       submerchant_id: 8144 },
     { _id: 4,
       merchant_id: 4,
       name: 'DELL',
       email: 'dell-world@gmail.com',
       submerchant_id: 8144 } ] }

You still do not have the count but you could easily $set/$addFields or $project to get it. But since you are getting the array, there is no point to add load on the server to get something that your programming language gives you for free.

Side discussion:

One facet of using $lookup rather that $group that I do not know is performance. As far as I know, $group blocks until all incoming documents are processed. It makes sense since you cannot know if a document is part of any group or not. But it is possible that a $match’ed document with its $lookup could be produced faster if both are served by appropriate indexes and because $match and $lookup are not blocking.

May be, $graphLookup if you are using the $lookup alternative.

@Asya_Kamsky How does the query change in this case?

@steevej your lookup seem to be jumbled , shouldn’t it be like this

lookup = { "$lookup" : {
    "from" : "merchant_info" ,
    "localField" : "merchant_id" ,
    "foreignField" : "submerchant_id" ,
    "as" : "submerchants"
   
} }

You are right. I must have cut-n-paste the first version which was wrong. I will update my post and indicate that you pointed the error.

Looks like y’all figured it out - using $lookup will let you get the details of each merchant after grouping the submerchants by their “parent”. $graphLookup could be useful if you want to know all the “trees”. Only thing is I don’t think the $match works given that every sample document has a submerchant_id field present…

Asya

Yea $lookup worked :partying_face: :partying_face: !!

db.merchant_info.aggregate([
{'$lookup':  {
                    'from': "merchant_info",
                    'localField': "merchant_id",
                    'foreignField': "submerchant_id",
                    'as': "submerchant_merchants"
                }
  },
{ '$addFields': {'Count_merchant': {'$size': "$submerchant_merchants"}} },
{ '$project': {'merchant_id': 1, 'name': 1, 'email': 1,'Count_merchant' :1} }])

The thing that still doubts me is if i have over a million or 10 million documents , then loading the submerchants list and then counting the size , wouldn’t it take a lot of processing time ?

Why don’t you try it? In general in this pipeline the $group will be the most intensive stage, not $addFields or $project. $lookup may be depending on how many documents are going through that stage (it happens after $group so there are fewer than what you started with).

Asya

Hi All, I’ve been tinkering with this pipeline since I’ve got a very similar use case and I do have millions of documents to process, I’m noticing some interesting things related to performance.

In my use case, I need to guarantee the submerchants are unique, so I’m using $addToSet instead of $push.

If I don’t index any of the fields involved, performance begins to suffer noticeably on my particular cluster somewhere around 1M records. I added an index on merchant_id and didn’t really notice much improvement, but did notice that the winning query plan (from explain()) elected to use the index.
After that, I decided to add a few more indexes with the other fields involved in the $group operation, like like $name and $email, and also compound indexes with all three fields. The winning query plan was using the compound index that included the most fields. Performance was noticeably way better.

I’m confused by this - if the documents are already sorted before grouping, and then the group operation just needs to assemble the submerchants and they’re all “right there” next to each other, why does the $group operation perform so much worse without the compound index? Is mongo doing some sort of subquery on each call to $push/$addToSet as it’s assembling the submerchant array? If so, I’m thinking a non-group based strategy of streaming the documents through the pipeline would be more performant. Anyone know what’s going on during that $push in the $group call?

I’m confused by this - if the documents are already sorted before grouping, and then the group operation just needs to assemble the submerchants and they’re all “right there” next to each other, why does the $group operation perform so much worse without the compound index?

Probably because compound index includes all the fields it needs creating a covered index query - meaning everything is available in the index and there is no need to fetch the actual documents themselves.

Asya

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.