Aggregation by Date

Peter_Alderson · May 14, 2020, 6:02pm

MongoDB 3.6.18

I have two queries which work…

var q1 = [
    { $unwind: '$resp.'+starCode },
    { $project: { stars: '$resp.'+starCode } },
    { $group: {
        _id: 1,
        total: { $sum: '$stars' },
        count: { $sum: { $cond: [{ $ne: ['$stars',''] },1,0] } },
        average: { $avg: '$stars' }          
        }
    }
]

which returns…

[
 {
  "_id": 1,
  "total": 51,
  "count": 14,
  "average": 3.642857142857143
 }
]

And another…

var q2 = [       
    { $group:  {
        _id: { $dateToString: { format: '%H:%M %Y-%m-%d', date: '$date' } },
        count: { $sum: 1 }
      }
    }        
]

which returns…

[
 {
  "_id": "14:58 2020-05-14",
  "count": 2
 },
 {
  "_id": "14:46 2020-05-14",
  "count": 2
 },
etc,etc,etc.

However when I try an combine these ( count of stars in the period ) using…

var q3 = [
    { $unwind: '$resp.'+starCode },
    { $project: { stars: '$resp.'+starCode } }, 
    { $group:  {
        _id: { $dateToString: { format: '%H:%M %Y-%m-%d', date: '$date' } },
        total: { $sum: '$stars' },
        count: { $sum: 1 }
      }
    }            
]

The the _id value becomes NULL and there’s no grouping - ie…

[
 {
  "_id": null,
  "total": 51,
  "count": 14
 }
]

Any Suggestions welcome.

Doug_Duncan · May 14, 2020, 7:07pm

Hi @Peter_Alderson, your $project stage is saying only pass on the stars field to to the next stage. That means that date if not present in the $group stage. Try adding date: 1 into the $project stage to see if you get the expected results.

Prasad_Saya · May 15, 2020, 3:38am

As @Doug_Duncan mentioned you can change your $project stage, or instead use a $addFields stage:

{ $addFields: { stars: '$resp.'+starCode } }

Both these stages have some common functionality, and different purposes. $project is mostly used to include / exclude fields from the document to be accessed in the following stage. $addFields adds new fields to the already existing ones in a document.

In your case, the $project stage looks appropriate as you are using only two fields in the following $group stage.

Peter_Alderson · May 15, 2020, 8:53pm

Thanks @Doug_Duncan.

Yep, that worked. I was assuming that $project was additive !

Peter

Doug_Duncan · May 15, 2020, 9:18pm

Hi @Peter_Alderson glad that worked out for you.

$project either I want this list of fields, or I don’t want this subset returned.

For adding in new fields, while keeping all the rest that are in the current pipeline, you would use $addFields as @Prasad_Saya mentioned earlier.

Which you use depends on which fields you need from that point on. It’s best to only send the fields on that you need to complete the pipeline to save on the amount of data being passed around.

Stennie_X · May 16, 2020, 4:36am

The aggregation pipeline automatically determines fields that are required, so it is actually best to only add a $project stage if results need to be renamed or reshaped (typically at the end of a pipeline).

Adding an early $project stage can be less efficient because it bypasses the automatic dependency analysis.

If you don’t need to rename or reshape results, there is no need to include a $project stage.

Regards,
Stennie

Prasad_Saya · May 16, 2020, 5:56am

The aggregation can be improved with the pipeline

var q3 = [
    { $unwind: '$resp.'+starCode },
    { $project: { stars: '$resp.'+starCode } }, 
    { $group:  {
        _id: { $dateToString: { format: '%H:%M %Y-%m-%d', date: '$date' } },
        total: { $sum: '$stars' },
        count: { $sum: 1 }
      }
    }            
]

replaced with the following, and eliminating the $project stage:

var q3 = [
    { $unwind: '$resp.'+starCode },
    { $group:  {
        _id: { $dateToString: { format: '%H:%M %Y-%m-%d', date: '$date' } },
        total: { $sum: '$resp.'+starCode },
        count: { $sum: 1 }
      }
    }    
]

Peter_Alderson · May 16, 2020, 12:37pm

Hi @Prasad_Saya,

Can you explain to me the benefit of the improvement you identify ?

Peter

Prasad_Saya · May 16, 2020, 1:11pm

Not having the $project stage. This means your aggregation doesn’t have to scan all the documents after the initial $unwind stage. That is less processing. It matters when there are a large number of documents.

Doug_Duncan · May 16, 2020, 5:26pm

@Stennie_X, you know more about MongoDB and it’s inner workings than I do since you’ve worked at MongoDB for a number of years now, and I am glad that you are here to share your knowledge.

The link you posted is to only the optimization where MongoDB can pull up $match() stages if they are after projections. I think what you really meant was to link the the section above that:

Projection Optimization

The aggregation pipeline can determine if it requires only a subset of the fields in the documents to obtain the results. If so, the pipeline will only use those required fields, reducing the amount of data passing through the pipeline.

Now having said that, I don’t know that I necessarily agree with your statement:

Sure the optimizer can figure out the fields necessary to pass through the pipeline and I see that being a great thing badly written aggregation pipelines. I prefer, however, to be explicit about what I am passing through (it helps me see the data that I’m interested in, especially if there are a large number of fields) and until it’s proven that running a projection, early in the process, can cause performance issues I will continue to $project() only the fields that I need in my pipelines as early as I can. That doesn’t mean that my way is better, but if I’m not having any performance issues, I’m OK with doing it this way. Also isn’t the optimizer just implicitly doing what I’m explicitly doing?

steevej · May 17, 2020, 12:25pm

In addition I will think that if you $project fields of a compound index you avoid a document fetch.

Asya_Kamsky · May 17, 2020, 10:27pm

Stennie is right, $project can make your pipeline less efficient if you happen to specify a field you don’t need or forget to exclude a field you don’t.

For instance in order for a covered index on a and b to be used in an explicit projection you would have to remember to exclude _id if it’s not in the index and not needed.

Unnecessary stages are better left out.

In fact, in this particular aggregation you don’t even need $unwind - it should just be a single $group stage.

Asya

system · May 18, 2020, 10:40pm

This topic was automatically closed 24 hours after the last reply. New replies are no longer allowed.