What's the best way to count documents using $group

Hello!

I have a collection with about 100k documents (and this number increases every month), and I need to count the documents based on some filter parameters. I’m using aggregation with $group , but in some cases, I need to perform complex checks within the $group stage. I wanted to know if there might be a better way to achieve this:

const result = await collection.aggregate([
  {
    $match: {
      product: params.product, // this is dynamic
      resolved: { $ne: true },
    },
  },
  {
    $group: {
      _id: null,
      countItemsGeneratedWithDelay: {
        $sum: {
          $cond: [
            {
              $and: [
                { $eq: [generated, true] },
                { $eq: [shouldGenerate, true] },
                { $eq: [hasDelayOnGeneration, true] },
              ],
            },
            1,
            0,
          ],
        },
      },
      countNotGeneratedItems: {
        $sum: {
          $cond: [
            {
              $and: [
                { $eq: [generated, false] },
                { $eq: [shouldGenerate, true] },
              ],
            },
            1,
            0,
          ],
        },
      },
      countItemsWithAlertLevelLow: {
        $sum: { $cond: [{ $eq: [alertLevel, 'low'] }, 1, 0] },
      },
      countItemsWithAlertLevelMedium: {
        $sum: { $cond: [{ $eq: [alertLevel, 'medium'] }, 1, 0] },
      },
      countItemsWithAlertLevelHigh: {
        $sum: { $cond: [{ $eq: [alertLevel, 'high'] }, 1, 0] },
      },
    },
  },
  {
    $project: {
      _id: 0,
      alertLevel: {
        low: '$countItemsWithAlertLevelLow',
        medium: '$countItemsWithAlertLevelMedium',
        high: '$countItemsWithAlertLevelHigh',
      },
      countNotGeneratedItems: 1,
      countItemsGeneratedWithDelay: 1,
    },
  },
]).toArray();

I have many more conditions to check and count. Is there a better way to achieve the same result?

not 100% sure what the end goal is,
for example: are you aiming for speed, making it less complex (readability)

i’ll answer both:

speed:

  • stepping back a bit, something that you can do is ensure you have proper indexes for the fields that you’re filtering on and grouping on. just guessing here though since you didnt provide indexes…ignore this if you’re already doing this

developer experience (i like to call it this…since this can get complex fast)

  • you can use $addFields. it’ll make this way more readable and maintainable…possibly faster
{
    $addFields: {
      isGeneratedWithDelay: {
        $and: [
          { $eq: ["$generated", true] },
          { $eq: ["$shouldGenerate", true] },
          { $eq: ["$hasDelayOnGeneration", true] },
        ],
      },
      isNotGenerated: {
        $and: [
          { $eq: ["$generated", false] },
          { $eq: ["$shouldGenerate", true] },
        ],
      },
    },
  },

speed/devx

  • or more advanced, use $facet. this allows you to run multiple aggregation pipelines in parallel…which can be more efficient than multiple conditional sums
2 Likes

I agree with

But I am not so sure that it could be

I really like

An alternative path to run multiple pipelines is $unionWith. Each pipeline would run on the same coll: and start with a $match and end with a $count. I would keep the $group in the main pipeline but with _id:‘$alertLevel’. I would do the other 2 ‘complicated’ cases with the $unionWith. Yes the $match will repeat the product and resolve criteria, but since you should have an index (possibly 2 partial index one with resolved:true and one with resolve:false) it should be fast. To resume it could look like

Since I really have no clue if the above would be faster than

You should investigate both. The $facet has the potential advantage of matching product and resolve only once. But with proper index, it might not be such a drawback for $unionWith. And please report here the results.

One advantage of $unionWith is that you can develop and test the pipeline independently since they do not reply on the $match of the main pipeline.

I have some reservation about:

Depending of you data and indexes resolved:false might be a better choice. In particular if resolved always exist, you could have a partial index on product with resolved:false and another one with resolved:true.

2 Likes

oh this is cool i didnt know you could reply to specific portions like quotes!

my main concern with the $unionWith is that even though its the same collection it does separate scans. iow they are not parallelized by default

@steevej is correct on $addFields not being faster.
i ran some tests right now out of curiousity and aside from making the pipeline easier to maintain, it wont make it faster.

an alternative, although it sort of creates more work is to create a separate pre-aggregated collection during write time. i didn’t mention this before as this is another thing to maintain, but could be viable if the aggregation is still a hotspot

2 Likes

I completely agree.

This is why I mentioned

Hello @steevej and @daniel_scout, thank you both for your help. After running some tests, I found that the approach using $facet was a little faster than using $unionWith . The execution with $facet took 2.7 seconds, while $unionWith took 3.3 seconds. It’s not a big difference, but I think $facet is easier to maintain, so I’ll go with it. I also reviewed my indexes and made some adjustments to improve them.

2 Likes

Thanks for the followup.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.