What's the best way to count documents using $group

satoru_turing · May 22, 2025, 5:59pm

Hello!

I have a collection with about 100k documents (and this number increases every month), and I need to count the documents based on some filter parameters. I’m using aggregation with $group , but in some cases, I need to perform complex checks within the $group stage. I wanted to know if there might be a better way to achieve this:

const result = await collection.aggregate([
  {
    $match: {
      product: params.product, // this is dynamic
      resolved: { $ne: true },
    },
  },
  {
    $group: {
      _id: null,
      countItemsGeneratedWithDelay: {
        $sum: {
          $cond: [
            {
              $and: [
                { $eq: [generated, true] },
                { $eq: [shouldGenerate, true] },
                { $eq: [hasDelayOnGeneration, true] },
              ],
            },
            1,
            0,
          ],
        },
      },
      countNotGeneratedItems: {
        $sum: {
          $cond: [
            {
              $and: [
                { $eq: [generated, false] },
                { $eq: [shouldGenerate, true] },
              ],
            },
            1,
            0,
          ],
        },
      },
      countItemsWithAlertLevelLow: {
        $sum: { $cond: [{ $eq: [alertLevel, 'low'] }, 1, 0] },
      },
      countItemsWithAlertLevelMedium: {
        $sum: { $cond: [{ $eq: [alertLevel, 'medium'] }, 1, 0] },
      },
      countItemsWithAlertLevelHigh: {
        $sum: { $cond: [{ $eq: [alertLevel, 'high'] }, 1, 0] },
      },
    },
  },
  {
    $project: {
      _id: 0,
      alertLevel: {
        low: '$countItemsWithAlertLevelLow',
        medium: '$countItemsWithAlertLevelMedium',
        high: '$countItemsWithAlertLevelHigh',
      },
      countNotGeneratedItems: 1,
      countItemsGeneratedWithDelay: 1,
    },
  },
]).toArray();

I have many more conditions to check and count. Is there a better way to achieve the same result?

daniel_scout · May 23, 2025, 3:13am

not 100% sure what the end goal is,
for example: are you aiming for speed, making it less complex (readability)

i’ll answer both:

speed:

stepping back a bit, something that you can do is ensure you have proper indexes for the fields that you’re filtering on and grouping on. just guessing here though since you didnt provide indexes…ignore this if you’re already doing this

developer experience (i like to call it this…since this can get complex fast)

you can use $addFields. it’ll make this way more readable and maintainable…possibly faster

{
    $addFields: {
      isGeneratedWithDelay: {
        $and: [
          { $eq: ["$generated", true] },
          { $eq: ["$shouldGenerate", true] },
          { $eq: ["$hasDelayOnGeneration", true] },
        ],
      },
      isNotGenerated: {
        $and: [
          { $eq: ["$generated", false] },
          { $eq: ["$shouldGenerate", true] },
        ],
      },
    },
  },

speed/devx

or more advanced, use $facet. this allows you to run multiple aggregation pipelines in parallel…which can be more efficient than multiple conditional sums

steevej · May 23, 2025, 1:02pm

I agree with

But I am not so sure that it could be

I really like

An alternative path to run multiple pipelines is $unionWith. Each pipeline would run on the same coll: and start with a $match and end with a $count. I would keep the $group in the main pipeline but with _id:‘$alertLevel’. I would do the other 2 ‘complicated’ cases with the $unionWith. Yes the $match will repeat the product and resolve criteria, but since you should have an index (possibly 2 partial index one with resolved:true and one with resolve:false) it should be fast. To resume it could look like

Since I really have no clue if the above would be faster than

You should investigate both. The $facet has the potential advantage of matching product and resolve only once. But with proper index, it might not be such a drawback for $unionWith. And please report here the results.

One advantage of $unionWith is that you can develop and test the pipeline independently since they do not reply on the $match of the main pipeline.

I have some reservation about:

Depending of you data and indexes resolved:false might be a better choice. In particular if resolved always exist, you could have a partial index on product with resolved:false and another one with resolved:true.

daniel_scout · May 23, 2025, 1:17pm

oh this is cool i didnt know you could reply to specific portions like quotes!

my main concern with the $unionWith is that even though its the same collection it does separate scans. iow they are not parallelized by default

@steevej is correct on $addFields not being faster.
i ran some tests right now out of curiousity and aside from making the pipeline easier to maintain, it wont make it faster.

an alternative, although it sort of creates more work is to create a separate pre-aggregated collection during write time. i didn’t mention this before as this is another thing to maintain, but could be viable if the aggregation is still a hotspot

steevej · May 23, 2025, 7:19pm

I completely agree.

This is why I mentioned

satoru_turing · May 27, 2025, 1:59pm

Hello @steevej and @daniel_scout, thank you both for your help. After running some tests, I found that the approach using $facet was a little faster than using $unionWith . The execution with $facet took 2.7 seconds, while $unionWith took 3.3 seconds. It’s not a big difference, but I think $facet is easier to maintain, so I’ll go with it. I also reviewed my indexes and made some adjustments to improve them.

steevej · May 27, 2025, 3:44pm

Thanks for the followup.

system · June 1, 2025, 3:44pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.