Performance Issue when using $group Pipeline

Hey there, we’re using M60 cluster ( 8.0.11 ), and we have one query that needs to run on all orders from a specific “tenant” ( you basically have multiple orders per tenant ).

Now the collection size is 50M documents, and the aggregation that I’m performing is:

  1. $match:
{
  tenantId: "MY_TENANT_ID",
  status: {$exists: 1}
}
  1. $group
{
  _id: "$uuid",
  condition_count_complated: {
    $sum: {
      $cond: [
        {
          $in: [
            "$status",
            [
              "WAITING_FOR_APPROVAL",
              "APPROVED",
              "SENT"
            ]
          ]
        },
        1,
        0
      ]
    }
  }
}
  1. $match
{
  condition_count_complated: {
    $eq: 0
  }
}

now most of the query execution time is the second pipeline stage, the $group.

I know that $group doesn’t use indexes ( even though uuid: 1, status: 1 index exists ).
Adding $sort before that will not help as I don’t use any accumulators (like $first / $last ) inside the $group.

The avg execution time for this query is about 10-20 min ( depending on which tenant id we matching at first, if one tenant has alot of orders that it will take more time ).

I would love to get some opinions here of how I can improve this result?
Does MongoDB is even being used for such operations?

Hi @yoni_egorov, so I took some time to understand what’s going on here, I did some research and yes you are correct, the $group operator does not use indexes. However, I think there’s a better approach to this.

In the $group stage, I can see that you are still filtering, which is slowing things down, I think you should move this extra condition to the first $match stage so this makes the query faster.

Plus, because the $match stage is basically a .find(), your indexes will work in the $match stage.

So the aggregation pipeline should look like this -

  1. $match
{
    tenantId: "MY_TENANT_ID",
    status: {
      $in: ["WAITING_FOR_APPROVAL", "APPROVED", "SENT"]
}
  1. $group
{
    _id: "$uuid"
}

This gets you the uuid of the documents you are really trying to find, then you can get all uuids by the tenant like this -

db.collection.distinct("uuid", { tenantId: "MY_TENANT_ID" })

Lastly, you subtract the first aggregation results from the collection.distinct() result to give you that difference you are looking for, I hope this helps.

1 Like

Additional to Michael’s answer above, since you have your last step has:

{
  condition_count_complated: {
    $eq: 0
  }
}

The 0 is for status NOT IN ["WAITING_FOR_APPROVAL", "APPROVED", "SENT"]. You can use $nin operator for that.

$nin selects the documents where:

  • the specified field value is not in the specified array or
  • the specified field does not exist.

So the first $match stage could directly be:

{
    tenantId: "MY_TENANT_ID",
    status: {
      $nin: ["WAITING_FOR_APPROVAL", "APPROVED", "SENT"]
}

Mongo Playground