Improve $GROUP performance

We are looking for ways to improve the performance of the $GROUP stage in our aggregation queries. Is there a way to configure $GROUP so it can handle a high volume of documents without first filtering the number of documents? I might be wrong, but it looks like $GROUP doesn't use an index. I added a $SORT before the $GROUP, but it just takes longer to finish. Will increasing I/O speed, memory, or CPU help improve performance? We have queries that take more than 5 minutes to run, with the bottleneck in the $GROUP stage, and hardware analysis shows no problem with disk, memory, or CPU. Is there any technique or configuration to make $GROUP run faster?


It is the expected behaviour if you $sort on fields that are not indexed.

If you do not need all the fields from your documents, you can add a $project stage backed by an index that covers the grouped fields, so that documents are not fetched at all. For example, if you want the min/max/avg of the field amount grouped by account_number, then an index {account_number: 1, amount: 1} and a $project stage that keeps only account_number and amount should be faster.
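
A minimal sketch of that approach, assuming a hypothetical transactions collection (the collection and field names are illustrative only):

db.transactions.createIndex({ account_number: 1, amount: 1 })

db.transactions.aggregate([
  // Keep only the indexed fields so the documents do not have to be fetched.
  { $project: { _id: 0, account_number: 1, amount: 1 } },
  { $group: {
      _id: "$account_number",
      minAmount: { $min: "$amount" },
      maxAmount: { $max: "$amount" },
      avgAmount: { $avg: "$amount" }
  } }
])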

If you need all the fields of all documents, then having enough RAM to hold all documents in memory will help; fetching documents from disk is very slow.


Hi @Eric_Pedregosa, and welcome to the forums!

Is there any technique or configuration to make $GROUP run faster?

In addition to what @steevej has mentioned, if there are no indexes, please consider adding indexes on the relevant keys where appropriate. See also db.collection.createIndex() and Indexes for more information.
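
For example (using a hypothetical transactions collection and placeholder field names), you can check which indexes already exist and add one covering the keys you match and group on:

// List the indexes already defined on the collection.
db.transactions.getIndexes()

// Add a compound index on the keys used for matching and grouping.
db.transactions.createIndex({ account_number: 1, amount: 1 })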

I might be wrong, but it looks like $GROUP doesn't use an index

Please see Pipeline Operators and Indexes for more information on the $group stage and indexes.

You can also see more detailed information about the aggregation pipeline execution plan by running db.collection.explain().aggregate([ … ]).
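
For instance, a sketch of how that could be run for a grouping pipeline (the collection and field names below are assumptions, not taken from your post):

db.transactions.explain("executionStats").aggregate([
  { $match: { account_number: "A-1001" } },
  { $group: { _id: "$account_number", total: { $sum: "$amount" } } }
])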

There is already an Aggregation Pipeline Optimisation in place for this. The optimiser can determine whether the pipeline requires only a subset of the fields in the documents to obtain the results, and will then use only those required fields. The general recommendation in this case is to avoid adding $project before $group, as it may prevent the optimiser from working efficiently.
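
As an illustration (hypothetical names again), the pipeline below needs no explicit $project in front of the $group; the optimiser's dependency analysis already restricts the read to account_number and amount:

db.transactions.aggregate([
  { $match: { account_number: { $in: ["A-1001", "A-1002"] } } },
  // No $project needed here: the optimiser determines that only
  // account_number and amount are required by the $group stage.
  { $group: { _id: "$account_number", total: { $sum: "$amount" } } }
])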

Regards,
Wan.


Hi,

Thanks for the reply.

This is the structure of the aggregation. The $match stage used a covered index.

$match(a,b)
$group(c,d,e,f,sum(g))
$project(c,d,e,f,g)
option: allowDiskUse:true

index used - a,b,c,d,e,f,g (covered)

added $project(c,d,e,f,g) before $group - "millis" : 728257
added $sort(c,d,e,f,g) before $group - "millis" : 850411
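
As a rough mongo-shell sketch of the structure above (the field names a to g are the same placeholders, and the collection name is hypothetical):

db.mycollection.aggregate(
  [
    { $match: { a: "valueA", b: "valueB" } },
    { $group: {
        _id: { c: "$c", d: "$d", e: "$e", f: "$f" },
        g: { $sum: "$g" }
    } },
    { $project: { _id: 0, c: "$_id.c", d: "$_id.d", e: "$_id.e", f: "$_id.f", g: 1 } }
  ],
  { allowDiskUse: true }
)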

-sh-4.2$ free -m
              total        used        free      shared  buff/cache   available
Mem:         257788       23752      229323           1        4712      232751
Swap:         16383           0       16383

$match is using a covered query:

"keysExamined" : 84814763,
"docsExamined" : 0,

How do we force the $group to use an index? How do you know whether the $sort and $group are using an index, or which index they are using? Explain and the Profiler only show the index used in the $match stage. I requested additional memory and better IOPS, ran some slow queries with our DevOps team, and they observed there was no limitation on memory or I/O that would cause the query to be slow.

Thanks.
Eric

Hi,

Basically, the index to use is determined by the $match stage. How do we force the $group to use an index if the $match is on a,b and the $group is on c,d,e,f?

Also, the $group stage has a limit of 100 megabytes of RAM and requires the allowDiskUse option when it exceeds that. Does that mean we need faster I/O for a large number of records? In our case, the $group execution is slow, but nothing showed up when we monitored the I/O.
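
For context, a sketch of how the execution stats could be pulled for this pipeline (placeholder names as above; the exact output fields depend on the server version):

db.mycollection.explain("executionStats").aggregate(
  [
    { $match: { a: "valueA", b: "valueB" } },
    { $group: { _id: { c: "$c", d: "$d", e: "$e", f: "$f" }, g: { $sum: "$g" } } }
  ],
  { allowDiskUse: true }
)

// On recent server versions the "$group" entry in the "stages" array may
// report a "usedDisk" field showing whether the stage spilled to disk.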

Thanks.
Eric

Hi @Eric_Pedregosa,

In a development/testing environment, create a compound index for those fields. See also covered query.

Use the output of the explain command to diagnose whether the query is just reading from the index or fetching documents from disk (i.e. a FETCH stage).
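
For example (a sketch only, reusing the placeholder names from your pipeline), the plan chosen for the $match can be inspected like this:

db.mycollection.explain().aggregate([
  { $match: { a: "valueA", b: "valueB" } },
  { $group: { _id: { c: "$c", d: "$d", e: "$e", f: "$f" }, g: { $sum: "$g" } } }
])

// Depending on the server version, the output contains a $cursor stage whose
// queryPlanner.winningPlan shows an IXSCAN; if no FETCH stage appears, the
// data is read from the index only, otherwise full documents are retrieved.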

Regards,
Wan.