Aggregate $group then $limit queries on large datasets

I have a dataset of 3 million documents and I want to perform a $group query.

I find that if I $sort before I $group, my query only touches the indexes and goes pretty fast - but if the query has to look at the contents of the documents it takes forever. Worse, if I attempt a $match to reduce the input size it goes slower - even though my $match is on the same index as the $sort.

Is there some way to coax Mongo into handling $group queries better? For example, can I hint to the $group that the records are in order, so that once it finds a document belonging to a different group it knows the previous group is complete?


Hi @Matthew_Shaylor ,

Welcome to the MongoDB community.

The official way to optimize group stages is to sort and then group based on an appropriate index.

Now if you $match and then $group, the index will be used if the order of the fields follows Equality, Sort, Range (ESR). Since you group, I believe there may be a range predicate or other fields not in that order; additionally, make sure to add a $sort on the indexed fields between the $match and the $group.
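As a minimal shell-style sketch of that pattern (the collection name `orders`, the field `importer`, and the index are hypothetical placeholders, not taken from the question):

```javascript
// Hypothetical index: db.orders.createIndex({ importer: 1 });
// With the $sort on the same field and direction as the index, the sort (and
// the ordered input to $group) can come from the index instead of memory.
const pipeline = [
  { $match: { importer: { $gte: "A" } } },              // range on the indexed field
  { $sort: { importer: 1 } },                           // same order as the index
  { $group: { _id: "$importer", count: { $sum: 1 } } }, // groups arrive contiguously
];
// db.orders.aggregate(pipeline);
```

The key point is the stage order: $match, then $sort on the indexed field, then $group.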

Otherwise I need an explain plan and your indexes to assist further.

Thanks
Pavel

Here is an example; you can adapt it to your own query:

    return $collection->aggregate([
        ['$match' => ['date' => [
            '$gte' => '2022-01-01',
            '$lte' => '2022-12-31',
        ]]],
        ['$group' => [
            '_id' => '$importer',
            'success_num' => ['$sum' => 1],
        ]],
        ['$sort' => [
            'success_num' => -1,
        ]],
        ['$limit' => 50],
    ]);

How could I form multiple groups?
Let's say I want to keep groups like:
1: group 1 >> condition $lt
2: group 2 >> condition $gt
3: group 3 >> condition $eq

Hi @shoeb_khan ,

There are several ways to do it, including a $cond inside 3 $sum accumulators (when the condition is met you add 1, otherwise you add 0)…
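A sketch of that $cond approach (the field name `value` and the threshold are hypothetical), with the same accumulation written in plain JavaScript for illustration:

```javascript
// One $group pass producing three conditional counts; each $sum adds 1 when
// its $cond matches and 0 otherwise. "value" and 10 are hypothetical.
const groupStage = {
  $group: {
    _id: null,
    ltCount: { $sum: { $cond: [{ $lt: ["$value", 10] }, 1, 0] } },
    gtCount: { $sum: { $cond: [{ $gt: ["$value", 10] }, 1, 0] } },
    eqCount: { $sum: { $cond: [{ $eq: ["$value", 10] }, 1, 0] } },
  },
};

// The equivalent accumulation in plain JavaScript over sample documents:
const docs = [{ value: 5 }, { value: 10 }, { value: 15 }, { value: 3 }];
const counts = docs.reduce(
  (acc, d) => ({
    ltCount: acc.ltCount + (d.value < 10 ? 1 : 0),
    gtCount: acc.gtCount + (d.value > 10 ? 1 : 0),
    eqCount: acc.eqCount + (d.value === 10 ? 1 : 0),
  }),
  { ltCount: 0, gtCount: 0, eqCount: 0 }
);
// counts is { ltCount: 2, gtCount: 1, eqCount: 1 }
```

This keeps everything in a single pass over the documents, rather than one pass per group.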

Or another way is to use $facet with 3 pipelines, one per group:

$facet: {
  sum1: [{ $match: cond1 }, { $group: ... }],
  sum2: [{ $match: cond2 }, { $group: ... }]
}
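Filled out with hypothetical field names and conditions (nothing here comes from the original question), that $facet stage could look like:

```javascript
// Each facet runs its own sub-pipeline over the same input documents, so every
// group gets an independent $match + $group. "value" and 10 are hypothetical.
const facetStage = {
  $facet: {
    lt: [{ $match: { value: { $lt: 10 } } }, { $group: { _id: null, n: { $sum: 1 } } }],
    gt: [{ $match: { value: { $gt: 10 } } }, { $group: { _id: null, n: { $sum: 1 } } }],
    eq: [{ $match: { value: 10 } }, { $group: { _id: null, n: { $sum: 1 } } }],
  },
};
// db.collection.aggregate([facetStage]);
```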

Thanks

I need help along similar lines.

I have documents with a field warrantyEnd (a date), and I need counts for each of these groups:
1: expired: 3 (criteria: currentTime > warrantyEnd)
2: expiringSoon: 2 (criteria: currentTime < warrantyEnd and currentTime + 60 days > warrantyEnd)
3: underWarranty: 3 (criteria: currentTime + 60 days < warrantyEnd)

The above options given by Pavel should work fine; you have your criteria from your latest post, so try them out either with calculated fields or with facets that filter on each facet.

You could use a $switch statement, something like this:

db.collection.aggregate([
  {
    $addFields: {
      CheckField: {
        $switch: {
          branches: [
            { case: { $eq: ["$key", 1] }, then: "IsOne" },
            { case: { $eq: ["$key", 2] }, then: "IsTwo" },
            { case: { $eq: ["$key", 3] }, then: "IsThree" }
          ],
          default: "Other"
        }
      }
    }
  },
  {
    $group: {
      _id: "$CheckField",
      Total: { $sum: 1 }
    }
  }
])

Keep in mind performance: with a lot of data you may want to filter before you run the other stages. Given you're grouping on a calculated field, an index will not be used.

Same with facets, take note of how an index could be used:

https://www.mongodb.com/docs/manual/reference/operator/aggregation/facet/#:~:text=Index%20Use,first%20stage%20in%20the%20pipeline.

I.e. if you're not filtering, it'll trigger a collection scan (COLLSCAN).

If you need it to run fast and hit an index, it may be an idea to just make sure you have an index on warrantyEnd and make 3 separate calls so that the index can be fully used.
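A sketch of that three-call approach (the collection name `devices` is hypothetical; the 60-day window comes from the criteria above):

```javascript
// Three separate counts, each a plain range filter on warrantyEnd, so that an
// index on { warrantyEnd: 1 } can serve every query without a collection scan.
const now = new Date();
const soonCutoff = new Date(now.getTime() + 60 * 24 * 60 * 60 * 1000); // now + 60 days

const expired = { warrantyEnd: { $lt: now } };
const expiringSoon = { warrantyEnd: { $gte: now, $lt: soonCutoff } };
const underWarranty = { warrantyEnd: { $gte: soonCutoff } };

// db.devices.countDocuments(expired);
// db.devices.countDocuments(expiringSoon);
// db.devices.countDocuments(underWarranty);
```

The three filters partition the documents, so the counts add up to the total of documents that have a warrantyEnd.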
