Aggregation slow without hint, but winning plan same as with hint

Waldemar_Dick · January 25, 2023, 7:47am

Hi,
I have a very slow aggregation (240 sec), but an index exists, which should match the query perfectly.
If I add a hint to the aggregation with the index name it is blazing fast (0.2 sec) on 120M records.

Now, if I compare the explain output between the aggregation with and without hint, they look the same.
Both use a IXSCAN on the same index.

What am I missing?

MongoDB 5.0.9

Query

db.getCollection('events').aggregate([
{ $match: { 'action':'signed',
            'quality':{ $in: [ 'QES', 'AES', 'SES' ] },
            'timestamp':{'$gte': ISODate("2023-01-01T00:00:00Z"), '$lt': ISODate("2023-02-01T00:00:00Z")},
            'paid_by':"2fa321ccaeab411b95066bd7b6f4588d"}},
{ $project: { _id: 0, 'quality':1}},
{ $group : { _id: '$quality', "count":{$sum:1}}}
])

Index:

db.getCollection('events').createIndex(
{ 
    "action" : 1,
    "quality" : 1,
    "timestamp" : 1,
    "paid_by" : 1,
    "user.business_id" : 1,
    "user.username" : 1,
    "legislation" : 1,
    "tsp_type" : 1
},
   { 
       name:"signed_quality_timestamp_paidby_legislation_tsptype",
       partialFilterExpression: { 
           'action': 'signed'
       }
   }
)

steevej · January 26, 2023, 1:10pm

If your usual use-case is the query shared where you have equality match on paid_by and range match on timestamp, your index would be better with paid_by before timestamp.

I don’t think your $project helps but it may negatively impact.

Gilcemir_Filho · December 26, 2023, 1:55pm

Hi!
I can help you with some insights that I took from your query and your index.
The first think I noticed is that your index uses a lot of fields. This can be a problem because the indexes are documents just like the documents from your collections. So if your index is huge, this will increase the overall cost of the operations in your collections.
I would recommend you to create an index with less fields.

The second thing that I noticed is that you are not using the ESR rule for your index, which says that you should place first the equality, then sort and lastly the range field.
For your use case, action and paid_by are equality operators, timestamp and quality are range operators and you don’t have any sort operation, so for this case it would not be necessary.

Another topic that I would like to mention is that you should try to always use the more granular fields first, for the equality operators. It means that, in your cause, you should place paid_by after action, because I imagine that paid_by can have much more different values than action.

And since you only need the _id field and the quality field, I would recommend you inserting this field in the last position of your index, so your query will not need to fetch the document from disk, because the index will have everything that is needed.

I would recommend this index for this specific query:

db.getCollection('events').createIndex({
    'paid_by': 1,
    'action': 1,
    'timestamp': 1,
    'quality': 1,
    '_id': 1
})

And if you really thinks that the partialfilterexpression is necessary, you can add it, but I dont think it is the case.