Best index on aggregation with multiple match conditions

I’m trying to understand how the aggregation pipeline optimization works and what is the best index (or maybe more than one) to satisfy the following aggregation that have some optional stage with $match, based on user filter from UI, so it’s like to have 4 four different queries to satisfy.

[{
    "$match": {
      "__STATE__": { "$eq": "PUBLIC" },
      "countryCode": {
        "$exists": true,
        "$ne": null
      }
    }
  },
{
    "$project": {
      "countryCode": 1,
      "createdAt": 1,
      "success": 1
    }
  },
//optional
{
    "$match": {
      "success" : true
      }
 },
//optional
{
    "$match": {
     {
          "$expr": {
            "$and": [{
              "$lte": [
                "$createdAt",
                {
                  "$dateFromString": {
                    "dateString": "#createdAtTo#"
                  }
                }
              ]
            }, {
              "$gte": [
                "$createdAt",
                {
                  "$dateFromString": {
                    "dateString": "#createdAtFrom#"
                  }
                }
              ]
            }]
      }
 },
]

I’ve tried with some compound indexes and testing them with explain but I’m not getting good results also on basic case (only first $match), maybe I have to rewrite some lines?

Hi @Francesco_Fiorentino,

The way to optimize aggregation pipelines is to try and minimize the amount of stags while pushing as much filtering as possible to the first stage. So if your optional stages can be added to the first stage this will be really helpful.

In general indexing should have the order of Equality Range Sort when it comes to compound index placements.

In your case I suggest to do an index on:

{ "__STATE__" : 1 ,  "countryCode": 1 , "success": 1 , "createdAt": 1}

And changed the projectiion to:

{
    "$project": {
    "_id" : 0,
      "countryCode": 1,
      "createdAt": 1,
      "success": 1
    }
  }

Read this: Performance Best Practices: Indexing | MongoDB Blog

Thanks
Pavel

1 Like

The courses M121 and M201 from https://university.mongodb.com are also very good resources in this regard.

You can run explain and see how the pipeline gets transformed. Two things you can see from that:

  1. all the $match stages get coalesced into one
  2. project gets factored out and set to the query subsystem (so it doesn’t matter where it is in this pipeline though I recommend putting it last always, regardless).

What we see is that you are using $expr for some reason and it’s NOT as efficient as regular match expressions. If the format of createdAtFrom date string is “normal” then it can just be passed to ISODate() constructor as right hand side of the comparison.

Now, best index will always start with {__STATE__:1, countryCode:1, ... } since it seems like those are always filtered on, but the order of the other three fields depend on which filtering is more likely (and equality being ahead of range comparisons).

Asya
P.S. you don’t need "$exists": true, since that’s a strict subset of "$ne": null for countryCode in your query.

Thanks all for your suggests and sorry for my late response. I have already seen the suggested courses and tried to have more confidence with explain.
There are a couple of things not clear to me with this case:

  • I have simplified the query focusing only on first match and projection, but in any case the index starting with {__STATE__:1, countryCode:1, ... } seems not to be used. From explain I see as indexName of IXSCAN stage the index {__STATE__:1, createdAt:-1} and I’ m not catching the reason.
  • In any case the query seems very slow (>7s) for our needs, could it be related to number of documents (2 M, size 2 GB; note that there is also a final group by countryCode stage)? If it’s not possible to optimize it with indexes this could means that aggregation is not the best choice in this case?

As always, please run explain("executionStats") on the full aggregation and provide the output here - without seeing what the time is being spent on we would be guessing where the improvements could be best made.

Asya
P.S. if you are on 4.4 or later then full explain will show how much time is being spent in each stage of aggregation.

Following the entire explain output directly on production environment:

{
"stages" : [ 
    {
        "$cursor" : {
            "queryPlanner" : {
                "plannerVersion" : 1,
                "namespace" : "labid.outcomes",
                "indexFilterSet" : false,
                "parsedQuery" : {
                    "$and" : [ 
                        {
                            "__STATE__" : {
                                "$eq" : "PUBLIC"
                            }
                        }, 
                        {
                            "countryCode" : {
                                "$not" : {
                                    "$eq" : null
                                }
                            }
                        }
                    ]
                },
                "queryHash" : "850416C8",
                "planCacheKey" : "58935625",
                "winningPlan" : {
                    "stage" : "PROJECTION_SIMPLE",
                    "transformBy" : {
                        "success" : true,
                        "createdAt" : true,
                        "countryCode" : true,
                        "_id" : false
                    },
                    "inputStage" : {
                        "stage" : "FETCH",
                        "filter" : {
                            "countryCode" : {
                                "$not" : {
                                    "$eq" : null
                                }
                            }
                        },
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                "__STATE__" : 1,
                                "createdAt" : -1
                            },
                            "indexName" : "state_createdAt",
                            "isMultiKey" : false,
                            "multiKeyPaths" : {
                                "__STATE__" : [],
                                "createdAt" : []
                            },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "__STATE__" : [ 
                                    "[\"PUBLIC\", \"PUBLIC\"]"
                                ],
                                "createdAt" : [ 
                                    "[MaxKey, MinKey]"
                                ]
                            }
                        }
                    }
                },
                "rejectedPlans" : [ 
                    {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                            "success" : true,
                            "createdAt" : true,
                            "countryCode" : true,
                            "_id" : false
                        },
                        "inputStage" : {
                            "stage" : "FETCH",
                            "filter" : {
                                "countryCode" : {
                                    "$not" : {
                                        "$eq" : null
                                    }
                                }
                            },
                            "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                    "__STATE__" : 1,
                                    "success" : 1
                                },
                                "indexName" : "state_success",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                    "__STATE__" : [],
                                    "success" : []
                                },
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 2,
                                "direction" : "forward",
                                "indexBounds" : {
                                    "__STATE__" : [ 
                                        "[\"PUBLIC\", \"PUBLIC\"]"
                                    ],
                                    "success" : [ 
                                        "[MinKey, MaxKey]"
                                    ]
                                }
                            }
                        }
                    }, 
                    {
                        "stage" : "PROJECTION_SIMPLE",
                        "transformBy" : {
                            "success" : true,
                            "createdAt" : true,
                            "countryCode" : true,
                            "_id" : false
                        },
                        "inputStage" : {
                            "stage" : "FETCH",
                            "inputStage" : {
                                "stage" : "IXSCAN",
                                "keyPattern" : {
                                    "__STATE__" : 1,
                                    "countryCode" : 1
                                },
                                "indexName" : "state_countryCode",
                                "isMultiKey" : false,
                                "multiKeyPaths" : {
                                    "__STATE__" : [],
                                    "countryCode" : []
                                },
                                "isUnique" : false,
                                "isSparse" : false,
                                "isPartial" : false,
                                "indexVersion" : 2,
                                "direction" : "forward",
                                "indexBounds" : {
                                    "__STATE__" : [ 
                                        "[\"PUBLIC\", \"PUBLIC\"]"
                                    ],
                                    "countryCode" : [ 
                                        "[MinKey, undefined)", 
                                        "(null, MaxKey]"
                                    ]
                                }
                            }
                        }
                    }
                ]
            },
            "executionStats" : {
                "executionSuccess" : true,
                "nReturned" : 2188033,
                "executionTimeMillis" : 8133,
                "totalKeysExamined" : 2189018,
                "totalDocsExamined" : 2189018,
                "executionStages" : {
                    "stage" : "PROJECTION_SIMPLE",
                    "nReturned" : 2188033,
                    "executionTimeMillisEstimate" : 1909,
                    "works" : 2189019,
                    "advanced" : 2188033,
                    "needTime" : 985,
                    "needYield" : 0,
                    "saveState" : 2279,
                    "restoreState" : 2279,
                    "isEOF" : 1,
                    "transformBy" : {
                        "success" : true,
                        "createdAt" : true,
                        "countryCode" : true,
                        "_id" : false
                    },
                    "inputStage" : {
                        "stage" : "FETCH",
                        "filter" : {
                            "countryCode" : {
                                "$not" : {
                                    "$eq" : null
                                }
                            }
                        },
                        "nReturned" : 2188033,
                        "executionTimeMillisEstimate" : 1437,
                        "works" : 2189019,
                        "advanced" : 2188033,
                        "needTime" : 985,
                        "needYield" : 0,
                        "saveState" : 2279,
                        "restoreState" : 2279,
                        "isEOF" : 1,
                        "docsExamined" : 2189018,
                        "alreadyHasObj" : 0,
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "nReturned" : 2189018,
                            "executionTimeMillisEstimate" : 413,
                            "works" : 2189019,
                            "advanced" : 2189018,
                            "needTime" : 0,
                            "needYield" : 0,
                            "saveState" : 2279,
                            "restoreState" : 2279,
                            "isEOF" : 1,
                            "keyPattern" : {
                                "__STATE__" : 1,
                                "createdAt" : -1
                            },
                            "indexName" : "state_createdAt",
                            "isMultiKey" : false,
                            "multiKeyPaths" : {
                                "__STATE__" : [],
                                "createdAt" : []
                            },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "__STATE__" : [ 
                                    "[\"PUBLIC\", \"PUBLIC\"]"
                                ],
                                "createdAt" : [ 
                                    "[MaxKey, MinKey]"
                                ]
                            },
                            "keysExamined" : 2189018,
                            "seeks" : 1,
                            "dupsTested" : 0,
                            "dupsDropped" : 0
                        }
                    }
                }
            }
        },
        "nReturned" : NumberLong(2188033),
        "executionTimeMillisEstimate" : NumberLong(7433)
    }, 
    {
        "$group" : {
            "_id" : "$countryCode",
            "count" : {
                "$sum" : {
                    "$const" : 1.0
                }
            }
        },
        "nReturned" : NumberLong(203),
        "executionTimeMillisEstimate" : NumberLong(8120)
    }
],
"serverInfo" : {
    "host" : "atlas-iddbm2-shard-00-01.ake5m.gcp.mongodb.net",
    "port" : 27017,
    "version" : "4.4.9",
    "gitVersion" : "b4048e19814bfebac717cf5a880076aa69aba481"
},
"ok" : 1.0,
"$clusterTime" : {
    "clusterTime" : Timestamp(1632748186, 1),
    "signature" : {
        "hash" : { "$binary" : "er1V/UEDRoO+EvH/KZ5Nh2jjYWA=", "$type" : "00" },
        "keyId" : NumberLong(6977486985841606664)
    }
},
"operationTime" : Timestamp(1632748186, 1)

}

This query is aggregating over two million documents - even using an efficient index processing that many documents is going to take time. However here the best index isn’t being used. It doesn’t look like you have an index on the two fields you are querying on (STATE and countryCode).

Asya

That index should be there, below the output of getIndexes() but it seems to be mentioned also on rejectedPlans: I didn’t understand why it is not the winning one.

[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_"
    },
    {
        "v" : 2,
        "key" : {
            "__STATE__" : 1,
            "createdAt" : -1
        },
        "name" : "state_createdAt",
        "background" : false
    },
    {
        "v" : 2,
        "key" : {
            "__STATE__" : 1,
            "success" : 1
        },
        "name" : "state_success",
        "background" : true
    },
    {
        "v" : 2,
        "key" : {
            "tagUid" : 1
        },
        "name" : "tagUid",
        "background" : true
    },
    {
        "v" : 2,
        "key" : {
            "__STATE__" : 1,
            "countryCode" : 1
        },
        "name" : "state_countryCode"
    }
]

This means, as I supposed previously, that on demand aggregation is not to be used in this case? What is the best approach to use? It is ok to set periodic aggregation with output on another collection or there is some approach more effective?

The reason a different index wouldn’t be used is it’s not going to be much more selective - the country code matches in all but one thousand documents, indexes are most helpful when they are selective (i.e. narrow down the number of documents that match).

That’s actually the best approach assuming you know most/many of the aggregations needed. Running aggregations periodically with output going into a “summary” collection is a long accepted way to reduce query time for most popular complex queries.

Asya

1 Like

One thing that occurred to me is that you could simplify your pipeline a little to only $match on __STATE__ and then group on countryCode and then filter out _id being null. Filtering out records where countryCode is null or missing may be taking a lot more time/effort before $group (with minimal reduction in total records processed) when it would be very fast after…

Asya

db.alerts.aggregate([ {
$addFields: {
timestamp: {
$toDate: “$timestamp”
}
}
},
{
$match: {
timestamp: {
$gte: new Date(Date.now() - 24* 60 * 60 * 1000),
$lte: new Date()
},
}
},{
$facet: {
Total_log: [

        { $count: "Total_log" }
      ] } }] ,{ allowDiskUse: true }).toArray();
in this query i want to fetch data last 24 hour but it getting huge time to fetch and in my collection data approx 300 million so i use index also but it taking so much time

The $addFields needs to stream in memory all documents to convert them .

As the match happens after it cannot use an index (data is already in agg memory.

You need to filter on timestamp or have the documents with date values and not timestamps.

Thanks
Pavel

Thanks for confirming what he already knows since June:
https://www.mongodb.com/community/forums/t/queries-with-large-volumes-of-returned-data/212570/6?u=steevej