Performance drop when adding an OR in a Match step

Hi, we had simple Match…Group aggregation pipeline which has strange performance characteristics and thought it may be a bug/poor implementation that I wanted to flag.

Basically we have a collection of 300k documents containing (not only) the following fields:

    _id : <string>
    metadata : {
           event_type : <string> - one of published, deleted, updated. Vast majority (95%+) is published
           ... other fields ...
    }
   series_id : {
       id : <string>
       type : <string>
   }
    ... other fields

We wanted to run a query to return all unique series_id.id, excluding event_type = deleted. We have indexes on both series_id.id and event_type

A simple Group(series_id.id) is fast (<1s)
The obvious Match : ne(deleted) → Group (series_id.id) is very slow (12s)
Running Match: or([“published”,“updated”]) → Group is also very slow

However running Match : “published” → Group (series_id.id) is fast, as is running this with “updated”.

I can’t see an obvious reason why adding the OR in the match on an indexed field should be over 12x slower than running the two queries independently (and merging the results on the client). Is this expected behaviour?

Please publish sample documents, the exact pipeline and all the indexes (with getIndexes()).

If you look at the $ne documentation in the last paragraph you will read:

The inequality operator $ne is not very selective since it often matches a large portion of the index. As a result, in many cases, a $ne query with an index may perform no better than a $ne query that must scan all documents in a collection. See also Query Selectivity.

A partial index with your $ne:deleted might help.

Can’t save as a file as I’m too new, but embedded example of 3 docs are below.
I found that bit of the doc pages from a StackOverflow search, and I’m OK with that explaining the $ne result, but it doesn’t really explain why performing an $or query should be so slow.

The $or query:

db.getCollection('petchem_price_assessments_2').aggregate([{ "$match" : { "metadata.event_type" : { "$in" : ["published", "updated"] } } }, { "$group" : { "_id" : "$series_id.id" } }]).toArray()

The indices (I also tried adding a compond index on event_type_1_series_id.id_1 (and the other way around) with no impact)

[
    {
        "v" : 2,
        "key" : {
            "_id" : 1
        },
        "name" : "_id_",
        "ns" : "pricing-local.petchem_price_assessments_2"
    },
    {
        "v" : 2,
        "key" : {
            "metadata.event_type" : 1
        },
        "name" : "metadata.event_type_1",
        "ns" : "pricing-local.petchem_price_assessments_2"
    },
    {
        "v" : 2,
        "key" : {
            "series_id.id" : 1,
            "created_for" : 1
        },
        "name" : "series_id.id_1_created_for_1",
        "ns" : "pricing-local.petchem_price_assessments_2"
    }
]

Example data:

/* 48 */
{
    "_id" : "energy-darwin_10-pricehistory-20200807000000",
    "metadata" : {
        "event_type" : "published",
        "partition_key" : "energy-darwin_10-pricehistory-20200807000000",
        "correlation_id" : "energy-darwin_10-pricehistory-20200807000000",
        "id" : "energy-darwin_10-pricehistory-20200807000000",
        "source" : "platform_cms",
        "type" : "icis.petchem.prices.price_assessments",
        "event_time" : "1634589908000",
        "released_on" : "1596801900000"
    },
    "base_type" : "series-item",
    "released_on" : NumberLong(1596801900000),
    "created_on" : NumberLong(1596798329000),
    "descriptor_id" : "price-range",
    "domain_id" : "energy-darwin",
    "series_id" : {
        "id" : "energy-darwin_10",
        "type" : "icis.petchem.series.price_assessments_specifications"
    },
    "created_for" : NumberLong(1596758400000),
    "contract_period" : [ 
        {
            "lang" : "en",
            "value" : ""
        }, 
        {
            "lang" : "zh",
            "value" : ""
        }
    ],
    "period_start_date" : NumberLong(0),
    "period_end_date" : NumberLong(0),
    "factory_price_range" : [],
    "price_range" : [ 
        {
            "assessment_low" : 2880.0,
            "assessment_mid" : 2905.0,
            "assessment_high" : 2930.0,
            "assessment_low_delta" : 50.0,
            "assessment_high_delta" : 50.0,
            "market_time" : {
                "id" : "",
                "type" : "icis.petchem.reference_data.market_time"
            },
            "iosco_data_used" : "",
            "osp" : null,
            "delta_type" : {
                "id" : "delta-type_regular",
                "type" : "icis.petchem.reference_data.delta_type"
            },
            "versus_dated" : null
        }
    ],
    "single_price" : []
}

/* 49 */
{
    "_id" : "energy-darwin_10-pricehistory-20200810000000",
    "metadata" : {
        "event_type" : "published",
        "partition_key" : "energy-darwin_10-pricehistory-20200810000000",
        "correlation_id" : "energy-darwin_10-pricehistory-20200810000000",
        "id" : "energy-darwin_10-pricehistory-20200810000000",
        "source" : "platform_cms",
        "type" : "icis.petchem.prices.price_assessments",
        "event_time" : "1634589984000",
        "released_on" : "1597061100000"
    },
    "base_type" : "series-item",
    "released_on" : NumberLong(1597061100000),
    "created_on" : NumberLong(1597056012000),
    "descriptor_id" : "price-range",
    "domain_id" : "energy-darwin",
    "series_id" : {
        "id" : "energy-darwin_10",
        "type" : "icis.petchem.series.price_assessments_specifications"
    },
    "created_for" : NumberLong(1597017600000),
    "contract_period" : [ 
        {
            "lang" : "en",
            "value" : ""
        }, 
        {
            "lang" : "zh",
            "value" : ""
        }
    ],
    "period_start_date" : NumberLong(0),
    "period_end_date" : NumberLong(0),
    "factory_price_range" : [],
    "price_range" : [ 
        {
            "assessment_low" : 2980.0,
            "assessment_mid" : 3005.0,
            "assessment_high" : 3030.0,
            "assessment_low_delta" : 100.0,
            "assessment_high_delta" : 100.0,
            "market_time" : {
                "id" : "",
                "type" : "icis.petchem.reference_data.market_time"
            },
            "iosco_data_used" : "",
            "osp" : null,
            "delta_type" : {
                "id" : "delta-type_regular",
                "type" : "icis.petchem.reference_data.delta_type"
            },
            "versus_dated" : null
        }
    ],
    "single_price" : []
}

/* 50 */
{
    "_id" : "energy-darwin_10-pricehistory-20200811000000",
    "metadata" : {
        "event_type" : "published",
        "partition_key" : "energy-darwin_10-pricehistory-20200811000000",
        "correlation_id" : "energy-darwin_10-pricehistory-20200811000000",
        "id" : "energy-darwin_10-pricehistory-20200811000000",
        "source" : "platform_cms",
        "type" : "icis.petchem.prices.price_assessments",
        "event_time" : "1634590028000",
        "released_on" : "1597147500000"
    },
    "base_type" : "series-item",
    "released_on" : NumberLong(1597147500000),
    "created_on" : NumberLong(1597143737000),
    "descriptor_id" : "price-range",
    "domain_id" : "energy-darwin",
    "series_id" : {
        "id" : "energy-darwin_10",
        "type" : "icis.petchem.series.price_assessments_specifications"
    },
    "created_for" : NumberLong(1597104000000),
    "contract_period" : [ 
        {
            "lang" : "en",
            "value" : ""
        }, 
        {
            "lang" : "zh",
            "value" : ""
        }
    ],
    "period_start_date" : NumberLong(0),
    "period_end_date" : NumberLong(0),
    "factory_price_range" : [],
    "price_range" : [ 
        {
            "assessment_low" : 2980.0,
            "assessment_mid" : 3005.0,
            "assessment_high" : 3030.0,
            "assessment_low_delta" : 0.0,
            "assessment_high_delta" : 0.0,
            "market_time" : {
                "id" : "",
                "type" : "icis.petchem.reference_data.market_time"
            },
            "iosco_data_used" : "",
            "osp" : null,
            "delta_type" : {
                "id" : "delta-type_regular",
                "type" : "icis.petchem.reference_data.delta_type"
            },
            "versus_dated" : null
        }
    ],
    "single_price" : []
}

Please share the explain plans for …event_type:published, …event_type:updated and the one for the $in variation.

With compound index event_type_1_series_id.id_1, a $project of series_id.id might help.

What I suspect is that the working set does not fit in RAM so a lot more of disk I/O occurs with $in.

Sure:
The distribution of event_types will be vast majority (95%+) “published” - if there were memory issues I’d expect match event_type:“published” to also be slow, but it’s a fast-running query

    "stages" : [ 
        {
            "$cursor" : {
                "query" : {
                    "metadata.event_type" : {
                        "$in" : [ 
                            "published", 
                            "updated"
                        ]
                    }
                },
                "fields" : {
                    "series_id.id" : 1,
                    "_id" : 0
                },
                "queryPlanner" : {
                    "plannerVersion" : 1,
                    "namespace" : "pricing-systest.petchem_price_assessments_2",
                    "indexFilterSet" : false,
                    "parsedQuery" : {
                        "metadata.event_type" : {
                            "$in" : [ 
                                "published", 
                                "updated"
                            ]
                        }
                    },
                    "queryHash" : "D2963EBC",
                    "planCacheKey" : "AB3E9A31",
                    "winningPlan" : {
                        "stage" : "PROJECTION_DEFAULT",
                        "transformBy" : {
                            "series_id.id" : 1,
                            "_id" : 0
                        },
                        "inputStage" : {
                            "stage" : "IXSCAN",
                            "keyPattern" : {
                                "metadata.event_type" : 1,
                                "series_id.id" : 1,
                                "_id" : 1
                            },
                            "indexName" : "test1",
                            "isMultiKey" : false,
                            "multiKeyPaths" : {
                                "metadata.event_type" : [],
                                "series_id.id" : [],
                                "_id" : []
                            },
                            "isUnique" : false,
                            "isSparse" : false,
                            "isPartial" : false,
                            "indexVersion" : 2,
                            "direction" : "forward",
                            "indexBounds" : {
                                "metadata.event_type" : [ 
                                    "[\"published\", \"published\"]", 
                                    "[\"updated\", \"updated\"]"
                                ],
                                "series_id.id" : [ 
                                    "[MinKey, MaxKey]"
                                ],
                                "_id" : [ 
                                    "[MinKey, MaxKey]"
                                ]
                            }
                        }
                    },
                    "rejectedPlans" : []
                }
            }
        }, 
        {
            "$group" : {
                "_id" : "$series_id.id"
            }
        }
    ],

Index wise I see nothing except that having _id is not useful.

_id not useful because it is projected out anyway as seen in

Not necessarily, they could all fit but the little extra from the other event_type might cause a cascade of disk I/O.

Please share collections stats using https://www.mongodb.com/docs/manual/reference/method/db.collection.stats/ and the characteristics of your installation, RAM, disk. Is client running on same machine as the server? Any other database or collection of any significance on the same machine?