I have a collection with 500k documents with the following structure:
{
    "_id" : ObjectId("5f2d30b0c7cc16c0da84a57d"),
    "RecipientId" : "6a28d20f-4741-4c14-a055-2eb2593dcf13",
    ...
    "Actions" : [
        {
            "CampaignId" : "7fa216da-db22-44a9-9ea3-c987c4152ba1",
            "ActionDatetime" : ISODate("1998-01-13T00:00:00.000Z"),
            "ActionDescription" : "OPEN"
        },
        ...
    ]
}
I need to count the top-level documents whose subdocuments inside the “Actions” array meet certain criteria, and for this I’ve created the following multikey index (taking only the “ActionDatetime” field as an example):
db.getCollection("recipients").createIndex( { "Actions.ActionDatetime": 1 } )
The problem is that when I write the query using $elemMatch, the operation is much slower than when I don’t use the multikey index at all:
db.getCollection("recipients").count({
    "Actions": {
        $elemMatch: { ActionDatetime: { $gt: new Date("1950-08-04") } }
    }
})
The stats for this query:
{
    "executionSuccess" : true,
    "nReturned" : 0,
    "executionTimeMillis" : 13093,
    "totalKeysExamined" : 8706602,
    "totalDocsExamined" : 500000,
    "executionStages" : {
        "stage" : "COUNT",
        "nReturned" : 0,
        "executionTimeMillisEstimate" : 1050,
        "works" : 8706603,
        "advanced" : 0,
        "needTime" : 8706602,
        "needYield" : 0,
        "saveState" : 68020,
        "restoreState" : 68020,
        "isEOF" : 1,
        "nCounted" : 500000,
        "nSkipped" : 0,
        "inputStage" : {
            "stage" : "FETCH",
            "filter" : {
                "Actions" : {
                    "$elemMatch" : {
                        "ActionDatetime" : {
                            "$gt" : ISODate("1950-08-04T00:00:00.000Z")
                        }
                    }
                }
            },
            "nReturned" : 500000,
            "executionTimeMillisEstimate" : 1040,
            "works" : 8706603,
            "advanced" : 500000,
            "needTime" : 8206602,
            "needYield" : 0,
            "saveState" : 68020,
            "restoreState" : 68020,
            "isEOF" : 1,
            "docsExamined" : 500000,
            "alreadyHasObj" : 0,
            "inputStage" : {
                "stage" : "IXSCAN",
                "nReturned" : 500000,
                "executionTimeMillisEstimate" : 266,
                "works" : 8706603,
                "advanced" : 500000,
                "needTime" : 8206602,
                "needYield" : 0,
                "saveState" : 68020,
                "restoreState" : 68020,
                "isEOF" : 1,
                "keyPattern" : {
                    "Actions.ActionDatetime" : 1.0
                },
                "indexName" : "Actions.ActionDatetime_1",
                "isMultiKey" : true,
                "multiKeyPaths" : {
                    "Actions.ActionDatetime" : [
                        "Actions"
                    ]
                },
                "isUnique" : false,
                "isSparse" : false,
                "isPartial" : false,
                "indexVersion" : 2,
                "direction" : "forward",
                "indexBounds" : {
                    "Actions.ActionDatetime" : [
                        "(new Date(-612576000000), new Date(9223372036854775807)]"
                    ]
                },
                "keysExamined" : 8706602,
                "seeks" : 1,
                "dupsTested" : 8706602,
                "dupsDropped" : 8206602
            }
        }
    }
}
This query took 14 seconds to execute, whereas if I remove the index, the COLLSCAN takes 1 second.
I understand that I’d get better performance by not using $elemMatch and filtering on “Actions.ActionDatetime” directly, but in reality I’ll need to filter on more than one field inside the array, so $elemMatch becomes mandatory.
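For illustration, the real query will look more like this (“ActionDescription” here just stands in for one of the extra fields I need; the exact combination varies):

// Both conditions must be satisfied by the SAME element of the
// "Actions" array, which is exactly what $elemMatch guarantees.
db.getCollection("recipients").count({
    "Actions": {
        $elemMatch: {
            ActionDatetime: { $gt: new Date("1950-08-04") },
            ActionDescription: "OPEN"
        }
    }
})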
I suspect that it’s the FETCH phase that is killing the performance. I’ve noticed that when I filter on “Actions.ActionDatetime” directly, MongoDB is able to use a COUNT_SCAN instead of the FETCH, but the performance is still worse than the COLLSCAN (around 4 seconds).
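For reference, this is the direct (non-$elemMatch) version I’m referring to:

// Dot-notation filter: matches if ANY element of "Actions" satisfies
// the predicate, so the count can be answered from the index alone.
db.getCollection("recipients").count({
    "Actions.ActionDatetime": { $gt: new Date("1950-08-04") }
})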
I’d like to know if there’s a better strategy for indexing high-cardinality subdocuments inside an array, or if I’m missing something with my current approach.
As the volume grows, indexing this information will be a necessity and I don’t want to rely on a COLLSCAN.
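One direction I’ve been wondering about is a compound multikey index over the fields I filter on, along these lines (just a sketch of what I have in mind, not something I’ve benchmarked):

// Compound multikey index over two fields of the same "Actions" array,
// intended to give a multi-field $elemMatch tighter index bounds.
db.getCollection("recipients").createIndex(
    { "Actions.ActionDatetime": 1, "Actions.ActionDescription": 1 }
)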