Adding the projected fields to compound index (to create covered query) doesn't improve performance

Keren-Or_Curtis · November 14, 2022, 4:16pm

Hello all,
I have a question about the fields of a compound index, hopefully this great community may enlighten me a bit here

My data can be modelled as

{
 _id: <unique_id>
 name: <some str, under 256 chars, not necessarily unique>
 field_a: <some str, under 256 chars, not necessarily unique> 
 field_b: <some str, under 256 chars, not necessarily unique>
 attributes: <array of strings - each under 32 chars, max size of array typically under 4 elements>
 ... a bunch of other fields
}

I need to search regex in name (not ideal, I know) and return field_a, field_b and _id.
Sometimes, not always, I would want to search a regex in name and also return only records that have attribute "a" in their attributes list.
In both cases I would want to sort by name.
These are the 2 typical query commands -

command: find { find: "COLL", filter: { name: { $regex: "part" }}, sort: { name: 1 }, projection: { field_a: 1, field_b: 1, _id: 1 }

command: find { find: "COLL", filter: { name: { $regex: "part" }, attributes: "a" }, sort: { name: 1 }, projection: { field_a: 1, field_b: 1, _id: 1 }

Originally I used the following a compound index

{
    "name" : 1,
    "field_a" : 1,
    "field_b" : 1,
    "_id" : 1,
    "attributes" : 1
}

which would give me a covered query for both queries, not needing to fetch.
However, this is significantly slower (on both types of queries I have) than querying with an index that doesn’t include "attributes"

{
    "name" : 1,
    "field_a" : 1,
    "field_b" : 1,
    "_id" : 1
}

Difference in times 100 sec vs 6 sec for 5.1M records.

Is it because of how mongoDB indexes arrays? The arrays, again, contain very few elements, so it’s a bit weird for me to see such a huge difference.

Another interesting observation is that indexing just {name: 1} rather than a compound index provides slightly better performance in both queries (around 4 sec, vs 6 sec), even though in this case it needs to fetch field_a, field_b, _id from each record instead of having them covered in the index. Why is that?

for what it’s worth, I still use mongoDB 4.0.10, driver is pymongo 3.7.1

Many thanks!

Jason_Tran · November 16, 2022, 8:31pm

Hi @Keren-Or_Curtis - Welcome to the community

Regarding your regex, you could possibly further optimize it using a “prefix expression”. More details in the $regex index use documentation.

which would give me a covered query for both queries, not needing to fetch.

As noted in the Multikey indexes limitations documentation:

Multikey indexes cannot cover queries over array field(s).

I believe if you run an explain plan against the queries, you will see that the isMultiKey value is true for the index that includes the array field "attributes".

Just to clarify, would be able to post some details regarding the method you used to determine that this is a covered query?

command: find { find: “COLL”, filter: { name: { $regex: “part” }}, sort: { name: 1 }, projection: { field_a: 1, field_b: 1, _id: 1 }

For this particular query, the index below results in a PROJECTION_COVERED (based off my test environment with around 140K documents):

{
    "name" : 1,
    "field_a" : 1,
    "field_b" : 1,
    "_id" : 1
}

Based off the test environment containing documents with the same fields and indexes, a PROJECTION_SIMPLE (i.e. not a covered query) will occur due to the multikey index limitation noted in my reply for the index including the additional array field "attributes".

Another interesting observation is that indexing just {name: 1} rather than a compound index provides slightly better performance in both queries (around 4 sec, vs 6 sec), even though in this case it needs to fetch field_a, field_b, _id from each record instead of having them covered in the index. Why is that?

There could be many explanations for the difference in execution timing, such as it’s possible that one index is in memory and one isn’t, or even perhaps hardware-specific reasons. However, in saying so, could you provide the following regarding the above:

db.collection.getIndexes() output
explain("allPlansExecution") output for each of the queries
db.collection.stats() output

Please note that with my above testing, I did not add the “other fields” to my test documents or query you had mentioned so results may differ. I am also testing on version 5.0.13 of MongoDB.

for what it’s worth, I still use mongoDB 4.0.10, driver is pymongo 3.7.1

Lastly, MongoDB version 4.0 has reached end of life on April 2022. Please test again on a support version as MongoDB is constantly being improved, so doing this test on a supported version would be more representative.

Regards,
Jason

Keren-Or_Curtis · November 17, 2022, 6:59am

Hi, thank you for the reply!

Indeed, I try to utilize it whenever possible, however I’m afraid certain usecases require me to search the entire expression.

But also, on that same page, it says

since attributes is the array field that causes the index to be multikey, at least my first type of query (the one that doesn’t filter by attributes) should have still be covered? (I realized looking more closely at the explain output it indeed isn’t, but wondering why, considering that part of the documentation?)

I will clarify my test a bit - I replaced the index each time, I didn’t add more test indexes. I ran each query at least 3 consecutive times to eliminate the initial longer time of loading the index to memory.
That said, I do understand the time difference is perhaps not that meaningful, but the single-field index surely doesn’t provide a covered query, isn’t it a bit weird?

I am dreadfully aware Due to some legacy restrictions I currently have to work with 4.0, I guarantee you I’m working to get these upgraded. If this in itself is considered an issue impeding the results, I understand (and may return to this topic once upgraded)

I will provide the outputs per test-index. They are very lengthy as you can imagine so I wanted to attach them as files, regretfully new users cannot upload files so I put it on google drive (honestly that’s a lot of output for a comment)

summarizing again,

the queries (both types of queries) take ~100sec with the full index, which is a multikey index, and apparently not covered - perhaps my misunderstanding of the documentation there
both queries take ~6 seconds with the no-attributes-index, despite it needing to fetch attributes for the query that filters according to it
both queries take ~4 seconds with the only-name index, despite it needing to fetch all the projection fields

note I am fine with 4-6 seconds, just aiming to understand this behaviour a bit more

Thanks!

Jason_Tran · November 21, 2022, 11:51pm

Hi Keren - Thanks for providing those requested details.

since attributes is the array field that causes the index to be multikey, at least my first type of query (the one that doesn’t filter by attributes) should have still be covered? (I realized looking more closely at the explain output it indeed isn’t, but wondering why, considering that part of the documentation?)

I did some testing with MongoDB version 4.0 and it seems even if the projection is covered, the stage from the execution stats output will still show as 'PROJECTION' so my apologies there for any confusion caused (My test environment using version 5.0.13 displays the particular stage as PROJECTION_COVERED). However, in saying so, you can possibly try inspecting the totalDocsExamined value for your queries to determine if the query is covered or not. If the totalDocsExamined value is 0, then the query was most likely covered (there is an exception for when the query returns no results). This can be seen with the "no-attributes-index" output for the query that does not contain "attributes":

db.getCollection('COLL').find({name: { $regex: "one" }}, { field_a: 1, field_b: 1, _id: 1 })

"executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 2048,
        "executionTimeMillis" : 6023,
        "totalKeysExamined" : 5118976,
        "totalDocsExamined" : 0

Compared with db.getCollection('COLL').find({name: { $regex: "one" }, attributes: "a"}, { field_a: 1, field_b: 1, _id: 1 }).sort({name: 1}) which isn’t covered

 "executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 2048,
        "executionTimeMillis" : 18527,
        "totalKeysExamined" : 5118976,
        "totalDocsExamined" : 2048

the queries (both types of queries) take ~100sec with the full index, which is a multikey index, and apparently not covered - perhaps my misunderstanding of the documentation there

If we inspect the execution stats output for full index output for both queries, we can see that the server scanned 5.1M index keys and also 5.1M documents. Interestingly enough, this number sounds like the whole collection and if this is the case, my guess is that performing a collection scan may be faster in this particular scenario if the same 5.1M are needed to be scanned (without needing to inspect any index keys).:

"executionStats" : {
        "executionSuccess" : true,
        "nReturned" : 2048,
        "executionTimeMillis" : 98361,
        "totalKeysExamined" : 5118976,
        "totalDocsExamined" : 5118976

both queries take ~6 seconds with the no-attributes-index, despite it needing to fetch attributes for the query that filters according to it
both queries take ~4 seconds with the only-name index, despite it needing to fetch all the projection fields

I do see why this would appear quite strange as inspecting the execution stats you can see the query / index combinations requiring a fetch are slightly faster than the covered query. The query selectivity drastically impacts the performance. For your reference, on a test environment with 1M documents, I compared 2 $regex queries, one with an anchor:

db.collection.find({'name':{$regex:"Est"}},{"name": 1, "field_a": 1, "field_b": 1, "_id": 1}).sort({"name":1}).explain("executionStats"):

executionStats: {
    executionSuccess: true,
    nReturned: 8090,
    executionTimeMillis: 994,
    totalKeysExamined: 1000000,
    totalDocsExamined: 0,
    executionStages: {
      stage: 'PROJECTION_COVERED'

(with anchor)
db.collection.find({'name':{$regex:"^Est"}},{"name": 1, "field_a": 1, "field_b": 1, "_id": 1}).sort({"name":1}).explain("executionStats"):

executionStats: {
    executionSuccess: true,
    nReturned: 6061,
    executionTimeMillis: 12,
    totalKeysExamined: 6062,
    totalDocsExamined: 0,
    executionStages: {
      stage: 'PROJECTION_COVERED'

Note the executionTimeMillis difference between an anchored an non-anchored regex queries. Also the number of totalKeysExamined, where the non-anchored regex query is scanning the whole keyspace: 1000000 scanned vs 8000 returned, an average of 0.008 document returned per index key scanned, and the anchored query is scanning 6062 keys and returns 6061 documents, almost 1:1 ratio, which means that the server does not do unnecessary work.

On an additional test with anchors, I found that a full index was faster than the single field index.

Although I do understand you have stated your particular use case requires the full expression to be searched. If this is a frequent operation, you may wish to consider using Atlas Search (although I presume you are on-prem due to the MongoDB version stated ).

Regards,

Keren-Or_Curtis · November 22, 2022, 3:52pm

Hi Jason, thanks for your input once again! It is very helpful.

It is interesting to note that the original index not only doesn’t cover the query, it also doesn’t utilize the fact "name" is part of the index.
I can deduce it from the number of examined docs & see it in the query plan of the full index -

"winningPlan" : {
            "stage" : "PROJECTION",
            "transformBy" : {
                "field_a" : 1.0,
                "field_b" : 1.0,
                "_id" : 1.0
            },
            "inputStage" : {
                "stage" : "FETCH",
                "filter" : {
                    "name" : {
                        "$regex" : "one"
                    }
                },
                "inputStage" : {
                    "stage" : "IXSCAN",
                    "keyPattern" : {
                        "name" : 1,
                        "field_a" : 1,
                        "field_b" : 1,
                        "_id" : 1,
                        "attributes" : 1
                    },
                    "indexName" : "playIdx",
                    "isMultiKey" : true,
                    "multiKeyPaths" : {
                        "name" : [],
                        "field_a" : [],
                        "field_b" : [],
                        "_id" : [],
                        "attributes" : [ 
                            "attributes"
                        ]
                    },
                    "isUnique" : false,
                    "isSparse" : false,
                    "isPartial" : false,
                    "indexVersion" : 2,
                    "direction" : "forward",
                    "indexBounds" : {
                        "name" : [ 
                            "[\"\", {})", 
                            "[/one/, /one/]"
                        ],
                        "field_a" : [ 
                            "[MinKey, MaxKey]"
                        ],
                        "field_b" : [ 
                            "[MinKey, MaxKey]"
                        ],
                        "_id" : [ 
                            "[MinKey, MaxKey]"
                        ],
                        "attributes" : [ 
                            "[MinKey, MaxKey]"
                        ]
                    }
                }
            }

it is doing an index scan - and then does a FETCH (which fetches the documents of course) and filters according to the regex.
- As we saw, it doesn’t fetch if "attributes" is not part of the index - then, it examines the index keys for the regex.
- This is my main mystery - Why does it not examine the index keys for the regex also in the case "attributes" is part of the index , considering the "name" is part of the index? This almost feels like a bug.
- If you still have a similar test environment, I’d really appreciate seeing the output of the explain for a query that corresponds to db.getCollection('COLL').find({name: { $regex: "one" }}, { field_a: 1, field_b: 1, _id: 1 }), when the index is the full index (including the attributes field). I wonder if it behaves differently in newer mongoDB versions
If the query is an anchored regex no documents are fetched, and the query planner shows it filters according to name in the IDXSCAN stage (as we expect, and naturally much faster).
- I’m really struggling to understand why is the stage of fetching "name" from the document itself needed in the not-anchored query when "name" is part of the index - I’d understand needing to examine all the index-keys for this particular use-case, but it actually goes and examines all the documents as well and it is not doing that if "attributes" is not part of the index, nor if the regex is anchored. Clearly I’m not understanding something about MultiKey indexes here and I’d love to figure out what it is

True, this is also very important to note, thanks.

About the difference between the two other indexes, it looks as follows (numbers are slightly different than before since I changed the attributes so that only half of the DB will have the required attribute “a”)-

Search regex without attributes filtering : 
							| Time	 | IdxKey examined	| Docs examined	| total returned
-----------------------------------------------------------------------------------------
1. Full Idx					| 101.85 |	5118976			| 5118976		| 2048
2. Idx without attributes	| 5.845	 | 	5118976			| 0				| 2048
3. Idx = just name			| 4.175	 | 	5118976			| 2048			| 2048
				
Search regex with attributes filtering:

							| Time	 | IdxKey examined	| Docs examined	| total returned
-----------------------------------------------------------------------------------------
4. Full Idx					| 55.675 |	5118976			| 2559488		| 2048
5. Idx without attributes	| 6.08	 | 	5118976			| 2048			| 2048
6. Idx = just name			| 4.34	 | 	5118976			| 2048			| 2048

I think the interesting difference is between 2 & 3. [2] is a covered query, yet it takes slightly more time than [3]. Perhaps the index-size itself is a possible reason for slowdown? The index of 2 & 5 is ~8 times heavier than the simpler index of 3 & 6, so I may answer myself here by saying that despite the fact it may give a more optimized plan, in the practical sense it’s still not the best performance

Thanks again! I appreciate all your input, while I will certainly look into Atlas Search (for a hopeful future in which I won’t need to maintain an on-prem DB anymore, as you correctly guessed), I am more interested in more thoroughly understanding the behaviors we’re experiencing, and then decide if/how I’d change the indexes.

Best regards