Index selection - perceived wrong choice

Hi all,

I have a fairly straightforward aggregation: a $match on fieldA, together with $gte: 0 on fieldB (note that fieldB cannot be negative), followed by a $sort on fieldB.

The indexes are AB (a compound index on fieldA, fieldB) and B (fieldB only). Field A has high cardinality (almost unique) and there are about 1 million documents in the collection. For some reason Mongo chooses the index on fieldB and does a complete index scan, with a fetch of each document to check the value of fieldA.
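In sketch form (placeholder names, since I have anonymised the real fields here), the query and indexes look roughly like this:

```js
// Sketch only: fieldA / fieldB stand in for the real field names.
db.coll.createIndex({ fieldA: 1, fieldB: 1 });  // the "AB" compound index
db.coll.createIndex({ fieldB: 1 });             // the "B" single-field index

db.coll.aggregate([
  { $match: { fieldA: "some value", fieldB: { $gte: 0 } } },
  { $sort: { fieldB: 1 } }
]);
```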

Any ideas why this may be? It does work fine after removing the index on fieldB, and the performance is literally thousands of times better.

Is Mongo aware of stats such as volumes, cardinality etc., as would be the case (for example) in DB2?

Anything welcome, even flaming :smile:

Welcome to the community @John_Cark :grinning: !

You haven’t given us very precise information to understand where the problem may come from. Can you share your aggregation pipeline with us? And all of your indexes too, please.

As a temporary solution (maybe we won’t need to look any further), you can use hint to force your query to use a specific index.
It’s not flaming, but it works :+1:
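For example, something like this (a sketch, reusing the placeholder field names from the question; the aggregate command accepts a hint option since MongoDB 3.6):

```js
// Force the planner to use the compound index instead of the fieldB index.
db.coll.aggregate(
  [
    { $match: { fieldA: "some value", fieldB: { $gte: 0 } } },
    { $sort: { fieldB: 1 } }
  ],
  { hint: { fieldA: 1, fieldB: 1 } }  // or the index name, e.g. "fieldA_1_fieldB_1"
);
```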

Hi Gaetan,
Thanks very much for coming back to me on this, and I hope that my reply is not too verbose.
At the bottom is the log text for the COMMAND in question. While the issue has now been solved (see below), we went through some stages:

  1. There were no indexes (except _id) on the collection. When run, the command (obviously) caused a COLLSCAN which took around 1,100 ms.
  2. The user then added two indexes, one on rootOperation.path and one on timestamp. This is what resulted in the log entry below. Mongo obviously chose the index on timestamp (probably due to the sort) which caused a whole index scan followed by a fetch for each document to check the value of rootOperation.path.
  3. We then added timestamp to the rootOperation.path index; however, this made no difference and Mongo continued to choose the timestamp index.
  4. Finally, we deleted the timestamp index (so the only index remaining was rootOperation.path_timestamp) and this fixed the issue to the extent that the COMMAND is no longer reported (it now takes less than 100 ms). The index changes are sketched below.
In my opinion, given that the cardinality of rootOperation.path is very high (in this case nreturned was 0), as soon as there is an index with it as the leading key, that index should be chosen. What I have also learned (hopefully always learning) is that Mongo does not use statistics in the same manner as DB2 etc.

I take your point about hints; however, I can only make recommendations there and, to be honest, I am a bit of a DBMS purist when it comes to hints, so I do not use them on Oracle, for example.

In any case, any ideas are highly appreciated.
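For reference, the index changes in steps 2 to 4 above correspond roughly to the following (a sketch; field and collection names are taken from the log below):

```js
// Step 2: two single-field indexes; the planner preferred { timestamp: 1 } (see the log).
db.lineages_v4.createIndex({ "rootOperation.path": 1 });
db.lineages_v4.createIndex({ timestamp: 1 });

// Step 3: "adding timestamp to the rootOperation.path index" means replacing it with a
// compound index (an existing index cannot be modified in place); the planner still
// chose { timestamp: 1 }.
db.lineages_v4.dropIndex({ "rootOperation.path": 1 });
db.lineages_v4.createIndex({ "rootOperation.path": 1, timestamp: 1 });

// Step 4: deleting the timestamp index leaves only the compound index, which fixed the issue.
db.lineages_v4.dropIndex({ timestamp: 1 });
```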

Regards

John

2020-04-28T15:22:31.487+0000 I COMMAND [conn4450390] command spline_galactic.lineages_v4 command: aggregate { aggregate: "lineages_v4", pipeline: [ { $match: { rootOperation.path: "hdfs://########/bigdatahdfs/########/raw/flex/UG/FN_UDF_DTL/2020/04/25/v1/_INFO", timestamp: { $gte: 0 } } }, { $sort: { timestamp: 1 } } ], cursor: {}, $db: "spline_galactic", $clusterTime: { clusterTime: Timestamp(1588087347, 2), signature: { hash: BinData(0, C952C5C5FDE91EB8C511A8CDDEF672DD983E5), keyId: 6784207444567392257 } }, lsid: { id: UUID("12fb896b-abda-4462-a00b-41e32d4100b7") } } planSummary: IXSCAN { timestamp: 1 } keysExamined:957557 docsExamined:957557 cursorExhausted:1 numYields:7483 nreturned:0 reslen:241 locks:{ Global: { acquireCount: { r: 7485 } }, Database: { acquireCount: { r: 7485 } }, Collection: { acquireCount: { r: 7485 } } } storage:{ data: { bytesRead: 81822514, timeReadingMicros: 142455 } } protocol:op_msg 3046ms

For some reason Mongo chooses the index on fieldB. Any ideas why this may be?

This might be the reason: Sort and Non-prefix Subset of an Index

An index can support sort operations on a non-prefix subset of the index key pattern. To do so, the query must include equality conditions on all the prefix keys that precede the sort keys.

What is an index prefix? See Compound Indexes - Prefixes
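A tiny illustration of that rule (placeholder fields a and b):

```js
// Compound index; "b" on its own is a non-prefix subset of the key pattern.
db.test.createIndex({ a: 1, b: 1 });

// Index can provide the sort order: equality condition on the preceding prefix key "a".
db.test.find({ a: 5 }).sort({ b: 1 });

// Index cannot provide the sort order here: the condition on "a" is a range, not an
// equality, so a blocking in-memory sort would be needed if this index were used.
db.test.find({ a: { $gte: 5 } }).sort({ b: 1 });
```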


It does work fine after removing the index on fieldB…

I am not sure about this.

But I just tried the aggregation query on a small dataset with similar data and indexes (both of the indexes). The query optimizer always picked the compound index on both fields (a, b), and used it for both the match and the sort operations.

I am using MongoDB version 4.2.
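Something along these lines can be used to check which plan is selected (a sketch; the winning plan appears under queryPlanner, possibly nested inside the first $cursor stage of the explain output):

```js
// Ask the planner which index it picks for the aggregation.
db.test.explain("executionStats").aggregate([
  { $match: { a: "some value", b: { $gte: 0 } } },
  { $sort: { b: 1 } }
]);
```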

Hi @John_Cark,
It is possible that the planner had not yet re-evaluated the plan for the query. Check the page appropriate for your MongoDB version: https://docs.mongodb.com/manual/core/query-plans/

I once encountered a similar issue, albeit on 3.4. I thought the query should have been re-evaluated to use a different index, but it had not been. Using db.collection.getPlanCache().clear() helped in that situation. A more fine-grained option would have been planCacheClear with the query shape; both variants are sketched below.
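For the record, the two variants look roughly like this (collection name and query shape are placeholders):

```js
// Drop every cached plan for the collection.
db.coll.getPlanCache().clear();

// More targeted: clear only the cached plan(s) for a specific query shape.
db.runCommand({
  planCacheClear: "coll",
  query: { "rootOperation.path": "some path", timestamp: { $gte: 0 } },
  sort: { timestamp: 1 }
});
```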

It may not help in your scenario, but I thought I would mention it.

Thank you Prasad. Indeed there is an equality predicate (rootOperation.path) before the SORT field, which is timestamp. Given the high cardinality of the former field, though, I would have expected it to use the compound key even if it had to do a small sort after fetching.

As a matter of interest though, I had a similar situation today where we needed a new index. The only difference was that there were three fields in the index and the search was with equality, sorting on the third one. In this case it worked fine.
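In sketch form (placeholder field names; equality on the two leading fields, sort on the trailing one):

```js
// Three-field compound index.
db.coll.createIndex({ f1: 1, f2: 1, f3: 1 });

// Equality on the prefix keys f1 and f2, sort on f3: the index can provide the sort order.
db.coll.aggregate([
  { $match: { f1: "x", f2: "y" } },
  { $sort: { f3: 1 } }
]);
```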

BTW using 4.0

Regards

John