Compound Index Not Being Used

I have a MongoDB performance problem that I’m having trouble wrapping my head around. While I have plenty of experience with SQL tuning, Mongo tuning is new to me. A lot of the concepts translate, including the approach to troubleshooting, so hopefully my approach to debugging this makes sense and some of you with more experience than me might know what’s going on.

My application that exposes this data is meant to be a search tool for the end user. It is too early to know which attributes will be searched most, and in what combinations, as I don’t have any users yet. The end user will also have the ability to sort by one attribute per search.

To provide cursor-style pagination, I believe I must also have a secondary sort on “_id” to keep the sort stable. Users have the ability to “load more” in the UI, which will make use of a cursor that provides the sort value.
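For illustration, the “load more” query would look something like this (assuming, for the sake of example, a sort on “field5”; lastField5 and lastId are placeholders for values taken from the last document of the previous page):

// Resume after the last document the client saw, keeping the sort stable.
var lastField5 = 100;                              // placeholder: sort value from the previous page
var lastId = ObjectId("657f1f77bcf86cd799439011"); // placeholder: _id from the previous page
db.myCollection.find({
    $or: [
        { field5: { $gt: lastField5 } },             // strictly past the last sort value
        { field5: lastField5, _id: { $gt: lastId } } // tie-break on _id within equal sort values
    ]
}).sort({ field5: 1, _id: 1 }).limit(100)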

Collection"myCollection" - approximately 500k documents

Fields

Filter Fields
“field1” - optionally present in document, only filter on existence
“field2” - optionally present in document, only filter on existence
“field3” - enum of 5 strings, filter using $in
“field4” - number, only filter using $lte
“field5” - number, filter using range ($lte, $gte)
“field6” - number, filter using range ($lte, $gte)
“field7” - number, filter using range ($lte, $gte)
“field8” - number, only filter using $lte

Sort Fields
“_id” - default sort
“field3” - same as field in filtering, enum of 5 strings
“field5” - same as field in filtering
“field6” - same as field in filtering
“field7” - same as field in filtering
“field9” - number
“field10” - enum of 4 strings

In the future there may be other fields, but the short of it is: lots of fields to filter on and one of potentially several fields to sort on.

My understanding of how to index this effectively is:

  • use a compound index for each combination that is expected to be commonly used in filtering and sorting
  • use single-field indexes or a wildcard index to cover the less common cases (a sketch of the wildcard option follows this list)
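For the wildcard option, I had something like this in mind (untested on my data; as I understand it, a plain wildcard index can only support predicates on one field at a time, so it would not cover filter-plus-sort combinations):

// Catch-all index over every field, intended for the rarely used filters.
db.myCollection.createIndex({ "$**": 1 })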
However, in testing a compound index, I’m noticing the performance is not any better. Details below.

Let’s assume the user will search on

{
  "field1": {
    "$exists": true
  },
  "field4": {
    "$lte": 100000
  },
  "field3": {
    "$in": [
      "val1",
      "val2",
      "val3",
      "val5"
    ]
  },
  "field2": {
    "$exists": true
  }
}

and that the user enters no sort, so the system will sort by “_id” by default for predictable ordering.

db.myCollection.createIndex({ field1: 1, field4: 1, field3: 1, field2: 1, _id: 1 })

Sort should go last, right?

I also created single-field indexes for these 4 fields; “_id” is indexed automatically.

db.myCollection.createIndex({ field1: 1})
db.myCollection.createIndex({ field2: 1})
db.myCollection.createIndex({ field3: 1})
db.myCollection.createIndex({ field4: 1})

However, when running the following query, Mongo doesn’t make use of this index and takes between 5 and 15 seconds to complete. If I remove the sort, Mongo makes use of the single-field (field1) index and takes less than 200ms to complete.

db.myCollection.find(
    {"field1":{"$exists":true},"field4":{"$lte":100000},"field3":{"$in":["val1","val2","val3","val5"]},"field2":{"$exists":true}}
).sort({'_id': 1}).limit(100).explain('executionStats')
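If it would help, I can also force the compound index with hint() and post that explain output as well; something like:

db.myCollection.find(
    {"field1":{"$exists":true},"field4":{"$lte":100000},"field3":{"$in":["val1","val2","val3","val5"]},"field2":{"$exists":true}}
).sort({'_id': 1}).limit(100)
    .hint({ field1: 1, field4: 1, field3: 1, field2: 1, _id: 1 })  // force the compound index
    .explain('executionStats')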

Before I add more specifics, like the output of the “explain” above, please let me know if there is something obvious that an engineer with very little Mongo experience would not know about. I’ve tried my best to experiment with different indexes, review all the official docs on this subject, and search around for recommendations/explanations, and this is still stumping me.

The use-case above looks like it needs the attribute pattern.

Not according to the ESR rule.


@steevej - Thanks for linking the ESR rule, I’m surprised I didn’t come across that strategy earlier when I was looking for the correct approach.

The ESR approach will help out in a big way. After adding some indexes focusing on the “E” in ESR, queries against the 500k data set are back to snappy times, and the test queries that I’m working with are using the indexes that I’ve created.
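To make that concrete for anyone who finds this later, a representative index for the example query in my original post (setting aside the $exists predicates, which I come back to below) puts the $in field first as the equality part, the sort second, and the range last; something like:

// E: field3 ($in counts as equality), S: _id (the sort), R: field4 ($lte).
db.myCollection.createIndex({ field3: 1, _id: 1, field4: 1 })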

However, as a follow-up, I do worry that for all queries to be efficient as the data set grows and as more fields become searchable, I’d have to create 50 indexes, potentially a lot more. Are there best-practice strategies for this problem? Perhaps splitting the document up into separate collections? For my use case, writes are done via async processing jobs rather than user interaction, so keeping indexes updated is not a concern when it comes to write speed.

I don’t think the attribute pattern is applicable to my use case. Referring to my fields in the original post, these fields cannot be combined in the way the attribute pattern describes; they are distinct fields that are unrelated to each other.

I can, however, turn field1 and field2 into equality fields by introducing boolean fields that indicate whether they are present, e.g. hasField1 and hasField2 (sketched below). field3 is filtered with $in, which acts as an equality operator for ESR purposes. These will sufficiently narrow results down so that I can get by with a couple dozen indexes, where the prefixes are all of the variations of hasField1 + hasField2 + field3 and the sort portion of each index is either _id alone or one of the fields in the sort fields list followed by _id.

But as the data set grows, I worry that these indexes will not perform as well unless I add more fields, potentially ones that could be used for range narrowing under the ESR rule. That would mean even more indexes, though, and that seems untenable.
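For concreteness, the backfill and a couple of the representative indexes might look like this (a sketch; hasField1 and hasField2 are new fields that my async write jobs would maintain going forward):

// One-time backfill using an aggregation-pipeline update (MongoDB 4.2+).
// The $type aggregation operator returns "missing" when a field is absent.
db.myCollection.updateMany({}, [
    { $set: {
        hasField1: { $ne: [ { $type: "$field1" }, "missing" ] },
        hasField2: { $ne: [ { $type: "$field2" }, "missing" ] }
    } }
])

// Two of the couple dozen indexes: equality prefix, then the sort portion.
db.myCollection.createIndex({ hasField1: 1, hasField2: 1, field3: 1, _id: 1 })
db.myCollection.createIndex({ hasField1: 1, hasField2: 1, field3: 1, field5: 1, _id: 1 })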

Thank you again for your timely feedback, it was super helpful. Let me know what you think about my concerns above.
