MongoDB index, how does it work?

Luciano_Bigiotti · September 18, 2023, 9:58am

Hi community,

I’m working on the index building.

I read the mongoDB documentation that talked about best practice on building index.
For example, I got that you should follow the ESR rule or that is better put in the first place of the index the attribute that have an higher cardinality.

I thought I’ve grasp the concept but when I put in practice there were a few things that did not work as I expected.

Let me explain better.

The query under-study is the following (do not pay attention to eventually mistakes and how the query is written):

{
	$and[	
		{
		   “a”:1
		},
		{
		  “b”:2
		},
		{
		  “c”:
			{$in:[3,4,5]}
		},
		{
		  “d”: true // this is always fixed
		}
	],
	$or[
		{
		  e: {$regex: “something”, $options: I}
		}
	]

	$sort:{
		{ f: 1}
	}
}

I created also the following index (I used as I said the ESR and the cardinality rules):

a:1, b:1, f:1, c:1, e:1
I used also the partial index on the d field that is fixed for all the query and the collation to get better performance with the case insensitive regex.

But, if I run the query with the explain I get that the index used is another one that have some fields of the query and other one that is not present, for example:

b:1, c:1, a:1, h:1, i:1
(c is a range and it should go after the sort field, h and i are not even present in the query)

So, my questions are:

how is it possible that an index with field not present in the query have an higher score than mine?
Another strange thing that I have observed is that the index chose change with the number of result returned by the query. I mean, if the query returns 6 documents mongodb uses an index but if it returns 100 documents mongodb uses a different index, how is it possible?
Is there a best practice to understand what is the best field to use in an index?

Thanks in advance!

Peter_Hubbard · September 18, 2023, 1:32pm

If there are multiple candidate indexes for a given query, the MongoDB Query Planner will test their relative performances by seeing which index returns 100 documents first. That index will then be used going forward, until another candidate index is added or the MongoDB process is restarted.

Your ‘e’ field is queried with a regex, which can only use an index if it is anchored left.

Regarding using different indexes: each clause of an $or query can use a different index, perhaps this is what you are seeing.

Christopher_Harris · September 18, 2023, 1:54pm

Hi @Luciano_Bigiotti, great questions!

Further to what my colleague @Peter_Hubbard mentioned, I’d broadly say that it is the job of any query optimizer to execute all queries as efficiently as possible. If you want to dive into the specifics of why one index/plan was chosen over another then you may wish to review the .explain("allPlansExecution") output for the operations.

I spoke about this topic a few years ago in Tips and Tricks for Query Performance: Let Us .explain() Them. You can separately find a diagram about how plan caching work presently works in MongoDB on our Query Plans page.

Best of luck!
-Chris

Luciano_Bigiotti · September 20, 2023, 7:21pm

Hi Peter, Christopher,

Thanks for your answer and for the video.
I watched it very carefully and I found it very interesting.

But I did not find out the answer to all my questions.

I do not understand why, executing the same query but with different values, mongoDB uses different index.

For example, if I run:

{
	$and[	
		{
		   “a”:1
		},
		{
		  “b”:2
		},
		{
		  “c”:
			{$in:[3,4,5]}
		},
		{
		  “d”: true // this is always fixed
		}
	],
	$or[
		{
		  e: {$regex: “something”, $options: I}
		}
	]

	$sort:{
		{ f: 1}
	}
}

that returns 6 documents, mongoDB uses an index (let me call indexA), but if I run:

{
	$and[	
		{
		   “a”:10
		},
		{
		  “b”:12
		},
		{
		  “c”:
			{$in:[3,4,5]}
		},
		{
		  “d”: true // this is always fixed
		}
	],
	$or[
		{
		  e: {$regex: “something”, $options: I}
		}
	]

	$sort:{
		{ f: 1}
	}
}

that returns 200 documents, mongoDB uses another index (let me call indexB).

How is it possible that changing the values in the query, mongoDB changes the used index?
The index used in a query is selected by query planner at the beginning and must be the same until a new query planner process start, isn’t it?

Another question:
Apart from the ESR rule and the cardinality, is there some other best practice to use?
Is there a way to choose the fields to use in an index?

Thanks in advance

Christopher_Harris · September 20, 2023, 9:09pm

Glad to hear that the video was helpful!

The .explain("allPlansExecution") output from the specific queries of interest in your environment would have the direct answers to some of your questions. Without that information we can respond to your questions in general terms, please find such responses below.

How is it possible that changing the values in the query, mongoDB changes the used index?

This is intentional and by design. I mentioned in my previous response that the job of any query optimizer to execute all queries as efficiently as possible. There are two components of that sentence which are important. The first is that I specifically didn’t make reference to a phrase like “the best index”. This is because it is not always the case that there is a single index that performs optimally for a given query shape. The second is that I mentioned “all queries”. The specific values of the query predicates matter for both of these items and optimizers do their best to account for them when planning and executing queries.

The index used in a query is selected by query planner at the beginning and must be the same until a new query planner process start, isn’t it?

No, this is not correct. In general the optimizer does attempt to reuse plans via a caching mechanism to minimize the amount of repetitive work that it needs to do. However because the values of the predicates can make a difference when it comes to query efficiency, there are safeguards in place to prevent plans from getting inappropriately used indefinitely for a given query shape. The aforementioned Query Plans page contains some details about the plan caching process.

Apart from the ESR rule and the cardinality, is there some other best practice to use?

Cardinality is typically not nearly as important of a factor when it comes to designing indexes as other things (such as they key ordering and reusability). The ESR Strategy provides easy-to-remember guidance that is effective in a variety of situations. That doesn’t mean that it always has the answer as there are always situations that require further consideration and have additional nuance. That said, I am not personally aware of any other ‘rules of thumb’ that are as generally applicable or that supersede this one.

Is there a way to choose the fields to use in an index?

I’m not sure what this question means.

When preparing an execution plan for a query, the optimizer will take a look at the fields that are used in the query and will bound the index scan as much as possible. Its ability to do so is driven by how the fields are used in the index along with the structure of the index itself. The optimizer will be as aggressive as possible when applying the rules while making sure that the result set will be logically correct.

Hope this helps.

Best,
Chris