How would the ESR rule of Indexes behave if one of the fields is optional?

My sample document:

{
	"status": 2,
	"queue": true,
	"department": "apple",
    "ts": ISODate("2022-05-10T08:35:52.451Z")
}

I typically have two kinds of queries, where one of the fields in the query is optional:

{
	"status": { "$ne": 4 },
	"queue": true,
	"department": { "$in": ["banana"] }
}, 
{ "sort": { "ts": -1 }, "limit": 20}

The second type:

{
	"status": { "$ne": 4 },
	"queue": true
}, 
{ "sort": { "ts": -1 }, "limit": 20}

(notice there is no department field)

I also use ts field for pagination:

{
	"status": { "$ne": 4 },
	"queue": true,
	"department": { "$in": ["banana"] },
    "ts": { "$lte": lastTimestamp }
}, 
{ "sort": { "ts": -1 }, "limit": 20}

Applying ESR rule:

E = status, queue, department
S = ts
R = ts

Now, what is the ideal index?

  1. A single index like following can cover all of these queries?
{
	"status": 1,
	"queue": 1,
	"department": 1,
    "ts": 1
}
  1. Or do I need to create one more index without the department
{
	"status": 1,
	"queue": 1,
    "ts": 1
}

Most likely with the single index status:1,queue:1,department:1,ts:1 you will end up with in memory sort of ts if the department is not specify.

Depending of the selectivity of status and queue vs the cardinality of department you might be better off with the index status:1:queue:1,ts:1.

Note that $ne:4 might not be able to use the index. Not-equalilty is not the same as Equality. It is closer to Range.

whoa, TIL! Then order of the status filed should be changed I guess? something like:

{
	"queue": 1,
	"department": 1,
    "ts": 1,
    "status": 1,
}

?

The best is to try alternatives and to compare the explain plans. Start with https://www.mongodb.com/docs/manual/tutorial/analyze-query-plan/.

1 Like

Hey @V_N_A,

Just adding a few more points to what @steevej has rightly said. :smile:

Firstly, I assume that when you wrote the first query ie.

{
	"status": { "$ne": 4 },
	"queue": true,
	"department": { "$in": ["banana"] }
},
{ "sort": { "ts": -1 }, "limit": 20}

you meant this:

find({"status": {"$ne": 4}, "queue": true, "department": {"$in": ["banana"]}}).sort({ts:-1}).limit(20)

because the first one is just projecting the documents.

Coming back to the post, a general rule of thumb one considers when dealing with compound indexes is that you don’t need an additional index if your query can be covered by the prefix of the existent compound index. Another thing to note is that Inequality operators such as $ne or $nin are range operators, not equality operators, so moving status to the end might help speed up your first query. But in the case of your second query, since there is no department, you will end up with memory sort since prefixes won’t work.

About the SORT stage: although having a SORT stage is not necessarily a bad thing if the result set is reasonably small, the best outcome is to avoid having this stage altogether, since it can cause the query to fail if it runs out of memory to do the sorting. Using proper indexing can avoid the SORT stage.

To demonstrate out all this, I tried the same with the sample document you provided and created a sample collection with 1000 such documents using mgeneratejs, a tool to create example documents by following some patterns you define.

If we create an index like the one you suggested here:

I named the index status_queue_department_ts and so on the first query ie.

{"status": {"$ne": 4}, "queue": true, "department": {"$in": ["banana"]}}).sort({ts:-1}

the Explain Plan gives the following results:

expRun = db.collections.explain("executionStats")
expRun.find({"status": {"$ne": 4}, "queue": true, "department": {"$in": ["banana"]}}).sort({ts:-1})
...
  executionStats: {
    executionSuccess: true,
    nReturned: 152,
    executionTimeMillis: 0,
    totalKeysExamined: 169,
    totalDocsExamined: 152,
    executionStages: {
      stage: 'FETCH',
...
      inputStage: {
        stage: 'SORT',
...
        inputStage: {
          stage: 'IXSCAN',
...

From the execution stats, it shows it examined a total of 169 keys and 152 documents and returned 152 documents, which is not a bad query targeting ratio. Note that the winning plan contains a SORT stage.

Now for the second query ie. without the department field, the same .explain, gives the following results:

expRun.find({"status": {"$ne": 4}, "queue": true}).sort({ts:-1})
...
  executionStats: {
    executionSuccess: true,
    nReturned: 477,
    executionTimeMillis: 2,
    totalKeysExamined: 486,
    totalDocsExamined: 477,
    executionStages: {
      stage: 'FETCH',
...
      inputStage: {
        stage: 'SORT',
...
        inputStage: {
          stage: 'IXSCAN',
...

we can see now, that it has returned 477 documents and had to scan 486 keys. This is also not a bad query targeting ratio, but this also contains a SORT stage.

Thus for your first index, in conclusion, it’s not the ideal index for the query since the explain output for both queries contains a SORT stage, even though the query targeting metric is not too bad.

For your second index ie. status_queue_ts, the first query returned:

...
  executionStats: {
    executionSuccess: true,
    nReturned: 152,
    executionTimeMillis: 1,
    totalKeysExamined: 486,
    totalDocsExamined: 477,
    executionStages: {
      stage: 'SORT',
...
      inputStage: {
        stage: 'FETCH',
...
        inputStage: {
          stage: 'IXSCAN',
...

ie. 486 keys scanned for only 152 returned documents. This is not great, since it means the server needs to examine ~3 documents for each relevant one, thus the server does a lot of unnecessary work. It would be the same if no department is mentioned as is the case of your second query. Note that this also contains a SORT stage, which makes it even more unappealing.

Now, we if change our index order to

{
   "queue":1,
   "department":1,
   "ts":1,
   "status":1
}

naming it as queue_department_ts_status, for the first query, .explain returns:

...
  executionStats: {
    executionSuccess: true,
    nReturned: 152,
    executionTimeMillis: 0,
    totalKeysExamined: 168,
    totalDocsExamined: 152,
    executionStages: {
      stage: 'FETCH',
...
      inputStage: {
        stage: 'IXSCAN',
...

We can see it scanned 168 keys and returned 152 documents (not a bad query targeting metric), all without using a SORT stage.

For the second query, however, since there is no department, it relies on memory sort:

...
  executionStats: {
    executionSuccess: true,
    nReturned: 477,
    executionTimeMillis: 2,
    totalKeysExamined: 526,
    totalDocsExamined: 477,
    executionStages: {
      stage: 'FETCH',
...
      inputStage: {
        stage: 'SORT',
...
        inputStage: {
          stage: 'IXSCAN',
...

The number of keys scanned increases to 526 with the same 477 documents returned, but all 477 examined documents are returned, so the server does a little unnecessary work. However the presence of the SORT stage negates this.

As we can see, the selection of the order of indexes would highly depend on what query you are going to use the most as well as the operators you will use in them. Eg. When $in is used alone, it is an equality operator that does a series of equality matches, but acts like a range operator when it is used with .sort(). Additionally, the use case you posted might require two different indexes if you want to avoid having a SORT stage. One with department, and the other without.

I would highly suggest you read the following to cement your knowledge about indexes.

Please let us know if there’s any confusion in this. Feel free to reach out for anything else as well.

Regards,
Satyam Gupta

7 Likes

Hey Satyam, thank you for taking your time and replying. It was insanely helpful! Let me read all the resources you linked and get back to you!

2 Likes

Hey @Satyam I had a question regarding range queries and cardinality. What should be the order, is it like high cardinality to low cardinality in case of range fields?

for e.g. in the queries above, cardinality for ts is very high compared to status. So which is correct for the following:

{
   "queue":1,
   "department":1,
   "ts":1,
   "status":1
}

OR

{
   "queue":1,
   "department":1,
   "status":1,
   "ts":1,
}

Hi @V_N_A,

Since we don’t know the statistical distribution of the fields in your collection, we can only simulate the possibilities (see the previous answer). You might want to perform a similar explain() experiments using your actual data to arrive at the best scenario that works in your usecase (try having no COLLSCAN, no SORT).

That said, please do remember that the selection and ordering of an index depends on a lot of factors. The purpose of an index is to be able to zoom in to a relevant part of the collection and eliminate as much irrelevant documents as possible so that the server don’t need to do unnecessary work to while running the query. When talking about cardinality, fields with low cardinality have a comparably lower value in an index. For example, if a field cardinality is so low that it encompass a third of the collection, it’s less useful as an index vs. a field with high cardinality that can identify a single document (e.g. _id ).
I believe that your compound index would help with range and sorting, so the ESR rule of thumb should be more relevant for field ordering choice in a compound index, irrespective of the cardinality involved. But I would still suggest you to try creating the indexes and checking the explain output to see which one works better for your use case (since you know your exact dataset and the queries that you are going to use mostly). The resources that I mentioned in previous answer should also help you a lot.

Let us know if there is any confusion regarding this. Feel free to reach out for anything else as well.

Regards,
Satyam

3 Likes

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.