Index recommendation for a queries that does not tick all boxes of the ESR rule

v_b · September 9, 2022, 3:28am

I already know about the Equality-Sort-Range rule when designing optimal indices for MongoDB queries, however, recently I’ve came across some queries that I am not sure which index would be the best fix.

The collection has tens of millions of documents.

The fields are:

id: string, an identifier with uniqueness constraint, high cardinality
score: number, a score used for sorting, lower cardinality than a, for some queries it has high cardinality, for others it may have low cardinality. Most of the queries the cardinality of b will be low.

The queries are:

find({id: {$in: [array of ids, average size 1000]}}).sort({score: -1, id: -1}).limit(10)

find({
  $and: [
    {id: {$in: [array of ids, average size 1000]}},
    {
      $or: [
        { score: { $lt: 100 } },
        { score: 100, id: { $lt: "some id" } },
      ],
    },
  ],
}).sort({score: -1, id: -1}).limit(10)

So basically what I am trying to do, is seek based pagination, based on the result of the first query, I can obtain some anchors to look for data in the next page. Say:

I have 5 documents, and the page size to return data is 2.
[id: “id 1”, score: 2]
[id: “id 2”, score: 4]
[id: “id 3”, score: 5]
[id: “id 4”, score: 4]
[id: “id 5”, score: 1]

First page will be:
[id: “id 3”, score: 5]
[id: “id 4”, score: 4]
with query find({id: {$in: ["id 1","id 2","id 3","id 4","id 5"]}}).sort({score: -1, id: -1}).limit(2)

Second page will be:
[id: “id 2”, score: 4]
[id: “id 1”, score: 2]
with query

find({
  $and: [
    {id: {$in:  ["id 1","id 2","id 3","id 4","id 5"]}},
    {
      $or: [
        { score: { $lt: 4 } },
        { score: 4, id: { $lt: "id 4" } },
      ],
    },
  ],
}).sort({score: -1, id: -1}).limit(10)

According to the ESR rule, the index should probably be something like {filterId: -1, score: -1}, so that the $in equality rule is satisfied first. But there would still be in memory sorts. Is there a better index available?

Jason_Tran · September 15, 2022, 7:59am

Hi @v_b - Welcome to the community.

According to the ESR rule, the index should probably be something like {filterId: -1, score: -1}, so that the $in equality rule is satisfied first. But there would still be in memory sorts. Is there a better index available?

Note, I believe filterId mentioned above should be Id but correct me if I am wrong here.

In saying so, based off the sample documents and details you’ve provided I am curious if you have attempted to create an index definition of {"score": -1, "id": -1}?

I compared both the following:

{ "id": -1, "score": -1}
{ "score": -1, "id": -1}

The query planner in my test environment which only contains the sample documents you’ve provided selects the index 2. { "score": -1, "id": -1} to use for both of the queries you had advised.

In terms of the { "score": -1, "id": -1} index, here are some details of the winningPlan for the query:
db.collection. find({id: {$in: ["id 1","id 2","id 3","id 4","id 5"]}}).sort({score: -1, id: -1})

winningPlan: {
      stage: 'FETCH',
      filter: { id: { '$in': [ 'id 1', 'id 2', 'id 3', 'id 4', 'id 5' ] } },
      inputStage: {
        stage: 'IXSCAN',
        keyPattern: { score: -1, id: -1 },
        indexName: 'score_-1_id_-1',
        isMultiKey: false,
        multiKeyPaths: { score: [], id: [] },
        isUnique: false,
        isSparse: false,
        isPartial: false,
        indexVersion: 2,
        direction: 'forward',
        indexBounds: { score: [ '[MaxKey, MinKey]' ], id: [ '[MaxKey, MinKey]' ] }
      }
    }

We can see that the index is used before a filter is done for the specified “id” values.Additionally, here’s some further details regarding some of the testing with both indexes (from my test environment with only 5 of the sample documents you have provided):

find() without $or:
IXSCAN → FETCH → LIMIT using {score:-1, id:-1} index
IXSCAN → SORT → FETCH using {id:-1, score:-1} index
With $or:
( IXSCAN + IXSCAN ) → SORT_MERGE → FETCH → LIMIT all using {score:-1, id:-1}
IXSCAN → FETCH (with conditions) → SORT using {id:-1, score:-1}
Rejected plan: ( IXSCAN + IXSCAN ) → SORT → FETCH using {id:-1, score:-1}

Note: {score:1,id:1} would work in the same manner as {score:-1,id:-1} (just in reverse) for this test.

Based off the 5 sample documents you’ve provided in my test environment, we can see in both instances that the index {id:-1, score:-1} will have an in memory sort. This is due to the sort() condition you have specified. Where as compared to the index {score:-1,id:-1} , the documents are already sorted and just need to be fetched.

For your secondary query, the winning plan differs slightly as there is use of the $or operator with some range conditions in which a ‘SORT_MERGE’ is performed. Although this index may work for your use case, there isn’t a single “silver bullet” index for all queries. You’ll need to consider it on a query-by-query basis. You could possibly consider {score:1, id:1} and {id:-1, score:-1} indexes for the query using $or so that the query planner may decide which index to use for each branch of the $or operator.

In saying the above, I believe your main concern is the in-memory sort (correct me if I am wrong here). The SORT stage is also less desirable due to the 100MB limit of memory use (in which the operation will result in an error if it exceeds this amount and allowDiskUse() is not used).

It’s important to note that the ESR rule is used as a general rule of thumb and is a good starting place applicable to most use cases (i.e. not all). I would recommend going over the MongoDB .local Toronto 2019: Tips and Tricks for Effective Indexing slides which also has an example of an exception to the ESR rule. Additionally, perhaps the Sort of Multiple Fields documentation may be of use as well.

If you’re on Atlas using an M10+ tier cluster, you should have access to the performance advisor which may be able to help you optimize your indexes.

It is highly recommended you test this index first against a test environment with more data to verify it meets all your use case / requirements before creating this in production. You can verify the query execution in more detail using db.collection.explain("executionStats").find() .

Regards,
Jason