Why is collection scan performance better without an index in my case?

Hi,

I have a collection with 5M docs; its schema is:

{
  "appId": 999,
  "os": 1,
  "activeDay": [
    "2024",
    "202404",
    "20240405",
    "20240410",
    "20240411",
    "20240412",
    "20240413",
    "20240414",
    "20240415",
    "20240416",
    "20240417",
    "202405",
    "20240519",
    "20240521",
    "20240526",
    "20242Q",
    "2024W14",
    "2024W15",
    "2024W16",
    "2024W20",
    "2024W21"
  ]
}

The aggregate pipeline is:

[
  {
    $match: {
      "activeDay": "2024",
      os: {$gt: 0, $lt: 9}
    }
  },
  {
    $group: {
      _id: "$appId",
      value: {
        $count: {}
      }
    }
  }
]

The index is:

{
    "activeDay": 1,
    "os": 1,
    "appId": 1
}

The explain plans for both the collection scan and the index scan are attached below as executionStats files.

My question is:

  • Why is collection scan performance better than using the index in my case?

Please let me know if you have any comments. Thank you very much.

executionStats_with_coll_scan.txt (8.2 KB)
executionStats_with_index.txt (9.6 KB)
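
For reference, the stats were gathered with something like the following (the collection name events is a placeholder):

db.events.explain("executionStats").aggregate([
  { $match: { activeDay: "2024", os: { $gt: 0, $lt: 9 } } },
  { $group: { _id: "$appId", value: { $count: {} } } }
])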

After inspecting the plans of both scenarios, this is my understanding.

Your query wants to count, for each appId, the number of documents that have "2024" in the activeDay array and 0 < os < 9. Note that even if "2024" appears more than once in the activeDay array, the document is only counted once.

Let’s start with COLLSCAN: you inspect the documents one by one, and if there is any "2024" in activeDay and 0 < os < 9, you pass that document (or more precisely, its appId value) to the next stage.
The group stage then counts the number of appearances of each appId value and returns the results.
Quite straightforward. Even if there are multiple "2024" values in activeDay, as soon as we encounter the first one we can safely skip the rest of the array and mark the current document as satisfying the filter.
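
To illustrate the shape of the work, here is a conceptual sketch of this plan in plain JavaScript (not the actual server code; the tiny dataset is made up):

// tiny stand-in dataset; the real plan iterates all 5M documents
const allDocuments = [
  { appId: 999, os: 1, activeDay: ["2024", "202404", "2024"] },
  { appId: 999, os: 3, activeDay: ["2023"] },
  { appId: 7,   os: 2, activeDay: ["2024"] },
];

// COLLSCAN: one pass over every document
const counts = {};
for (const doc of allDocuments) {
  // includes() stops at the first "2024", so duplicates in the array cost nothing extra
  if (doc.activeDay.includes("2024") && doc.os > 0 && doc.os < 9) {
    // $group: count matching documents per appId
    counts[doc.appId] = (counts[doc.appId] || 0) + 1;
  }
}
// counts => { "7": 1, "999": 1 }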

Now we create a multi-key index on { activeDay: 1, os: 1, appId: 1 }. Because activeDay is an array of values, the index unwinds the array so that it can index each element. Therefore, multiple index keys can point to a single document; I think that’s why it’s called “multi-key”.

For example:

// Document
{
  _id: "a", // just for demonstation
  appId: 999,
  os: 1,
  activeDay: ["2024", "202401", "202402", "2024"]
}

The multi-key index will create the following keys:

{ activeDay: "2024", os: 1, appId: 999 }    => { _id: "a" }
{ activeDay: "202401", os: 1, appId: 999 }  => { _id: "a" }
{ activeDay: "202402", os: 1, appId: 999 }  => { _id: "a" }
{ activeDay: "2024", os: 1, appId: 999 }    => { _id: "a" }

Let’s run the same query again using this index. As you can see, I purposely put "2024" in the array twice to make the point.
Now we inspect the index keys one by one. Because all needed fields are indexed (os and activeDay for the filter, appId for the group), the index has all the information to answer the query and there is no need to look at the actual documents, a.k.a. PROJECTION_COVERED.
If the value of activeDay is "2024" and 0 < os < 9, we pass the index key to the next stage.
With the example above, there are two { activeDay: "2024", os: 1, appId: 999 } => { _id: "a" } keys that satisfy the query and get passed on, even though they point to the same document.
Because we only need to count documents, we have to eliminate all duplicated references ({ _id: "a" }) so that the group stage only needs to count appearances and return the results. That’s the UNIQUE stage in the execution plan.
You can see that this stage takes 8185 - 4574 = 3611 ms just to make sure there are no duplicate document references after the index scan. In your case, I can tell that no such duplicates exist because of "dupsDropped" : 0.0. But the database engine does not know that fact in advance, and therefore it has to run the UNIQUE stage.
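
Conceptually, the UNIQUE stage does something like this (again a plain-JavaScript sketch with made-up entries, not the server’s actual implementation):

// index entries returned by the scan with bounds activeDay: "2024";
// two of them reference the same document "a", as in the example above
const indexScanResults = [
  { key: { activeDay: "2024", os: 1, appId: 999 }, docId: "a" },
  { key: { activeDay: "2024", os: 1, appId: 999 }, docId: "a" },
  { key: { activeDay: "2024", os: 2, appId: 7 },   docId: "b" },
];

// UNIQUE: drop repeated document references before the group stage
const seen = new Set();
const forGroupStage = [];
for (const entry of indexScanResults) {
  if (seen.has(entry.docId)) continue; // same document reached via another key
  seen.add(entry.docId);
  forGroupStage.push(entry.key.appId); // covered: appId is read from the key itself
}
// forGroupStage => [999, 7]; each document is now counted once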

Hope this can help you.


Thank you, sir!

Your explanation is very clear!
So, if I want to eliminate the “unique” stage, I should make my index unique, right?

Making the index unique will ensure that there is no “accidental” data in your collection that breaks the constraint. One caveat: for a unique multikey index, the constraint applies across separate documents, not within a single document’s array, so a document like { _id: "a", appId: 999, os: 1, activeDay: ["2024", "2024"] } is still allowed. Duplicates across documents, however, will throw an error:

// duplicate across documents: both produce the key { activeDay: "2024", os: 1, appId: 999 }
{ _id: "b", appId: 999, os: 1, activeDay: ["2024", "202401"] }
{ _id: "c", appId: 999, os: 1, activeDay: ["2024", "202402"] }

But at the end of the day, the planner decides whether or not to use that information. As far as I know, it might still use the UNIQUE stage depending on the plan. Maybe someone with deeper insight into the query planner can answer this question.

For now, my best suggestion is to keep testing with different index keys and options and decide on the best compromise for these “edge” cases. Maybe you will stumble upon some interesting discoveries and can share your knowledge back here.

For example, if your filter on os narrows down the number of documents significantly, you can create an index { os: 1 }. The query might then use this index to limit the number of documents that need to be scanned, then fetch the actual documents for further filtering, without a UNIQUE stage. This plan pays the fetch penalty but might still beat doing UNIQUE, since the deduplication work is itself expensive.
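
A sketch of that alternative (again with a placeholder collection name):

// useful only if few documents have 0 < os < 9; otherwise the scan is not selective
db.events.createIndex({ os: 1 })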
