Aggregations: Why is $group slow with indexed string fields but not on date?

Setup

Imagine this document shape:

{
  dateString: "2024-10-31", // type string
  realDate: ISODate("2024-10-31"), // type date
}

Assume both fields are indexed separately.

Problem

If I create an aggregation with a $match stage followed by a $group stage, the type of the queried field makes a difference:

  • on dateString, the $group stage causes IXSCANFETCHGROUP actions (as seen in explain). This is considerably slower than querying on realDate.
  • on realDate, the $group stage causes only IXSCANPROJECTION_COVERED ← This is magnitudes faster (on 500k records: 9 ms vs. 30 sec+)

Any ideas why that is and how to solve this?


Aggregations:

// This aggregation is slow despite the string field being indexed.
[
  {
    $match: {
      dateString: {
        $gte: "2024-10-01",
        $lte: "2024-10-31"
      }
    }
  },
  {
    $group: {
      _id: "$dateString",
      count: { $sum: 1 }
    }
  }
]
// This aggregation is fast because it's on an indexed date field.
[
  {
    $match: {
      realDate: {
        $gte: ISODate("2024-10-01"),
        $lte: ISODate("2024-10-31")
      }
    }
  },
  {
    $group: {
      _id: {
        $month: "$realDate"
      },
      count: {
        $sum: 1
      }
    }
  }
]

MongoDB version: 7.0.15

Through some help on StackOverflow we could narrow down the issue to our collation.

Question to MongoDB professionals: Why is it that collections prohibit the index from covering the query fully? How can this be solved?

I will answer this first. Index can only cover those queries that have the same collation. So when you create your index, you can specify the collation that your queries will use to the index. For example, if your query use en collation

db.collectionName.createIndex({ dateString: 1 }, { collation: { locale: "en" } });

Above index will cover any queries that use { collation: { locale: "en" } }.


Now you know how to create index with a specific collation. The reason an index cannot cover queries that use different collation is that the underlying index key data are different. To clearly demonstrate this point, you can create another index with different collation, for example, using default simple collation.

db.collectionName.createIndex({ dateString: 2 }, { collation: { locale: "simple" } });

Now we can issue the same query/aggregate to see the different

db.collectionName.aggregate([.../** use dateString to group */], { collation: { locale: "en" } }).explain()
/** You will see something like this
 * {
 *   ...
 *   indexBounds: {
 *     dateString: [
 *       '[CollationKey(0x1612161a050e1412050e1214010e010e), CollationKey(0x1612161a050e1412050e1814010e010e))'
 *     ]
 *   }
 *   ...
 * }
db.collectionName.aggregate([.../** use dateString to group */], { collation: { locale: "simple" } }).explain()
/** You will see something like this
 * {
 *   ...
 *   indexBounds: { dateString: [ '["2024-10-01", "2024-10-31")' ] }
 *   ...
 * }

The simple collation is actually the raw binary data, so the index key don’t need to encode anything, that’s why you see the actual string date here. But en collation need to encode your string to the CollationKey so that it can sort accordingly. But since the data is encoded, if you issue query with simple collation on en collation index, it will compare "2024-10-01" with a bunch of CollationKey(0x...). You can see now it’s quite impossible for it to know the value is matched or not.

Then there are 3 cases here.

  • If you only have en collation index and issue simple collation query, the index is useless, so your query will use COLLSCAN to get the documents, then get dateString field form them, then pipe to $group stage.
  • If you have en collation and issue en collation query, the index can be used now, but the key is not your desired value, it’s encoded. So after IXSCAN to set matching keys, your query need to FETCH the actual documents of those keys, get dateString from documents then pipe to $group stage.
  • If you have simple collation and issue simple query, the index can be used, and it has raw data in it. So your query is covered, so PROJECTION_COVERED can get the field data directly from index key, then pipe to $group.

TL;DR: Index keys in collated index (not simple) are encoded, so your query always need to fetch the document to get field value.

Thanks.

Since we used the matching collation in the collection and the query (both German), the reason for the FETCH stage is that within the collated index, the key value is encoded and thereby useless for $group, even though the actual string data (e. g. “2024-10-01”) would not appear to require a change to be displayed in German (2024-10-01 is the same in German and in English; but technically, collations transform this differently). I understand that.

1 Like

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.