How to use an index in $group with $sum then $sort

Good morning! I have a collection, product_part_number, and it contains a little more than 2.3 million documents like the following:

  { _id: 2988725, partNumber: '2WFG4' },
  { _id: 3177996, partNumber: '2WFP4' }

(Yes, it only contains _id and partNumber). I also have the following compound index:

  {
    v: 2,
    key: { partNumber: -1, _id: 1 },
    name: 'partNumber_-1__id_1'
  }

Now I need to see which partNumber corresponds to the most products using the following aggregation:

db.product_part_number.aggregate( [
  {
    $group: {
      _id: "$partNumber",
      count: { $sum: 1 }
    }
  },
  {
    $sort: { count: -1 }
  },
  {
    $limit: 1
  }
])

However, running explain("executionStats") results in an error:

MongoServerError: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.

If I add { allowDiskUse: true } to the explain() call, I can see that no index is used:

        executionStats: {
          executionSuccess: true,
          nReturned: 2388237,
          executionTimeMillis: 15192,
          totalKeysExamined: 0,
          totalDocsExamined: 2388237,
      .
      .
      .

In my situation, how can I use the index? Thanks a lot in advance!

You can use the hint parameter to force the query to use an index:

cursor.hint() - MongoDB shell method - w3resource.
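For example, with the index from the question, the hint can be passed as an option to aggregate(), something like:

db.product_part_number.aggregate(
  [
    { $group: { _id: "$partNumber", count: { $sum: 1 } } },
    { $sort: { count: -1 } },
    { $limit: 1 }
  ],
  // the index name (or its key pattern) goes in the hint option
  { hint: "partNumber_-1__id_1", allowDiskUse: true }
)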

@Daniel_Li1 How many distinct 'partNumber' values are there in the collection?

Have you tried adding a sort to the query before the group? I tested this locally (v6 Server) and without the sort it's using a COLLSCAN, and after adding it, it's using an index scan:

db.Test.explain().aggregate( [
  {
    $sort: { partNumber: -1 }
  },
  {
    $group: {
      _id: "$partNumber",
      count: { $sum: 1 }
    }
  },
  {
    $sort: { count: -1 }
  },
  {
    $limit: 1
  }
])

Old:

          winningPlan: {
            queryPlan: {
              stage: 'GROUP',
              planNodeId: 2,
              inputStage: {
                stage: 'COLLSCAN',
                planNodeId: 1,
                filter: {},
                direction: 'forward'
              }
            },

New:

          winningPlan: {
            queryPlan: {
              stage: 'GROUP',
              planNodeId: 3,
              inputStage: {
                stage: 'PROJECTION_COVERED',
                planNodeId: 2,
                transformBy: { partNumber: true, _id: false },
                inputStage: {
                  stage: 'IXSCAN',
                  planNodeId: 1,
                  keyPattern: { partNumber: -1, _id: 1 },
                  indexName: 'partNumber_-1__id_1',
                  isMultiKey: false,
                  multiKeyPaths: { partNumber: [], _id: [] },
                  isUnique: false,
                  isSparse: false,
                  isPartial: false,
                  indexVersion: 2,
                  direction: 'forward',
                  indexBounds: {
                    partNumber: [ '[MaxKey, MinKey]' ],
                    _id: [ '[MinKey, MaxKey]' ]
                  }
                }
              }
            },

Old Stats:

        executionStats: {
          executionSuccess: true,
          nReturned: 100,
          executionTimeMillis: 6,
          totalKeysExamined: 0,
          totalDocsExamined: 10000,

New Stats:

        executionStats: {
          executionSuccess: true,
          nReturned: 100,
          executionTimeMillis: 16,
          totalKeysExamined: 10000,
          totalDocsExamined: 0,

Interestingly, with a larger data set the execution time is still faster without the sort, using a COLLSCAN, even though the indexed plan is covered.

Adding

{
  $sort: { partNumber: -1 }
},

at the beginning of the aggregation improved things a lot, and now explain("executionStats") returns

        executionStats: {
          executionSuccess: true,
          nReturned: 2388237,
          executionTimeMillis: 15001,
          totalKeysExamined: 2388237,
          totalDocsExamined: 0,

Now the only thing I don’t understand is that it still complains

MongoServerError: Exceeded memory limit for $group, but didn't allow external sort. Pass allowDiskUse:true to opt in.

if I remove { allowDiskUse: true }. The index is only about 37M. How can sorting need more than 100M?

@Jack_Yang1 There are 2,287,646 unique partNumber values in total.

@Bhavya_Bhatt Without changing the aggregation, simply adding hint('partNumber_-1__id_1') does not change anything.

I'm sure there are lots of magic and optimisations behind the scenes, but from a simplistic view, doing a group is much easier when the data is sorted, as you know when one group ends and another begins. Without the $sort, it takes a lot more effort, i.e.

Grouping: "AABACAABA" is harder than:
Grouping: "AAAAAABBC"

You know a group has finished when the letter changes, so you have to keep track of less as you go along. I imagine the server really does apply lots of clever optimisations around sorting.
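As a toy sketch of the difference (plain JavaScript, not what the server actually does):

// Unsorted input: a counter must be kept for every distinct key at once,
// so memory grows with the number of distinct keys.
function groupUnsorted(keys) {
  const counts = new Map();
  for (const k of keys) counts.set(k, (counts.get(k) || 0) + 1);
  return counts;
}

// Sorted input: a group ends as soon as the key changes, so only the
// current key and a single counter live in memory at any time.
function groupSorted(keys) {
  const groups = [];
  let current, count = 0;
  for (const k of keys) {
    if (k !== current) {
      if (count > 0) groups.push({ key: current, count });
      current = k;
      count = 0;
    }
    count++;
  }
  if (count > 0) groups.push({ key: current, count });
  return groups;
}

groupUnsorted([..."AABACAABA"]); // Map(3) { 'A' => 6, 'B' => 2, 'C' => 1 }
groupSorted([..."AAAAAABBC"]);   // [ { key: 'A', count: 6 }, { key: 'B', count: 2 }, { key: 'C', count: 1 } ]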

The memory issue you're still getting is interesting. How large is your collection, the actual data size rather than the size on disk? I'm not sure whether the reported index size is compressed. Remember that an index is a tree structure, so the data is reduced relative to the source documents; when you actually build documents from an index, it gets bigger.

So if you had four documents:
“ABC”
“ABD”
“ABE”
“ACE”

The index only needs to store, AB CDE and ACE, reducing storage from 12 Chars to 8, but if you want to output the data, and it’s a covered index, it will need to re-create the data and so you’re back to 12 in memory storage. You obviously have overhead for the structure…but with lots of data, things tend to go in your favour!

(I'm sure someone can correct me if I've made a terrible error above.)

To sum up my ramblings: while the index may be 37M, the data that's processed may get bigger, i.e. up to the size of your collection if you're processing all of it, although optimisations will be made along the way, so I'm sure it'll be much, much smaller…
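If it helps, you can get the actual numbers from the shell with something like this (assuming the stats output shape hasn't changed in your server version):

const s = db.product_part_number.stats()
print("data size (uncompressed in memory): " + s.size)
print("storage size (compressed on disk):  " + s.storageSize)
print("total index size:                   " + s.totalIndexSize)
print("average document size:              " + s.avgObjSize)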

@Johan_Forssell Appreciate your help, John! You are an expert :+1: :+1: :+1:

@Johan_Forssell I exported the collection to my local machine in text format and the total size is exactly 99,192,344 bytes. Here are a few documents:

  { _id: 2988725, partNumber: '2WFG4' },
  { _id: 3177996, partNumber: '2WFP4' },
  { _id: 3072736, partNumber: '2WFP5' },

So I still do not understand why sorting needs more than 100M of memory. Thanks a lot!

You're probably on the cusp of the in-memory limit there. If you filter it down a bit (with a $limit or something), how much do you need to reduce it by before the issue goes away?

With the $sort added and allowDiskUse enabled, does it complete in enough time for your use case?

At this point I guess it's down to whether you think it's reasonable to take the hit each time you call this query. If the data only updates occasionally, you could pre-calculate the result when making changes, or just do it every so often.

There have been a few questions about grouping large datasets recently, and there gets to be a point with MongoDB where some things just take a little while. It may be quicker on some RDBMS servers, but MongoDB gives you other flexibility!

You could do a $merge update with this, so when you run it, it works out the most used part numbers and updates the matching documents.
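Something like this (untested, and the part_number_counts summary collection name is made up):

db.product_part_number.aggregate([
  { $sort: { partNumber: -1 } },
  { $group: { _id: "$partNumber", count: { $sum: 1 } } },
  // upsert one summary document per partNumber
  { $merge: {
      into: "part_number_counts",
      on: "_id",
      whenMatched: "replace",
      whenNotMatched: "insert"
  } }
], { allowDiskUse: true })

Finding the most used part number is then a cheap query on the small summary collection, especially with an index on count:

db.part_number_counts.find().sort({ count: -1 }).limit(1)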

You could store the data as arrays, with one document per part number… but you need to be wary in case you have a monster product that has a million components, as you'll blow the document size limit (16MB).
I'm sure running a $size and getting the biggest would be pretty quick.
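With a made-up schema along those lines, say { _id: '2WFG4', productIds: [ 2988725, 3177996 ] }, getting the biggest would be something like:

db.parts_by_number.aggregate([
  // the parts_by_number collection and productIds field are assumptions
  { $project: { count: { $size: "$productIds" } } },
  { $sort: { count: -1 } },
  { $limit: 1 }
])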

@Johan_Forssell I added a $match stage as follows:

db.product.explain("executionStats").aggregate([
  {
    $sort: {partNumber: -1},
  },
  {
    $group: {
      _id: "$partNumber",
      count: { $sum: 1 }
    }
  },
  {
    $match: {count: {$gt: 8}},
  },
  {
    $sort: { count: -1 }
  },
  {
    $limit: 1
  }
], { allowDiskUse: true })

but there seems to be no change. By the way, I am now very satisfied with this result. We only need to run this aggregation once or twice a year, manually. Thanks again for your help and have a nice day, John!

You are sorting on the computed value count. No index can be used when sorting on a computed value; indexes are used on stored fields. A $sort on computed values will always be an in-memory sort, so it uses a lot of memory.

A $group stage is blocking. All incoming documents must be processed before the first resulting document is passed to the next stage, so it uses a lot of memory.

One trick I use to reduce the memory used by $group is to $group without an accumulator, then use $lookup to implement the accumulator logic. In your case, it may reduce the amount of memory needed by $group by 50%. What I mean is something like:

{ "$group" : {
    "_id" : "$partNumber"
} }
lookup = { "$lookup" : {
    "from" : "product_part_number" ,
    "as" : "count" ,
    "localField" : "_id" ,
    "foreignField" : "partNumber" ,
    "pipeline" : [
        "$count" :  "_result"
    ]
} }
project = { "$project" : {
    "count" : { "$arrayElemAt" : [ "$count._result" , 0 ] }
} }

So the above three stages, despite being way more complicated, may reduce the memory footprint compared to the single $group. But that does not help with the in-memory sort on a computed field.
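Putting the stages together with your original $sort and $limit, the whole pipeline would look something like this (untested; note that combining localField/foreignField with pipeline in $lookup needs MongoDB 5.0 or newer):

db.product_part_number.aggregate( [
    group ,
    lookup ,
    project ,
    { "$sort" : { "count" : -1 } } ,
    { "$limit" : 1 }
], { allowDiskUse : true } )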

Thanks a lot, @steevej ! I learned something very useful. Have a nice day!
