How to accelerate an aggregate query with $sum?

Hello,

My query is something like that :

> db.mycollection.aggregate ( 
  [ { $group: {_id: "$eventType", applicationCount: {"$sum": 1} } } ], 
{ cursor: { batchSize: 32 }, allowDiskUse: false} )

The collection has above 780k items, so I think it is not huge; I have an index on the “eventType” field - but in spite of that it is very slow. With an explain I have

winningPlan: { stage: 'COLLSCAN', direction: 'forward' },
       executionStats: {
          executionSuccess: true,
          nReturned: 786389,
          executionTimeMillis: 10409,
          totalKeysExamined: 0,
          totalDocsExamined: 786389,

However, if I modify my query adding a hint, such as

db.mycollection.aggregate ( 
 [ { $group: {_id: "$eventType", applicationCount: {"$sum": 1} } } ],
  { cursor: { batchSize: 32 }, allowDiskUse: false, hint: { eventType: 1 } } 
)

explain(“executionStats”) shows an improvement:

            stage: 'PROJECTION_COVERED',
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              keyPattern: { eventType: 1 },
              indexName: 'eventType_1',
              isMultiKey: false,
        executionStats: {
          executionSuccess: true,
          nReturned: 786389,
          executionTimeMillis: 5994,
          totalKeysExamined: 786389,

However, I still find the execution time long! (almost 6 seconds).
My questions are:

  • why is it not using the index directly and needs to specify it?
  • how could the query be improved? For instance on a normal relational DB the counts are performed in parallel, is it possible to perform that here too?

Most probably because you do not have $sort or $match stage.

It depends of your documents. If your entire collection does not fit in RAM then documents must be read from disk.

I would try to $sort on eventType and then $project { _id:0 , eventType:1 }, then the index should be used and luckily it fits in memory and no documents will be read from disk.

1 Like

Either I’m missing something, or it really doesn’t improve the result:

> db.myCollection.aggregate ( [ { $group: {_id: "$eventType", applicationCount: {"$sum": 1} } }, {$project: {_id:1}} ], { cursor: { batchSize: 32 }, allowDiskUse: false } ).explain("executionStats")
{
  stages: [
    {
      '$cursor': {
        query: {},
        fields: { eventType: 1, _id: 0 },
        queryPlanner: {
          plannerVersion: 1,
          namespace: 'mydb.myCollection',
          indexFilterSet: false,
          parsedQuery: {},
          queryHash: '8B3D4AB8',
          planCacheKey: '8B3D4AB8',
          winningPlan: { stage: 'COLLSCAN', direction: 'forward' },
          rejectedPlans: []
        },
        executionStats: {
          executionSuccess: true,
          nReturned: 786389,
          executionTimeMillis: 10144,
          totalKeysExamined: 0,
          totalDocsExamined: 786389,
          executionStages: {
            stage: 'COLLSCAN',
            nReturned: 786389,
            executionTimeMillisEstimate: 1735,
...

You do that before the group stage. The sort will enforce the use of the index. And the projection will not fetch the documents from disk since you only project fields in the index.

1 Like

Yes, it seems now it is using an index -> I am using

.aggregate ( [{ $sort: { eventType: 1 } }, 
{ $project: { eventType: 1, _id: 0 } }, 
{ $group: { _id: "$eventType", count: { $sum: 1 } } }] , 
{ cursor: { batchSize: 32 }, allowDiskUse: false } )

However, explain show that:

winningPlan: {
            stage: 'PROJECTION_COVERED',
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              keyPattern: { eventType: 1, productId: 1 },
              indexName: 'eventType_1_productId_1',
...
executionStats: {
          executionSuccess: true,
          nReturned: 786389,
          executionTimeMillis: 16786,
          totalKeysExamined: 786389,
...

So I would rather say that it is using a suboptimal key, that spans over 2 fields -> getIndexes () shows

  {
    v: 2,
    key: { eventType: 1, productId: 1 },
    name: 'eventType_1_productId_1',
    ns: '...'
  },
...
  {
    v: 2,
    key: { eventType: 1, workspaceName: 1, productId: 1 },
    name: 'eventType_1_workspaceName_1_productId_1',
    ns: '...'
  },
...
  {
    v: 2,
    key: { eventType: 1, 'session.serviceName': 1 },
    name: 'eventType_1_session.serviceName_1',
    ns: '...'
  },
  {
    v: 2,
    key: { eventType: 1 },
    name: 'eventType_1',
    ns: '...'
  },
...

Is that a good explanation?

I do not think it matters. As long as you get an index scan and it is covered.

I think that what is sub-optimal, is to have an index that is a prefix of another one. You are just using more RAM for no real benefit as using the index eventType_1 will not give any benefit, in most case, because any query involving eventType can be served by the other indexes. I wrote in most case because may be, just may be, if you are low in RAM and eventType_1 is the only index that fits in RAM, then all others require disk access. But if you have this scenario then you have more serious problem anyway so you should not have this scenario.

OK, I have copied my collection to a new one and I have added a single index (on my interest field) so now I have

> db.myCollection.getIndexes ()
[
  {
    v: 2,
    key: { _id: 1 },
    name: '_id_',
    ns: '...'
  },
  {
    v: 2,
    key: { eventType: 1 },
    name: 'eventType_1',
    ns: '...'
  }
]

But when I execute my query (db.myCollection.aggregate ( [ { $sortByCount: "$eventType" } ], { allowDiskUse: false}).explain ("executionStats")) it still does a COLLSCAN only !!!

{
  stages: [
    {
      '$cursor': {
        query: {},
        fields: { eventType: 1, _id: 0 },
        queryPlanner: {
          plannerVersion: 1,
          namespace: '...',
          indexFilterSet: false,
          parsedQuery: {},
          queryHash: '8B3D4AB8',
          planCacheKey: '8B3D4AB8',
          winningPlan: { stage: 'COLLSCAN', direction: 'forward' },
          rejectedPlans: []
        },
        executionStats: {
          executionSuccess: true,
          nReturned: 783254,
          executionTimeMillis: 9941,
          totalKeysExamined: 0,
          totalDocsExamined: 783254,
          executionStages: {
            stage: 'COLLSCAN',
            nReturned: 783254,
            executionTimeMillisEstimate: 1181,
            works: 783256,
            advanced: 783254,
            needTime: 1,
            needYield: 0,
            saveState: 6217,
            restoreState: 6217,
            isEOF: 1,
            direction: 'forward',
            docsExamined: 783254
          }
        }
      }
    },
    {
      '$group': { _id: '$eventType', count: { '$sum': { '$const': 1 } } }
    },
    { '$sort': { sortKey: { count: -1 } } }
  ],
  serverInfo: {
    host: 'mongo-mongodb-56c7dffc8c-f5mmj',
    port: 27017,
    version: '4.2.4',
    gitVersion: 'b444815b69ab088a808162bdb4676af2ce00ff2c'
  },
  ok: 1
}

The only way to force index usage is to add the hint; I wouldn’t mind it too much, but even in this case it is very slow (for a 783k items collection), and there are about 143 different values for eventType:

{
  stages: [
    {
      '$cursor': {
        query: {},
        fields: { eventType: 1, _id: 0 },
        queryPlanner: {
          plannerVersion: 1,
          namespace: '...',
          indexFilterSet: false,
          parsedQuery: {},
          queryHash: 'EFB2EDD9',
          planCacheKey: 'EFB2EDD9',
          winningPlan: {
            stage: 'PROJECTION_COVERED',
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              keyPattern: { eventType: 1 },
              indexName: 'eventType_1',
              isMultiKey: false,
              multiKeyPaths: { eventType: [] },
              isUnique: false,
              isSparse: false,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: { eventType: [ '[MinKey, MaxKey]' ] }
            }
          },
          rejectedPlans: []
        },
        executionStats: {
          executionSuccess: true,
          nReturned: 783254,
          executionTimeMillis: 6770,
          totalKeysExamined: 783254,
          totalDocsExamined: 0,
          executionStages: {
            stage: 'PROJECTION_COVERED',
            nReturned: 783254,
            executionTimeMillisEstimate: 729,
            works: 783255,
            advanced: 783254,
            needTime: 0,
            needYield: 0,
            saveState: 6206,
            restoreState: 6206,
            isEOF: 1,
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              nReturned: 783254,
              executionTimeMillisEstimate: 550,
              works: 783255,
              advanced: 783254,
              needTime: 0,
              needYield: 0,
              saveState: 6206,
              restoreState: 6206,
              isEOF: 1,
              keyPattern: { eventType: 1 },
              indexName: 'eventType_1',
              isMultiKey: false,
              multiKeyPaths: { eventType: [] },
              isUnique: false,
              isSparse: false,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: { eventType: [ '[MinKey, MaxKey]' ] },
              keysExamined: 783254,
              seeks: 1,
              dupsTested: 0,
              dupsDropped: 0
            }
          }
        }
      }
    },
    {
      '$group': { _id: '$eventType', count: { '$sum': { '$const': 1 } } }
    },
    { '$sort': { sortKey: { count: -1 } } }
  ],

Why totalKeysExamined is equal with the number of documents instead of being equal with the number of distinct values?

The same thing is is using your suggestion:

> db.mycollection.explain("executionStats").aggregate ( [{ $sort: { eventType: 1 } }, 
{ $project: { eventType: 1, _id: 0 } }, 
{ $group: { _id: "$eventType", count: { $sum: 1 } } }] ,
{ cursor: { batchSize: 320 }, allowDiskUse: false } )
{
  stages: [
    {
      '$cursor': {
        query: {},
        sort: { eventType: 1 },
        fields: { eventType: 1, _id: 0 },
        queryPlanner: {
          plannerVersion: 1,
          namespace: '...',
          indexFilterSet: false,
          parsedQuery: {},
          queryHash: '34AFD5A6',
          planCacheKey: '34AFD5A6',
          winningPlan: {
            stage: 'PROJECTION_COVERED',
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              keyPattern: { eventType: 1 },
              indexName: 'eventType_1',
              isMultiKey: false,
              multiKeyPaths: { eventType: [] },
              isUnique: false,
              isSparse: false,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: { eventType: [ '[MinKey, MaxKey]' ] }
            }
          },
          rejectedPlans: []
        },
        executionStats: {
          executionSuccess: true,
          nReturned: 783254,
          executionTimeMillis: 7728,
          totalKeysExamined: 783254,
          totalDocsExamined: 0,
          executionStages: {
            stage: 'PROJECTION_COVERED',
            nReturned: 783254,
            executionTimeMillisEstimate: 991,
            works: 783255,
            advanced: 783254,
            needTime: 0,
            needYield: 0,
            saveState: 6213,
            restoreState: 6213,
            isEOF: 1,
            transformBy: { eventType: 1, _id: 0 },
            inputStage: {
              stage: 'IXSCAN',
              nReturned: 783254,
              executionTimeMillisEstimate: 868,
              works: 783255,
              advanced: 783254,
              needTime: 0,
              needYield: 0,
              saveState: 6213,
              restoreState: 6213,
              isEOF: 1,
              keyPattern: { eventType: 1 },
              indexName: 'eventType_1',
              isMultiKey: false,
              multiKeyPaths: { eventType: [] },
              isUnique: false,
              isSparse: false,
              isPartial: false,
              indexVersion: 2,
              direction: 'forward',
              indexBounds: { eventType: [ '[MinKey, MaxKey]' ] },
              keysExamined: 783254,
              seeks: 1,
              dupsTested: 0,
              dupsDropped: 0
            }
          }
        }
      }
    },
    {
      '$group': { _id: '$eventType', count: { '$sum': { '$const': 1 } } }
    }
  ],
... 

And in all cases the query duration is something above 6-7 seconds, that I consider too much (and I’am afraid that it will depend of number of items in the collection).