We are using MongoDB to store analytics records for our application. One collection holds a huge amount of data (~200M documents). A reporting feature runs an aggregation over it that takes an extremely long time (~60 seconds). We tried adding a couple of indexes so that the query becomes a covered projection (PROJECTION_COVERED), but that did not reduce the time. If there is anything we can do to improve the performance, I would really appreciate it.
Schema:

```js
{
  _id: ObjectId,
  agent: String,
  organisation: String,
  session: String,
  timestamp: Date,
  text: String,
  ...
}
```
Indexes:

| Name | Type | Size | Usage | Properties |
|---|---|---|---|---|
| _id | Regular | 89 MB | 4 since Thu Dec | Unique |
| agent_organisation | Regular | 90 MB | 9 since Thu Dec | Compound |
| agent_organisation_session_timestamp | Regular | 180 MB | 14 since Thu Dec | Compound |
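For reference, the compound index the query uses was created roughly like this (a sketch reconstructed from the key pattern in the explain output; the explicit `name` option is an assumption):

```js
// Sketch of the compound index creation (index name/options assumed).
db.analytics.createIndex(
  { agent: 1, organisation: 1, session: 1, timestamp: 1 },
  { name: "agent_organisation_session_timestamp" }
)
```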
Pipeline:

```js
db.getCollection("analytics").aggregate([
  {
    $match: {
      agent: "6194df628be5f9a484979dee",
      organisation: "61387a6d02cce479622bff92"
    }
  },
  {
    $group: {
      _id: { session: "$session" },
      start: { $min: "$timestamp" },
      end: { $max: "$timestamp" },
      count: { $sum: 1 }
    }
  },
  {
    $project: {
      _id: 0,
      session: "$_id.session",
      start: 1,
      end: 1,
      count: 1
    }
  },
  { $sort: { count: -1 } },
  { $limit: 1000 }
]);
Explain Result:

```js
...
...
"queryPlanner" : {
    "plannerVersion" : 1.0,
    "namespace" : "analytics.analytics",
    "indexFilterSet" : false,
    "parsedQuery" : {
        "$and" : [
            { "organisation" : { "$eq" : "61387a6d02cce479622bff92" } },
            { "agent" : { "$eq" : "6194df628be5f9a484979dee" } }
        ]
    },
    "queryHash" : "7D53FE85",
    "planCacheKey" : "E7A83797",
    "winningPlan" : {
        "stage" : "PROJECTION_COVERED",
        "transformBy" : {
            "session" : 1.0,
            "timestamp" : 1.0,
            "_id" : 0.0
        },
        "inputStage" : {
            "stage" : "IXSCAN",
            "keyPattern" : {
                "agent" : 1.0,
                "organisation" : 1.0,
                "session" : 1.0,
                "timestamp" : 1.0
            },
            "indexName" : "agent_organisation_session_timestamp",
            ...
...
...
```
Is there anything we could do to improve the performance of the query?