Slow aggregation query using $group

MelvinFrohike_N_A · August 4, 2022, 3:05pm

Hello,

I have a an aggregation query that takes minutes when querying ~100k documents.

My documents have a structure like this:

{
  _id: "someAggregateId-someTaskId",
  aggregateId: "someAggregateId",
  taskId: "someTaskId",
  aggregate: { 
    id: "someAggregateId",
    content: "someContent"
 }
},
{
  _id: "someAggregateId-anotherTaskId",
  aggregateId: "someAggregateId",
  taskId: "anotherTaskId",
  aggregate: {
    id: "someAggregateId",
    content: "someChangedContent"
 }
},
{
  _id: "anotherAggregateId-someTaskId",
  aggregateId: "anotherAggregateId",
  taskId: "someTaskId",
  aggregate: { 
   id: "anotherAggregateId",
   content: "anotherContent"
 }
},
{
  _id: "anotherAggregateId-oneMoreTaskId",
  aggregateId: "anotherAggregateId",
  taskId: "oneMoreTaskId",
  aggregate: { 
   id: "anotherAggregateId",
   content: "oneMoreContent"
 }
}

My goal is to get the latest of each aggregate with a distinct aggregateId where “latest” is defined by a user-supplied list of taskIds. Documents with a taskId not in the supplied list will be ignored (this happens rarely). If the list of taskIds is this:

[ "someTaskId", "anotherTaskId" ]

I expect a result of:

[
  aggregate: {
    id: "someAggregateId",
    content: "someChangedContent"
  },
  aggregate: { 
   id: "anotherAggregateId",
   content: "anotherContent"
 }
]

My aggregation query looks like this:

[
    {
        '$addFields': {
            '_taskWeight': [
                {
                    '_taskId': 'anotherTaskId', 
                    '_score': 2
                }, {
                    '_taskId': 'someTaskId', 
                    '_score': 1
                }
            ]
        }
    }, {
        '$addFields': {
            '_taskWeight': {
                '$filter': {
                    'input': '$_taskWeight', 
                    'as': 'aggregation_item', 
                    'cond': {
                        '$eq': [
                            '$$aggregation_item._taskId', '$taskId'
                        ]
                    }
                }
            }
        }
    }, {
        '$addFields': {
            '_taskWeight': {
                '$cond': [
                    {
                        '$eq': [
                            {
                                '$size': '$_taskWeight'
                            }, 1
                        ]
                    }, {
                        '$arrayElemAt': [
                            '$_taskWeight', 0
                        ]
                    }, {
                        '_score': -1
                    }
                ]
            }
        }
    }, {
        '$addFields': {
            '_taskWeight': '$_taskWeight._score'
        }
    }, {
        '$match': {
            '_taskWeight': {
                '$not': {
                    '$eq': -1
                }
            }
        }
    }, {
        '$sort': {
            '_taskWeight': 1
        }
    }, {
        '$group': {
            '_id': '$aggregateId', 
            'aggregate': {
                '$last': '$aggregate'
            }
        }
    }, {
        '$replaceRoot': {
            'newRoot': '$aggregate'
        }
    }
]

Note that the first $addFields stage is generated on the fly with each taskId given a score based on its order. In reality, there are hundreds of taskIds. I have not been able to create an index that is used by this query and it is running very slowly.

I am aware that I could use $switch in the first $addFields to replace all 4 $addFields at the start with only one but sadly, my remote DB doesn’t support $switch yet.

Do you see any way to speed this up significantly, e.g. by using indexes or entirely different queries? I would also be able to change the data structure if needed.

Thanks in advance.

MelvinFrohike_N_A · August 4, 2022, 3:08pm

It seems I have messed up the formatting of my post. Is there any way to edit it?