Hello,
I have a an aggregation query that takes minutes when querying ~100k documents.
My documents have a structure like this:
{
_id: "someAggregateId-someTaskId",
aggregateId: "someAggregateId",
taskId: "someTaskId",
aggregate: {
id: "someAggregateId",
content: "someContent"
}
},
{
_id: "someAggregateId-anotherTaskId",
aggregateId: "someAggregateId",
taskId: "anotherTaskId",
aggregate: {
id: "someAggregateId",
content: "someChangedContent"
}
},
{
_id: "anotherAggregateId-someTaskId",
aggregateId: "anotherAggregateId",
taskId: "someTaskId",
aggregate: {
id: "anotherAggregateId",
content: "anotherContent"
}
},
{
_id: "anotherAggregateId-oneMoreTaskId",
aggregateId: "anotherAggregateId",
taskId: "oneMoreTaskId",
aggregate: {
id: "anotherAggregateId",
content: "oneMoreContent"
}
}
My goal is to get the latest of each aggregate with a distinct aggregateId where “latest” is defined by a user-supplied list of taskIds. Documents with a taskId not in the supplied list will be ignored (this happens rarely). If the list of taskIds is this:
[ "someTaskId", "anotherTaskId" ]
I expect a result of:
[
aggregate: {
id: "someAggregateId",
content: "someChangedContent"
},
aggregate: {
id: "anotherAggregateId",
content: "anotherContent"
}
]
My aggregation query looks like this:
[
{
'$addFields': {
'_taskWeight': [
{
'_taskId': 'anotherTaskId',
'_score': 2
}, {
'_taskId': 'someTaskId',
'_score': 1
}
]
}
}, {
'$addFields': {
'_taskWeight': {
'$filter': {
'input': '$_taskWeight',
'as': 'aggregation_item',
'cond': {
'$eq': [
'$$aggregation_item._taskId', '$taskId'
]
}
}
}
}
}, {
'$addFields': {
'_taskWeight': {
'$cond': [
{
'$eq': [
{
'$size': '$_taskWeight'
}, 1
]
}, {
'$arrayElemAt': [
'$_taskWeight', 0
]
}, {
'_score': -1
}
]
}
}
}, {
'$addFields': {
'_taskWeight': '$_taskWeight._score'
}
}, {
'$match': {
'_taskWeight': {
'$not': {
'$eq': -1
}
}
}
}, {
'$sort': {
'_taskWeight': 1
}
}, {
'$group': {
'_id': '$aggregateId',
'aggregate': {
'$last': '$aggregate'
}
}
}, {
'$replaceRoot': {
'newRoot': '$aggregate'
}
}
]
Note that the first $addFields stage is generated on the fly with each taskId given a score based on its order. In reality, there are hundreds of taskIds. I have not been able to create an index that is used by this query and it is running very slowly.
I am aware that I could use $switch in the first $addFields to replace all 4 $addFields at the start with only one but sadly, my remote DB doesn’t support $switch yet.
Do you see any way to speed this up significantly, e.g. by using indexes or entirely different queries? I would also be able to change the data structure if needed.
Thanks in advance.