I want to find, for each groupId, the latest record for a given main/sub category pair, and then return only those records whose latest status is "failure".
Example data (real documents are larger, with many more fields):
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "eggs",
"status": "success",
"dateField": "2023-01-20T00:00:00Z"
},
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "bacon",
"status": "success",
"dateField": "2023-01-20T00:00:00Z"
},
{
"groupId": "1",
"mainCategory": "breakfast",
"subCategory": "eggs",
"status": "failure",
"dateField": "2023-01-01T00:00:00Z"
}
With the above data, if I search for { mainCategory: "breakfast", subCategory: "eggs", status: "failure" },
I'd expect to get no results back, because the latest result (after sorting on the date) is a success.
If I did the same but searched for a status of 'success', I'd expect to get one result back, as the latest breakfast + eggs result in that group is a success.
This is why, in the pipeline below, the status match has to come after the other stages: I don't want to filter anything out by status until I know whether it was the latest record or not.
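That expected behaviour can be sketched as a small in-memory simulation in plain JavaScript (this is only an illustration of the intended semantics, not how the server executes the pipeline; `latestWithStatus` is a hypothetical helper name):

```javascript
// In-memory sketch: keep the latest record per (groupId, subCategory)
// for the requested main/sub pair, then filter by status last.
const docs = [
  { groupId: "1", mainCategory: "breakfast", subCategory: "eggs",  status: "success", dateField: "2023-01-20T00:00:00Z" },
  { groupId: "1", mainCategory: "breakfast", subCategory: "bacon", status: "success", dateField: "2023-01-20T00:00:00Z" },
  { groupId: "1", mainCategory: "breakfast", subCategory: "eggs",  status: "failure", dateField: "2023-01-01T00:00:00Z" },
];

function latestWithStatus(docs, mainCategory, subCategory, status) {
  const latest = new Map();
  for (const d of docs) {
    if (d.mainCategory !== mainCategory || d.subCategory !== subCategory) continue;
    const key = `${d.groupId}|${d.subCategory}`;
    const cur = latest.get(key);
    // ISO-8601 date strings compare correctly as plain strings.
    if (!cur || d.dateField > cur.dateField) latest.set(key, d);
  }
  // Status is applied only after the latest record per group is known.
  return [...latest.values()].filter((d) => d.status === status);
}

console.log(latestWithStatus(docs, "breakfast", "eggs", "failure").length); // 0
console.log(latestWithStatus(docs, "breakfast", "eggs", "success").length); // 1
```

With the example data this returns nothing for "failure" and one record for "success", matching the expectations above.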
My index
{
"mainCategory" : 1,
"subCategory" : 1,
"dateField" : -1
}
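For reference, a createIndex call matching that spec would look something like this ("myIndex" is assumed from the name reported in the execution stats below):

```javascript
// Assumed creation command for the compound index described above.
db.getCollection("my-collection").createIndex(
  { mainCategory: 1, subCategory: 1, dateField: -1 },
  { name: "myIndex" }
);
```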
Example aggregate query
db.getCollection("my-collection").explain("executionStats").aggregate([
{
"$match": {
"mainCategory": "breakfast",
"subCategory": "eggs",
"dateField": { $gte: "2023-01-01T00:00:00Z" }
}
},
{
"$sort": {
"dateField": -1.0
}
},
{
"$group": {
"_id": {
"groupId": "$groupId",
"subCategory": "$subCategory"
},
"latestRecord": {
"$first": "$$ROOT"
}
}
},
{
"$match": {
"latestRecord.status": "failure"
}
},
{
"$limit": 100.0
},
{
"$replaceRoot": {
"newRoot": "$latestRecord"
}
}
])
Execution Stats
{
"executionStats" : {
"executionSuccess" : true,
"executionTimeMillis" : "663941.077",
"planningTimeMillis" : "0.530",
"executionStages" : {
"stage" : "SUBSCAN",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663921.912",
"inputStage" : {
"stage" : "LIMIT_SKIP",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.564",
"inputStage" : {
"stage" : "SUBSCAN",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.535",
"inputStage" : {
"stage" : "SORT_AGGREGATE",
"nReturned" : "100",
"executionTimeMillisEstimate" : "663914.492",
"inputStage" : {
"stage" : "SORT",
"nReturned" : "8547",
"executionTimeMillisEstimate" : "663655.029",
"inputStage" : {
"stage" : "IXSCAN",
"nReturned" : "461546",
"executionTimeMillisEstimate" : "660287.707",
"indexName" : "myIndex", // It is correctly using my index here
"direction" : "forward"
}
}
}
}
}
}
},
"serverInfo" : {
"host" : "my-database",
"port" : 27031.0,
"version" : "4.0.0" // DocumentDB, if there is no way around this issue, could consider switching to Atlas
},
I can’t figure out whether my index is just bad, or whether there is a better way to write the query.
If I keep running the same query, it does get very fast. The issue is that the initial queries take so long that they time out through AWS API Gateway.