Get the latest record from each group

Given a collection with millions of documents that look like these:


Return the latest document for each combination of features, within a time window (updated_at after a certain date and before a later date).

There are a limited number of feat1,feat2,feat3 combinations (e.g. 10000), but many entries for each such combination (1000s).
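For concreteness, the desired result can be expressed in plain Python over a tiny in-memory sample (the documents and timestamps below are invented for illustration; timestamps stand in for epoch milliseconds):

```python
# Latest document per (feat1, feat2, feat3), within a time window.
records = [
    {"feat1": "a", "feat2": "x", "feat3": 1, "val1": 10, "updated_at": 100},
    {"feat1": "a", "feat2": "x", "feat3": 1, "val1": 20, "updated_at": 300},
    {"feat1": "b", "feat2": "y", "feat3": 2, "val1": 30, "updated_at": 200},
    {"feat1": "a", "feat2": "x", "feat3": 1, "val1": 40, "updated_at": 50},
]

def latest_per_group(records, after, before):
    """Keep only the newest document per feature combination inside (after, before)."""
    latest = {}
    for doc in records:
        ts = doc["updated_at"]
        if not (after < ts < before):
            continue  # outside the time window
        key = (doc["feat1"], doc["feat2"], doc["feat3"])
        if key not in latest or ts > latest[key]["updated_at"]:
            latest[key] = doc
    return list(latest.values())

result = latest_per_group(records, after=0, before=1000)
# One document per group: the val1=20 entry wins for ("a", "x", 1).
```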

I tried an aggregate query with this pipeline

    [
      {"$sort": {"updated_at": -1}},
      {"$group": {
        "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
        "feat1": {"$first": "$feat1"},
        "feat2": {"$first": "$feat2"},
        "feat3": {"$first": "$feat3"},
        "val1": {"$first": "$val1"},
        "val2": {"$first": "$val2"},
        "val3": {"$first": "$val3"},
        "updated_at": {"$first": "$updated_at"}
      }},
      {"$project": {
        "_id": 0, "feat1": 1, "feat2": 1, "feat3": 1,
        "val1": 1, "val2": 1, "val3": 1, "updated_at": 1
      }}
    ]

However, MongoDB fails, complaining that the data set does not fit in memory, or takes forever (literally) to return a result.

I tried various indices on the data such as

    {"key": {"_updated_at": -1}},
    {"key": {"feat1": 1, "feat2": 1, "feat3": 1}},
    {"key": {"_updated_at": -1, "feat1": 1, "feat2": 1, "feat3": 1}},
    {"key": {"feat1": 1, "feat2": 1, "feat3": 1, "_updated_at": -1}},

to no avail.

I tried to add a limit to the backward search such as

    {"$match": { /* Consider the last 30 hours (108000000 ms) of data */
      "$expr": {"$gt": [
        "$updated_at",
        {"$subtract": [ {"$max": "$updated_at"}, 108000000]}
      ]}
    }}

which helped somewhat; however, once I accumulated more data it stopped working again.

Any ideas how to solve this will be greatly appreciated.

Hello @Gilad_Odinak, you can use the allowDiskUse aggregate option to overcome the memory restriction during the sort and group stages. Try this option and see whether it gives you the desired result.

You may have success (depending on which version you are using) with the index

    {"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}

But you need to sort on all the fields, not just updated_at. (Also, you had a typo: an extra _ in front of updated_at.)

Now change your pipeline to

    [
      {"$sort": {"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}},
      {"$group": {
        "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
        "feat1": {"$first": "$feat1"},
        "feat2": {"$first": "$feat2"},
        "feat3": {"$first": "$feat3"}
      }}
    ]

When I try it, I get a covered IXSCAN rather than a collection scan, which should make the query faster and more efficient.
If you need additional values, the query won’t be covered, but it should still be faster than not using an index.
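For reference, the suggested pipeline with the value fields from the original question added back can be written as PyMongo-style dicts. This is a sketch, not tested against a live server; the sort order mirrors the suggested index:

```python
# Sort on every grouping key plus updated_at, in the same order as the
# suggested index {feat1: 1, feat2: 1, feat3: 1, updated_at: -1}.
pipeline = [
    {"$sort": {"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}},
    {"$group": {
        "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
        # $first picks each group's newest document because of the sort above.
        "val1": {"$first": "$val1"},
        "val2": {"$first": "$val2"},
        "val3": {"$first": "$val3"},
        "updated_at": {"$first": "$updated_at"},
    }},
]

# With a live connection this would run as, for example:
# results = db.collection.aggregate(pipeline, allowDiskUse=True)
```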

By the way, to add date filtering on updated_at (is it updated_at or _updated_at? You seem to use both!), do not use $expr; just use regular find syntax:

    {"$match": {"updated_at": {"$gt": <whatever>, "$lt": <whatever>}}}

Using $expr with an inequality will guarantee slow performance, as aggregation expression semantics cannot use indexes as fully as regular find semantics can.
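As a sketch of that plain range filter (the dates below are hypothetical placeholders):

```python
from datetime import datetime, timezone

# Hypothetical window bounds; in the real collection updated_at may be a
# BSON date or an epoch-millisecond integer -- use the same type for the bounds.
after = datetime(2024, 1, 1, tzinfo=timezone.utc)
before = datetime(2024, 2, 1, tzinfo=timezone.utc)

# Plain find-style match: the planner can turn this into index bounds,
# unlike an equivalent $expr comparison.
match_stage = {"$match": {"updated_at": {"$gt": after, "$lt": before}}}

def in_window(doc, stage=match_stage):
    """Evaluate the range filter the way $match would, for one document."""
    bounds = stage["$match"]["updated_at"]
    return bounds["$gt"] < doc["updated_at"] < bounds["$lt"]
```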

However, I wonder how much of your problem comes from mixing up the name of the updated_at field …