Get the latest record from each group

Given a collection of documents

{
  feat1: "string",
  feat2: "string",
  feat3: "string",
  val1: "number",
  val2: "number",
  val3: "number",
  updated_at: ISODate()
}

With millions of documents that look like these:

[
{feat1:"a",feat2:"A1",feat3:"A1-1",val1:5,val2:7,val3:3,updated_at:ISODate("2021-08-140T00:00:01z")},
{feat1:"b",feat2:"B3",feat3:"B3-77",val1:i15,val2:0,val3:9,updated_at:ISODate("2021-08-14T00:01:30z")},
{feat1:"a",feat2:"A5",feat3:"ABBB",val1:3,val2:17,val3:2,updated_at:ISODate("2021-08-14T00:03:39z")},
 ...
];

Return the latest document for each combination of features, within a time window (updated_at after a certain date and before a later date).

There are a limited number of feat1, feat2, feat3 combinations (e.g. 10,000), but many entries (thousands) for each such combination.

I tried an aggregation query with this pipeline:

 [
    {"$sort": {"updated_at": -1}},
    {"$group": {
      "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
      "feat1": {"$first": "$feat1"},
      "feat2": {"$first": "$feat2"},
      "feat3": {"$first": "$feat3"},
      "val1": {"$first": "$val1"},
      "val2": {"$first": "$val2"},
      "val3": {"$first": "$val3"},
      "updated_at": {"$first": "$updated_at"}
    }},
    {"$project": {
      "_id": 0, "feat1":1, "feat2": 1, "feat3": 1,
      "val1":1, "val2": 1, "val3": 1, "updated_at": 1
    }}
  ]

However, MongoDB either fails, complaining that the data set does not fit in memory, or takes forever (literally) to return a result.

I tried various indices on the data such as

 {"key": {"_updated_at":-1}},
 {"key": {"feat1": 1,"feat2": 1,"feat3": 1}},
 {"key": {"_updated_at":-1, "feat1": 1,"feat2": 1,"feat3": 1}},
 {"key": {"feat1": 1,"feat2": 1,"feat3": 1, "_updated_at":-1}},

to no avail.

I tried to add a limit to the backward search such as

  {"$match": { /* Consider the last 30 hours (108000000) of data */
    "$expr": {"$gt": [
      "$_updated_at",
      {"$subtract": [ {"$max": "$updated_at"}, 108000000]}
    ]}
  }},

which helped somewhat; however, once I accumulated more data it stopped working again.

Any ideas on how to solve this will be greatly appreciated.

Hello @Gilad_Odinak, you can use the allowDiskUse aggregate option to overcome the memory restriction during the sort and group stages. You can try this option and see how it performs and whether you get the desired result.
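
For example (a sketch only, with records as a placeholder collection name), the option is passed as the second argument to aggregate:

 db.records.aggregate(
   [
     {"$sort": {"updated_at": -1}},
     {"$group": {
       "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
       "updated_at": {"$first": "$updated_at"}
     }}
   ],
   {"allowDiskUse": true}  /* lets $sort and $group spill to temporary files
                              instead of failing on the per-stage memory limit */
 )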

You may have success (depending on which version you are using) with the index
{"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}
but you need to sort on all of those fields, not just updated_at. (Also, you had a typo, with an extra _ in front of updated_at.)
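
For instance (a sketch, again with records as a placeholder collection name):

 db.records.createIndex({"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1})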

Now change your pipeline to

[ {"$sort": {"feat1": 1, "feat2":1, "feat3":1, "updated_at": -1}},
{"$group": {
  "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
  "feat1": {"$first": "$feat1"},
  "feat2": {"$first": "$feat2"},
  "feat3": {"$first": "$feat3"}
}} ]

When I try it, I get a covered IXSCAN rather than a collection scan, which should make the query faster and more efficient.
Now, if you need additional values, the query won't be covered, but it should still be faster than not using an index.
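
To check the plan yourself, you can run the pipeline through explain (here records is a placeholder collection name) and look for an IXSCAN in the winning plan:

 db.records.explain("executionStats").aggregate([
   {"$sort": {"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}},
   {"$group": {
     "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
     "feat1": {"$first": "$feat1"},
     "feat2": {"$first": "$feat2"},
     "feat3": {"$first": "$feat3"}
   }}
 ])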

By the way, to add date filtering on updated_at (is it updated_at or _updated_at? You seem to use both!), do not use $expr; just use regular find syntax:

 {$match:{"updated_at:{$gt: <whatever>, $lt: <whatever>}}}

Using $expr with an inequality will guarantee slow performance, as aggregation expression semantics cannot use indexes as fully as regular find semantics can.
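
As a sketch with hypothetical window boundaries (the dates below are placeholders, and records is again a placeholder collection name), the whole pipeline could look like:

 db.records.aggregate([
   {"$match": {"updated_at": {"$gt": ISODate("2021-08-13T00:00:00Z"),
                              "$lt": ISODate("2021-08-14T06:00:00Z")}}},
   {"$sort": {"feat1": 1, "feat2": 1, "feat3": 1, "updated_at": -1}},
   {"$group": {
     "_id": {"feat1": "$feat1", "feat2": "$feat2", "feat3": "$feat3"},
     "val1": {"$first": "$val1"},
     "val2": {"$first": "$val2"},
     "val3": {"$first": "$val3"},
     "updated_at": {"$first": "$updated_at"}
   }}
 ])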

However, I wonder how much of your problem may come from mixing up the name of the updated_at field …

Asya