Optimizing an aggregate that gets the most recent 20 documents, but only one per username


I’m looking at the best way to optimize an aggregation. My main aim is to bring costs down on MongoDB serverless, and I think the main way to do that is to reduce the number of RPUs this aggregation consumes.

Essentially what I have is a collection of Activities that users have done. I would like to see the most recent 20 activities, but only one per user. For example, if a user named “John” performed two actions recently, I’d only see the most recent one; the second document would then be from a different user. Currently my collection has 1.5 million documents.

Here’s an example of my current aggregation:

    // This is just to filter the result set down
    { $match: { timestamp: { $gte: yesterday } } },
    // Next I sort by timestamp
    { $sort: { timestamp: -1 } },
    // Then group by the username, grabbing the whole document
    {
      $group: {
        _id: "$username",
        doc: { $first: "$$ROOT" },
      },
    },
    // another sort to bring the most recent to the front
    { $sort: { "doc.timestamp": -1 } },
    // Grab the most recent 20
    { $limit: 20 },
    // Finally replace the root
    { $replaceRoot: { newRoot: "$doc" } },
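To make the intended semantics concrete, here is a plain-JavaScript sketch of what the pipeline is meant to compute (the sample documents are made up purely for illustration; the real pipeline of course runs server-side):

```javascript
// Plain-JS model of "most recent 20 activities, one per username".
const activities = [
  { username: "john", timestamp: 300 },
  { username: "john", timestamp: 200 },
  { username: "mary", timestamp: 250 },
  { username: "anna", timestamp: 100 },
];

function latestPerUser(docs, limit) {
  // Newest first, then take the first occurrence of each username.
  const newestFirst = [...docs].sort((a, b) => b.timestamp - a.timestamp);
  const seen = new Set();
  const out = [];
  for (const doc of newestFirst) {
    if (seen.has(doc.username)) continue; // keep only each user's newest
    seen.add(doc.username);
    out.push(doc);
    if (out.length === limit) break;
  }
  return out;
}

latestPerUser(activities, 20); // john@300, mary@250, anna@100
```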

I am considering that it might be better to have a second collection that only holds the most recent activity per user. I can then upsert into this collection on every new activity. From that point, I could probably do a simple query sorting by timestamp: -1 and limiting to 20.
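A minimal sketch of that upsert, assuming a Node.js driver setup and a hypothetical latestActivities collection name (the helper just builds the updateOne arguments):

```javascript
// Hypothetical helper: builds the updateOne() arguments for maintaining a
// one-document-per-user "latest activity" collection.
function buildLatestUpsert(activity) {
  const { _id, ...fields } = activity; // _id is immutable, so never $set it
  return {
    filter: { twitch: activity.twitch },  // exactly one document per user
    update: { $set: fields },             // overwrite with the newer activity
    options: { upsert: true },            // insert on a user's first activity
  };
}

// Usage with the Node.js driver would look roughly like:
//   const { filter, update, options } = buildLatestUpsert(activity);
//   await db.collection("latestActivities").updateOne(filter, update, options);
// Reading the feed is then a simple indexed query:
//   db.latestActivities.find().sort({ timestamp: -1 }).limit(20)
```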

However, it’s quite hard to work out what’s more performant (and what’s more costly), since we are now trading RPUs against WPUs.

It also seems that explain can’t show WPUs or RPUs, so I am really flying blind when it comes to calculating this stuff.

I ended up creating a new collection and upserting into it based on the username. This way I can index both username and timestamp, and it’s a simple sorted query, limited to 20.

Far simpler than the aggregation, and far less intensive; as my WPU rate is so low (in comparison), I am sure this will be cheaper and more performant than the aggregation.

It would be nice to still get an answer on the best way to handle that aggregation, though. It seems to be an example that MongoDB cannot handle in a very performant way, so I would be interested to see if there’s a solution beyond maintaining another collection.

Hi @Piercy and welcome to the community!!

I believe the approach that you described is a materialised view, and I think it’s a valid approach, especially if you’re willing to trade disk space for time (i.e. you end up with a relatively simpler query and workflow, with disk space as the price).
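One variant of this pattern that avoids application-side upserts is an on-demand materialised view refreshed with the $merge stage (MongoDB 4.2+). A sketch, where the lastRefresh bookmark and the latestActivities collection name are assumptions:

```javascript
// Sketch: fold activity since the last refresh into a per-user view.
// `lastRefresh` would be persisted between runs; hardcoded for illustration.
const lastRefresh = new Date("2022-06-26T00:00:00Z");

const refreshPipeline = [
  { $match: { timestamp: { $gte: lastRefresh } } },
  { $sort: { timestamp: -1 } },
  // _id becomes the username, so $merge keys each user's entry naturally
  { $group: { _id: "$username", doc: { $first: "$$ROOT" } } },
  { $merge: { into: "latestActivities", on: "_id",
              whenMatched: "replace", whenNotMatched: "insert" } },
];

// db.activities.aggregate(refreshPipeline) refreshes the view; the feed is
// then: db.latestActivities.find().sort({ "doc.timestamp": -1 }).limit(20)
```

The trade-off versus upserting on every write is staleness between refreshes, in exchange for keeping the write path untouched.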

Having said that, I’m interested to see if the aggregation you posted earlier can be improved. Could you post some additional information:

  • A sample document with all the fields.
  • The expected output.
  • Current indexes on the collection, if present.
  • The average document size.

Regarding serverless pricing, please see Serverless Instance Costs. As notably mentioned on that page, serverless instances may be more cost effective for applications with low or intermittent traffic.
Thus if your workload is more regular and involves a lot of data, a regular server instance may be more cost effective in the long run. However, this is a generalisation, so you might want to double check against your actual expected workload.

Please let us know if you have further questions.


Thanks, I am actually considering that a regular instance might be more cost effective. For now I am happy optimizing and seeing how far I can get the cost down. My first day cost me 15 USD, but I am now down to about 0.6–0.8 USD a day, so optimizing and restructuring the data has had a large impact. Regardless of serverless or not, that will also be a large improvement in performance for my users.

While long term I want to get it as cheap as possible (owing to the service having no monetization), for now I am happy that I’ve got it down to a manageable amount.

Below is a sample document from the collection; the collection contains 1.5 million documents. The fields we are interested in are twitch and timestamp. Essentially what I am trying to do is retrieve the most recent 20 documents, but only one for each unique twitch. So if twitch user piercyttv has 10 documents in the most recent list, I’d only retrieve the most recent one, and then return the most recent documents of 9 other twitch users. That way I get the 20 most recent activities, but only one per user.

    {
        "_id": { "$oid": "62b97709a564afe29b87d3a0" },
        "twitch": "piercyttv",
        "__v": 0,
        "accuracy": "",
        "difficulty": 4,
        "endSongTime": "1.79",
        "endType": 2,
        "fullCombo": false,
        "is360": false,
        "is90": false,
        "noFail": false,
        "oneSaber": false,
        "practiceMode": false,
        "song": {
            "_id": { "$oid": "5ecc03ffa468ce001df66aac" },
            "easy": true,
            "normal": true,
            "hard": true,
            "expert": true,
            "expertPlus": true,
            "wipMap": false,
            "songName": "Delightful Introduction",
            "songSubName": "",
            "songAuthorName": "Deathpact",
            "levelAuthorName": "xScaramouche",
            "hash": "567859C06D0D010987875E2579E08899331F73CC",
            "coverUrl": "/cdn/847b/567859c06d0d010987875e2579e08899331f73cc.jpg",
            "key": "847b",
            "__v": 0
        },
        "timestamp": { "$date": "2022-06-27T09:23:21.000Z" }
    }

Expected Output is an array that features the 20 most recent documents (of the above), but only one per twitch user.

Indexes on the collection are listed below; usage stats are all from 23rd Jun:

  1. _id: 1 – (usage: <1/min)
  2. song.hash: 1 – (usage: <1/min)
  3. song.hash: -1 – (usage: <1/min)
  4. song: -1 – (usage: <1/min)
  5. song: 1 – (usage: <1/min)
  6. twitch: 1, timestamp: -1 – (usage: <2.87/min)
  7. timestamp:-1, twitch: 1 – (usage: <1/min)
  8. fullCombo: 1, twitch: 1, timestamp: -1 – (usage: <1/min)

A quick discussion on the above indexes, as there are some I will likely remove.

Firstly, 2 and 3, the two song.hash indexes: I don’t think I need both. This was actually a typo in my code, so I’ll remove one.

Indexes 4 and 5 were temporary indexes for data migration; I will remove these. I am also not sure how I ended up with both directions.

As for 6, 7, and 8: originally I had just number 6. However, for this aggregation I wondered if adding 7 would help, as I realised the sort happens before the grouping.

Regarding 8, this is a tip I learned from a MongoDB cloud engineer through my work. We were discussing my work project (different from the above project) and he suggested that … if you have a field with a small number of possible values, you can create a compound index and use $in in your queries. So my aim with 8 is to replace 6, and in my queries do: fullCombo: { $in: [true, false] }, .... While I have made this change, looking at my index usage it doesn’t seem to have taken effect. I wonder if I need to remove index 6 for the planner to see 8 as the better option.

Here’s an example query that indexes 6 and 8 were designed for. Although the usage stats show it’s using 6, I think 8 should be the better option. While this isn’t related to the original aggregation, I figured it was needed to explain my indexes.

Example Query for indexes:

    return await Activity.find({
        fullCombo: { $in: [true, false] }, // this was added so that 7 can replace 5.
        twitch: userSearch.toLowerCase(),
    })
        .sort([["timestamp", "descending"]]);

I am not sure how to get the average document size. However, if I divide the storage size by the number of documents, I get 223 bytes per document.
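For what it’s worth, collStats reports an average directly: in mongosh, db.collection.stats().avgObjSize is the total uncompressed BSON size divided by the document count. Dividing storageSize by count instead tends to underestimate it, because WiredTiger compresses data on disk. A small sketch, with illustrative numbers only (not this collection’s real stats):

```javascript
// avgObjSize is effectively stats.size / stats.count, where `size` is the
// total *uncompressed* BSON size reported by db.collection.stats() in mongosh.
function avgDocSize(stats) {
  return Math.round(stats.size / stats.count);
}

avgDocSize({ size: 334_500_000, count: 1_500_000 }); // 223 bytes
```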

I can’t seem to edit my post, but I noticed a mistake in the code example: the comment on the line with fullCombo should refer to indexes 8 and 6, as below.

    return await Activity.find({
        fullCombo: { $in: [true, false] }, // this was added so that 8 can replace 6.
        twitch: userSearch.toLowerCase(),
    })
        .sort([["timestamp", "descending"]]);