Thanks, I am actually considering a regular instance might be more cost effective. For now I am happy optimizing and seeing where I can get the cost down to. My first day cost my 15 USD but I am now down to about 0.6-0.8 USD a day, so optimizing and restructuring the data has a had a large impact. Which regardless of serverless or not, will have a large improvement on performance for my users.
While long term I want to get it as cheap as possible (owing to the service having no monetization), for now I am happy that I’ve got it down to a manageable amount.
Below is a sample document from the collection, the collection contains 1.5 million documents. The fields we are interested in are twitch
and timestamp
. Essentially what I am trying to do is retrieve the most recent 20 documents, but only once for each unique twitch
. So if twitch user piercyttv
has 10 documents, in the most recent list, i’d only retrieve the most recent one, and then return 9 other twitch users most recent documents. That way I get the 20 most recent activities, but only one per user.
{
"_id": {
"$oid": "62b97709a564afe29b87d3a0"
},
"twitch": "piercyttv",
"__v": 0,
"accuracy": "",
"difficulty": 4,
"endSongTime": "1.79",
"endType": 2,
"fullCombo": false,
"is360": false,
"is90": false,
"noFail": false,
"oneSaber": false,
"practiceMode": false,
"song": {
"_id": {
"$oid": "5ecc03ffa468ce001df66aac"
},
"easy": true,
"normal": true,
"hard": true,
"expert": true,
"expertPlus": true,
"wipMap": false,
"songName": "Delightful Introduction",
"songSubName": "",
"songAuthorName": "Deathpact",
"levelAuthorName": "xScaramouche",
"hash": "567859C06D0D010987875E2579E08899331F73CC",
"coverUrl": "/cdn/847b/567859c06d0d010987875e2579e08899331f73cc.jpg",
"key": "847b",
"__v": 0
},
"timestamp": {
"$date": "2022-06-27T09:23:21.000Z"
}
}
Expected Output is an array that features the 20 most recent documents (of the above), but only one per twitch
user.
Indexes on the collection are, usage stats are all from 23rd Jun:
- id: 1 – (usage: <1/min)
- song.hash: 1 – (usage: <1/min)
- song.hash: -1 – (usage: <1/min)
- song: -1 – (usage: <1/min)
- song: 1 – (usage: <1/min)
- twitch: 1, timestamp: -1 – (usage: <2.87/min)
- timestamp:-1, twitch: 1 – (usage: <1/min)
- fullCombo: 1, twitch: 1, timestamp: -1 – (usage: <1/min)
A quick discussion on the above indexes but there are some I will likely remove.
Firstly, 2 & 3, the two song hash ones. I don’t think I need two, I think this was actually a typo in my code. So ill remove one.
- and 5. was a temporary index for data migration, I will remove these. I am also not sure how I ended up with both directions.
6 7 and 8, Originally I had just number 6. However, for this aggregation I wondered if doing 7 would help it as I realised the sort was before the grouping.
Regarding 8, this is a tip I learned from a mongo cloud engineer through my work. We were discussing my work project (different from the above project) and he suggested that … if you have a field with a low amount of values, you can create a compound index an use $in with your queries. So, my aim with 8 is to replace 6, and in my queries do: fullCombo: { $in: [true,false] }, ...
. While I have made this change, looking at my index usage it doesnt seem to have taken effect. I wonder if I need to remove index 6 for it to see 8 as the better option now.
Here’s an example query that indexes 6 and 8 were designed for. So although the usage stats show its using 6, I think 8 should be the better option. Whille this isn’t related to th original aggregate. I figured it was needed to explain my indexes.
Example Query for indexes:
return await Activity.find({
fullCombo: { $in: [true, false] }, // this was added so that 7 can replace 5.
twitch: userSearch.toLowerCase(),
})
.sort([["timestamp", "descending"]])
.skip(skip)
.limit(count)
.exec();
I am not sure how to get the average document size. However if I divide the storage size by the number of documents I get 223 bytes per document.