Poor performance of getting most recent distinct values (full index scan)

Hello, I am struggling with finding the most optimal way to get top ten recent unique values.

I have a large collection of documents with such structure:

{
    "timestamp": 1673450840197,
    "request": "new laptop models",
    "user": "user1"
}

I want to get the top 10 recent unique requests using such aggregation pipeline:

{
    $match: {
        user: "user1"
    }
},
{
    $sort: {timestamp: -1}
},
{
    $group: {
        _id: "$request",
       timestamp: { $first: "$timestamp"  }
    }
},
{
    $limit: 10
}

I have index {user: 1, timestamp: -1, request: 1}.

This query is very slow when there are a lot of different requests by user, because the whole index is scanned… I have read about DISTINCT_SCAN optimization, but I don’t know how to apply it to my query…

I tried to use index {user: 1, request: 1, timestamp: -1} and change $sort stage to $sort: {request: 1, timestamp: -1}, then DISTINCT_SCAN is applied, but the data is not sorted like I wanted in my first approach - the most recent request should be first in the result.

Could you please tell me whether I am missing something or maybe I should completely change they way of solving this task?

What exactly do you mean here? If you want most recent request after grouping by request then you just need to add another $sort (by timestamp) stage after $group, no? You’ll be using DISTINCT_SCAN for $group which is expensive, and then $sort+$limit will be coalesced by the optimizer and should be quite fast.

Asya

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.