Poor performance of getting most recent distinct values (full index scan)

Yaroslav_Savluk · January 11, 2023, 3:44pm

Hello, I am struggling with finding the most optimal way to get top ten recent unique values.

I have a large collection of documents with such structure:

{
    "timestamp": 1673450840197,
    "request": "new laptop models",
    "user": "user1"
}

I want to get the top 10 recent unique requests using such aggregation pipeline:

{
    $match: {
        user: "user1"
    }
},
{
    $sort: {timestamp: -1}
},
{
    $group: {
        _id: "$request",
       timestamp: { $first: "$timestamp"  }
    }
},
{
    $limit: 10
}

I have index {user: 1, timestamp: -1, request: 1}.

This query is very slow when there are a lot of different requests by user, because the whole index is scanned… I have read about DISTINCT_SCAN optimization, but I don’t know how to apply it to my query…

I tried to use index {user: 1, request: 1, timestamp: -1} and change $sort stage to $sort: {request: 1, timestamp: -1}, then DISTINCT_SCAN is applied, but the data is not sorted like I wanted in my first approach - the most recent request should be first in the result.

Could you please tell me whether I am missing something or maybe I should completely change they way of solving this task?

Asya_Kamsky · January 11, 2023, 5:47pm

What exactly do you mean here? If you want most recent request after grouping by request then you just need to add another $sort (by timestamp) stage after $group, no? You’ll be using DISTINCT_SCAN for $group which is expensive, and then $sort+$limit will be coalesced by the optimizer and should be quite fast.

Asya

system · January 16, 2023, 5:48pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.