Cache the count for query? Quicker way to get paginated results for imbalanced collection

Using Motor in Python, I have a collection with 10M documents and 4 fields: sender, receiver, amount, and height.

I’m trying to fetch documents where either sender or receiver matches a predefined string. As I need to use pagination, I need the total count as well.

Current pipeline:

from pymongo import DESCENDING

pipeline = [
    {
        "$match": {
            "$or": [
                {"sender": account_string},
                {"receiver": account_string},
            ]
        }
    },
    {"$sort": {"height": DESCENDING}},
    {"$project": {"_id": 1}},
    {
        "$facet": {
            "metadata": [{"$count": "total"}],
            "data": [{"$skip": skip}, {"$limit": 20}],
        }
    },
    {
        "$project": {
            "data": 1,
            "total": {"$arrayElemAt": ["$metadata.total", 0]},
        }
    },
]

My collection has indices on sender and receiver.

I execute this pipeline using:

result = (
    await db.my_collection
    .aggregate(pipeline)
    .to_list(20)
)

This executes at very acceptable speeds for almost all values of account_string. However, there is one account_string that matches 9M of the 10M documents. For this account_string, execution time nears 20 sec. What's worse, every time I paginate (increase/decrease the skip amount), I have to wait another 20 sec.

I believe the slowness is because it has to scan nearly the entire collection to calculate the count?

My question: is there a quicker way to achieve paginated results?

  1. The match in this case is almost a full scan of an index tree, so it definitely takes long.

  2. You sort on height at the second stage. Do you have an index on height?

  3. Using limit + skip is not an ideal way to paginate in this case. Every time you use a different skip and limit, the index tree still has to be traversed from the very beginning until that many elements have been skipped, so when skip is a big number you pay a heavy performance cost. What I would suggest is to create a compound index according to the ESR (Equality, Sort, Range) rule, and then paginate with a range on height plus a limit — since you sort descending, something like {height: {$lt: last_recorded_height}}. Now height is bounded, and entries outside the bound are skipped directly without ever being visited.

  4. If you only need an estimate, you can try the estimated count in the driver APIs. Otherwise, the overhead of an accurate count is unavoidable. (But why do you need the total count for pagination? Something like the number of pages on screen?)
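Point 3 could look roughly like this in Motor — a sketch, where `build_page_query` and `fetch_page` are hypothetical names, and it assumes compound indexes like `{sender: 1, height: -1}` / `{receiver: 1, height: -1}` and unique height values (ties would need a tiebreaker field):

```python
# Sketch of range-based ("keyset") pagination for a descending sort.

def build_page_query(account_string, last_height=None):
    """Filter for one page: the account match, plus an upper bound on
    height so already-seen documents are skipped via the index."""
    query = {
        "$or": [
            {"sender": account_string},
            {"receiver": account_string},
        ]
    }
    if last_height is not None:
        # Continue strictly below the last height of the previous page.
        query["height"] = {"$lt": last_height}
    return query

async def fetch_page(db, account_string, last_height=None, page_size=20):
    cursor = (
        db.my_collection
        .find(build_page_query(account_string, last_height))
        .sort("height", -1)  # -1 == pymongo.DESCENDING
        .limit(page_size)
    )
    return await cursor.to_list(page_size)
```

Each "next page" request passes the height of the last document shown, so the index seek starts right where the previous page ended instead of re-walking skipped entries.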

Thank you.

  1. Agreed, it’s 90% of the table.
  2. Indeed, I do have an index on height as well, sorry.
  3. That’s a smart approach with the last recorded height, will definitely try that. Will need to look into the ESR rule, though.
  4. I have 4 pagination buttons (go to begin, 1 page back, 1 page forward, go to end). To accurately determine how many pages there are and to go to the end, I need the total count. Happy to hear other suggestions? From the documentation it appears that estimated count only counts documents in a collection, without any filter? If so, that doesn’t seem to be a feasible solution?

If that’s only for the end page, then sort by height in reverse order and query with a limit.

BTW, the total number of results is subject to change unless you use a snapshot query.
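A sketch of that "jump to end" idea (function names are hypothetical; note it returns a full page from the end, not a partial final page):

```python
# Sketch: fetch the last page by reversing the sort, then flipping the
# results in memory so they display in the usual descending order.

def display_order(docs):
    """Reverse a page fetched in ascending order back to descending."""
    return list(reversed(docs))

async def fetch_last_page(db, account_string, page_size=20):
    cursor = (
        db.my_collection
        .find({"$or": [
            {"sender": account_string},
            {"receiver": account_string},
        ]})
        .sort("height", 1)  # ascending, i.e. reversed
        .limit(page_size)
    )
    return display_order(await cursor.to_list(page_size))
```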

I may be fussy, but if I have 85 results with pagination showing 20 items per page, I want the last button to show only the final 5. This reverse query won’t get me there, I think?

Also, I just really like the “showing items 21-40 of 85” information regarding pagination. For this I would need the total count.

In that case, you will have to count the total number, but as I said, the number is subject to change unless you use snapshot data.
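Coming back to the caching idea in the title: since the count for the heavy account changes slowly anyway, one option is to run the expensive `count_documents` once per account and cache it for a while. A sketch, where `cached_total`, the in-process dict, and the TTL are all assumptions:

```python
import time

# Sketch: cache the accurate per-account count for a few minutes so
# page flips reuse it instead of re-counting 9M documents each time.
_count_cache: dict[str, tuple[int, float]] = {}
CACHE_TTL = 300.0  # seconds; tune to how stale a count you can tolerate

async def cached_total(db, account_string):
    now = time.monotonic()
    hit = _count_cache.get(account_string)
    if hit is not None and now - hit[1] < CACHE_TTL:
        return hit[0]  # fresh enough, skip the expensive count
    total = await db.my_collection.count_documents({
        "$or": [
            {"sender": account_string},
            {"receiver": account_string},
        ]
    })
    _count_cache[account_string] = (total, now)
    return total
```

The cached total can then feed both the "showing items 21-40 of 85" label and the "go to end" skip calculation, accepting that it may lag reality by up to the TTL.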