Can I do paging over all records in a collection without using skip and limit?

Hi,
I want to fetch all records in a collection in batches. I can use skip() and limit(), but performance-wise skip() would be slow. Does MongoDB have the equivalent of BigQuery’s paginationToken concept? - Read data with BigQuery API using pagination  |  Google Cloud

Alternatively, I came across this keyset pagination approach, which relies on ObjectId being monotonically increasing - MongoDB Pagination, Fast & Consistent | by Mosius | The Startup | Medium. Is that assumption about ObjectId likely to remain valid in the future?
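
For reference, my understanding of the keyset idea from that article boils down to something like this in pymongo (connection string, database and collection names are just placeholders I made up):

```python
from pymongo import MongoClient, ASCENDING

# Placeholder connection details and collection, purely for illustration.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["events"]

PAGE_SIZE = 1000
last_id = None  # _id of the last document seen on the previous page

while True:
    # Resume strictly after the last _id instead of skipping N documents.
    query = {} if last_id is None else {"_id": {"$gt": last_id}}
    page = list(coll.find(query).sort("_id", ASCENDING).limit(PAGE_SIZE))
    if not page:
        break
    # ... process the page here ...
    last_id = page[-1]["_id"]
```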

1 Like

One question and one idea:

  • Why is the BigQuery API’s paginationToken any faster than skip/limit? It seems quite similar, just reading the doc you shared
  • Have you tried using a cursor directly and setting the batchSize to control how much you get back on each iteration of the cursor? https://www.mongodb.com/docs/manual/reference/method/cursor.batchSize/. This might give you more control over how many documents you get per round trip (a quick pymongo sketch below)
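
For example, with pymongo you can set it per cursor; a minimal sketch (the collection name and batch size are just examples):

```python
from pymongo import MongoClient

# Example connection and collection, just for illustration.
client = MongoClient("mongodb://localhost:27017")
coll = client["mydb"]["events"]

# A single find()/cursor; the driver pulls documents from the server in
# batches of 500 as the cursor is iterated, instead of one big result.
for doc in coll.find({}).batch_size(500):
    ...  # process doc
```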

Regarding the first question on BigQuery, that’s a good one. I don’t have official docs about this, but one blog post says that the pageToken approach is much faster than skip+limit: https://medium.com/bloggingtimes/the-page-token-method-in-bigquery-an-efficient-approach-to-pagination-7d8f2443c69d . I don’t know how this is implemented internally. But it’s important to note that this pageToken can be reused in a subsequent call to BigQuery even after disconnecting (how long is the pageToken valid for? No idea, I haven’t checked yet).

Regarding the second question on MongoDB paging, my problem is not with batchSize. I’m concerned about the fact that it skips the offset number of records each time. Linearly skipping over the offset number of records might get progressively slower as you fetch the latter parts of a large collection?

My use case requires me to disconnect after fetching each page and then reconnect. Hence a page-token kind of approach is what I was looking for.
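
Roughly what I’m picturing is a token that is just the last _id serialized to a string, so it survives the disconnect; a rough sketch assuming pymongo (all names are placeholders):

```python
from bson import ObjectId
from pymongo import MongoClient, ASCENDING

PAGE_SIZE = 1000

def fetch_page(token=None):
    """Connect, fetch one page, return (docs, next_token), then disconnect."""
    client = MongoClient("mongodb://localhost:27017")  # placeholder URI
    try:
        coll = client["mydb"]["events"]  # placeholder names
        query = {} if token is None else {"_id": {"$gt": ObjectId(token)}}
        docs = list(coll.find(query).sort("_id", ASCENDING).limit(PAGE_SIZE))
        next_token = str(docs[-1]["_id"]) if docs else None
        return docs, next_token
    finally:
        client.close()

# Only the token string is kept between runs; each call opens a fresh connection.
docs, token = fetch_page()
while token:
    docs, token = fetch_page(token)
```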

1 Like

@Pramod_Biligiri Ah I see. I don’t know for sure whether limit + skip in Mongo does a linear skip over the offset, but that sounds unlikely. Considering that the blog post describes pageTokens as:

The page token method is a cursor-based mechanism for pagination in BigQuery that allows you to efficiently retrieve a specific page of results from a query. The page token is generated after the first query and passed in the subsequent queries to determine the starting point for the next page of results.

And since all MongoDB queries are cursor-based, it seems that they would be roughly equivalent. I think possibly older MongoDB versions had to manually walk through each part of a collection, though, so I’m unsure which version you’re running. I think the only way to know for sure is to do some load testing with something like https://locust.io/ or an equivalent tool and see what the performance looks like for you. I’ve personally never encountered performance problems with skip + limit in my usage.

Edit:
Just went through the docs and I wasn’t able to find anything other than skip for moving a cursor forward. So please do let me know how your investigation goes!

It does look like skip() and limit() are inefficient, and one way around this is to iterate by sorting on one of the indexed fields explicitly. Check out the section titled “Avoiding large skips” on this page - 4. Querying - MongoDB: The Definitive Guide, 2nd Edition [Book]
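
One wrinkle: if the field you sort on isn’t unique (e.g. a date), you need _id as a tiebreaker so you don’t drop or repeat documents at page boundaries. A rough sketch of what I have in mind, assuming a compound index on (created_at, _id) (all names invented):

```python
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
coll = client["mydb"]["events"]                    # placeholder names
PAGE_SIZE = 1000

last_date, last_id = None, None
while True:
    if last_date is None:
        query = {}
    else:
        # Resume strictly after the last (created_at, _id) pair already seen.
        query = {"$or": [
            {"created_at": {"$gt": last_date}},
            {"created_at": last_date, "_id": {"$gt": last_id}},
        ]}
    page = list(
        coll.find(query)
            .sort([("created_at", ASCENDING), ("_id", ASCENDING)])
            .limit(PAGE_SIZE)
    )
    if not page:
        break
    last_date, last_id = page[-1]["created_at"], page[-1]["_id"]
```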

1 Like

Hi @Pramod_Biligiri

Just to add my 2c: I found this community blog post, API Paging Built The Right Way, with the corresponding GitHub repo implementing the idea (GitHub - mixmaxhq/mongo-cursor-pagination: Cursor-based pagination for Mongo), which may be of interest to you.

Note that those are community resources and not official MongoDB code/product, so use at your own risk :slight_smile:

Best regards
Kevin

Skip and limit will cause a performance impact with large values, and I have seen this on our own on-premises clusters.

The official docs have some explanation of how skip and limit are handled in a sharded cluster: roughly, the skip can’t be pushed down to individual shards, so each shard returns up to skip + limit documents and the mongos applies the skip while merging.

I think that’s because the index is built as a B+ tree. With a large skip value, the scan still has to go from the root and walk entries one by one until the skip count is reached. There’s no count of subtree entries stored in the tree.

But with the ObjectId filter, index entries smaller than that value can be skipped outright, resulting in O(tree height) runtime.
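
One way to see this on your own data is to compare totalKeysExamined from explain for a large skip versus an _id filter. A rough sketch with made-up names and numbers:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["mydb"]                                # placeholder names

def keys_examined(find_cmd):
    plan = db.command("explain", find_cmd, verbosity="executionStats")
    return plan["executionStats"]["totalKeysExamined"]

# Large skip: the _id index scan still walks every skipped key.
skip_cmd = {"find": "events", "sort": {"_id": 1}, "skip": 500_000, "limit": 100}

# Keyset filter: the scan seeks straight to the first key after last_id.
last_id = next(db["events"].find().sort("_id", 1).skip(500_000).limit(1))["_id"]
keyset_cmd = {"find": "events", "filter": {"_id": {"$gt": last_id}},
              "sort": {"_id": 1}, "limit": 100}

print("skip  :", keys_examined(skip_cmd))    # roughly skip + limit keys
print("keyset:", keys_examined(keyset_cmd))  # roughly limit keys
```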