Hi,
I want to fetch all records in a table in batches. I can use skip() and limit(), but performance-wise skip() would be slow. Does MongoDB have an equivalent of BigQuery's paginationToken concept? - Read data with BigQuery API using pagination | Google Cloud
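For context, the batching I have in mind looks roughly like this (a minimal sketch with pymongo; the connection string, database/collection names, batch size, and process() are placeholders):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["mydb"]["mycoll"]                    # placeholder db/collection names

BATCH_SIZE = 1000
offset = 0
while True:
    # Each pass re-runs the query and skips everything already read,
    # so skip() gets more expensive as offset grows.
    batch = list(coll.find().sort("_id", 1).skip(offset).limit(BATCH_SIZE))
    if not batch:
        break
    process(batch)  # placeholder for whatever handles each batch
    offset += BATCH_SIZE
```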
Regarding the first question on BigQuery, that's a good one. I don't have official docs about this, but one blog post says the pageToken approach is much faster than skip + limit: https://medium.com/bloggingtimes/the-page-token-method-in-bigquery-an-efficient-approach-to-pagination-7d8f2443c69d . I don't know how it is implemented internally. It's worth noting that the pageToken can be reused in a subsequent call to BigQuery even after disconnecting (how long the pageToken stays valid, I haven't checked yet).
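For reference, with the BigQuery Python client the pattern that blog post describes looks roughly like this (a sketch only, not something I've benchmarked; the table id and page size are placeholders):

```python
from google.cloud import bigquery

client = bigquery.Client()
table = client.get_table("my-project.my_dataset.my_table")  # placeholder table id

# First page: no token yet.
first_page = client.list_rows(table, max_results=1000)
rows = list(first_page)
token = first_page.next_page_token  # opaque string; can be saved and reused later

# Subsequent page: resume from the stored token instead of skipping rows.
if token:
    next_page = client.list_rows(table, max_results=1000, page_token=token)
    more_rows = list(next_page)
```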
Regarding the second question on MongoDB paging, my problem is not with batchSize. I'm concerned about the fact that it will skip the offset number of records each time. Wouldn't linearly skipping the offset number of records get progressively slower as you fetch the latter parts of a large collection?
My use case requires me to disconnect after fetching each page and then reconnect. Hence a page-token kind of approach is what I was looking for.
@Pramod_Biligiri Ah I see. I don't know for certain whether limit + skip in MongoDB does a linear skip over the offset, but that sounds unlikely. Considering that the blog post describes pageTokens as:
The page token method is a cursor-based mechanism for pagination in BigQuery that allows you to efficiently retrieve a specific page of results from a query. The page token is generated after the first query and passed in the subsequent queries to determine the starting point for the next page of results.
And that all MongoDB queries are cursor-based, it seems the two would be roughly equivalent. Possibly older MongoDB versions had to walk through each part of a collection manually, though, so I'm not sure which version you're running. I think the only way to know for sure is to do some load testing with something like https://locust.io/ or an equivalent tool and see what the performance looks like for you. I've personally never encountered performance problems with skip + limit in my usage.
Edit:
Just went through the docs and I wasn't able to find anything other than skip for moving a cursor forward. So please do let me know how your investigation goes!
It does look like skip() and limit() are inefficient, and one way around this is to iterate by sorting on one of the indexed fields explicitly and filtering on the last value seen, as in the sketch below. Check out the section titled "Avoiding large skips" on this page - 4. Querying - MongoDB: The Definitive Guide, 2nd Edition [Book]
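In pymongo that would look roughly like this (a minimal sketch; connection string, database/collection names, page size, and handle() are placeholders, not from the book):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
coll = client["mydb"]["mycoll"]                    # placeholder db/collection names

PAGE_SIZE = 1000
last_id = None  # acts like a "page token": can be persisted between connections
while True:
    # Filter on the last _id seen instead of skipping; _id is always indexed,
    # so the query jumps straight to the starting point of the next page.
    query = {"_id": {"$gt": last_id}} if last_id is not None else {}
    page = list(coll.find(query).sort("_id", 1).limit(PAGE_SIZE))
    if not page:
        break
    handle(page)               # placeholder for per-page processing
    last_id = page[-1]["_id"]
```

Since the last _id is just a value, it can be stored and reused to resume after disconnecting and reconnecting, which matches the use case described above.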
Skip and limit will cause a performance impact with large values; I have seen this on our own on-premises clusters.
The official docs have some explanation of how skip and limit are handled in a sharded cluster.
I think that's because the index is built as a B+ tree. With a large skip value, the scan still has to go from the root and walk entries one by one until the skip count is reached; the tree doesn't store counts of entries per subtree.
But with the ObjectId filter, index entries smaller than that value can be skipped entirely, resulting in O(tree height) runtime to find the starting point.
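If you want to check this yourself, comparing explain output for the two query shapes should show it (a rough sketch; the collection name, page size, and skip value are made up, and plan fields can vary by server version):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder connection string
db = client["mydb"]                                # placeholder database name
coll = db["mycoll"]                                # placeholder collection name

# Some previously seen _id to resume from (here just the smallest one, for the demo).
last_id = coll.find_one(sort=[("_id", 1)])["_id"]

# Offset-based page deep into the collection.
skip_plan = db.command({
    "explain": {"find": "mycoll", "sort": {"_id": 1}, "skip": 100000, "limit": 1000},
    "verbosity": "executionStats",
})

# Range-based page starting just after last_id.
range_plan = db.command({
    "explain": {
        "find": "mycoll",
        "filter": {"_id": {"$gt": last_id}},
        "sort": {"_id": 1},
        "limit": 1000,
    },
    "verbosity": "executionStats",
})

# totalKeysExamined should be roughly skip + limit for the first plan
# and roughly limit for the second.
print(skip_plan["executionStats"]["totalKeysExamined"])
print(range_plan["executionStats"]["totalKeysExamined"])
```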