FindCursor size not available

Fredrik_Fager1 · March 2, 2023, 9:06am

I am reading large amounts of data from a collection using a FindCursor in batches of 10k, and everything works really well. My problem is that I don’t know how many documents there will be in total, and the cursor size (https://www.mongodb.com/docs/manual/reference/method/cursor.size/) feature does not seem to be available in the Node.js driver. I can do two separate queries, one of them to count, but then there is no guarantee that the count is correct.

I would expect the cursor size to be known, which would solve my issue.

Tarun_Gaur · March 10, 2023, 7:09am

Hello @Fredrik_Fager1 ,

Welcome to The MongoDB Community Forums!

A cursor fetches documents in batches to reduce both memory consumption and network bandwidth usage. Cursors are highly configurable and offer multiple interaction paradigms for different use cases.

Querying and counting are typically two different operations. A query returns the documents that match the query criteria, whereas a count tells you how many documents match them.

If you want to count the documents via the cursor, you can execute the cursor, add all returned documents to an array, and take its length. Below is a small snippet:

  const Result = await cursor.toArray();
  console.log("Count: " + Result.length);

Why do you believe that you will not get an exact count?

Regards,
Tarun

Fredrik_Fager1 · March 13, 2023, 7:36am

Welcome to The MongoDB Community Forums!

Thank you!

const Result = await cursor.toArray();

Why do you believe that you will not get an exact count?

That does give the exact amount, but toArray fetches all documents into an array at once. This is what I want to avoid. I want to read really large amounts of documents without using huge amounts of memory while parsing them. My test data set results in 4 GiB of memory if I load it all, whereas when reading it in batches of 10k, memory usage is ~200 MiB.

I’m reading using:

for await (const document of cursor) { /* process one document */ }

and batchSize set to 10k.
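
If only the final total is needed, one option is to keep a running counter while iterating, so the documents never have to be held in memory together. A minimal sketch, assuming `cursor` is any async iterable such as the driver's FindCursor; `streamAndCount` and `handleDocument` are hypothetical names used only for illustration:

```javascript
// Iterate a cursor one document at a time and keep a running count.
// `cursor` is assumed to be an async iterable (for example a FindCursor).
// `handleDocument` is a hypothetical per-document callback.
async function streamAndCount(cursor, handleDocument) {
  let count = 0;
  for await (const document of cursor) {
    await handleDocument(document);
    count += 1;
  }
  // The total is only known once the cursor is exhausted.
  return count;
}
```

The total is exact for what was actually read, but it only becomes available at the end, so it does not help when the number is needed before reading starts.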

If I make two separate queries, the data might change between the first query (the count) and the second (fetching the data), so the two are not guaranteed to cover the same number of documents.

Given that there is a cursor.size method available in MongoDB, I’d assume this is the reason for its existence. The cursor probably knows how many documents matched.

Tarun_Gaur · March 20, 2023, 11:53pm

cursor.size() is a mongosh method that simply calls cursor.count() in the Node driver (mongosh uses the Node driver). See:

github.com/mongodb-js/mongosh/blob/v1.8.0/packages/shell-api/src/cursor.ts#L186-L189

  @returnsPromise
  async size(): Promise<number> {
    return this._cursor.count();
  }

However, cursor.count() is deprecated in recent versions of the Node driver.
Running it in current mongosh shows:

Warning: cursor.count is deprecated and will be removed in the next major version, please use collection.estimatedDocumentCount or collection.countDocuments instead

As the warning message in mongosh suggests, you can use collection.countDocuments() instead.

The countDocuments() implementation (please refer to this source) in the Node driver is basically an aggregation of:

[ { $match: query }, { $group: { _id: 1, n: { $sum: 1 } } } ]
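
To make the equivalence concrete, here is a minimal sketch that runs the same pipeline shape by hand. It assumes `collection` is a Node driver Collection; `countWithPipeline` is a hypothetical helper name, not part of the driver, and the driver's real implementation may differ in details such as skip/limit handling.

```javascript
// Count matching documents with the pipeline shape quoted above.
// `collection` is assumed to be a Node driver Collection;
// `countWithPipeline` is a hypothetical helper, not a driver API.
async function countWithPipeline(collection, query) {
  const pipeline = [
    { $match: query },
    { $group: { _id: 1, n: { $sum: 1 } } },
  ];
  const results = await collection.aggregate(pipeline).toArray();
  // $group emits no document when nothing matches, so fall back to 0.
  return results.length > 0 ? results[0].n : 0;
}
```

Either way, the server has to walk every matching document (or index entry) to produce the number, which is why the count is a separate piece of work from opening the find cursor.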

In conclusion, the cursor.size() method actually executes the query. This brings us back to our earlier point: db.collection.find(...) and db.collection.find(...).count() are two separate commands, which means you can either count the documents matching a query, or return those documents and count them afterwards.

If the count is important, you can try using MongoDB transactions, where you can do the whole operation in a single transaction.

Lastly, can you clarify why you need the count when you are processing the documents one at a time? Why does the count matter in this case?
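
A rough sketch of the transaction idea, under several assumptions: `client` is a connected MongoClient on a replica set or sharded cluster (transactions are not available on a standalone server), `collection` belongs to that client, and `countAndRead` and `handleDocument` are hypothetical names. Note that transactions have a lifetime limit (60 seconds by default), so this may not suit a long-running read.

```javascript
// Count and read inside one transaction so both see the same snapshot.
// `client` is assumed to be a connected MongoClient and `collection` one of
// its collections; `countAndRead` and `handleDocument` are hypothetical names.
async function countAndRead(client, collection, filter, handleDocument) {
  const session = client.startSession();
  const result = { total: 0, read: 0 };
  try {
    await session.withTransaction(
      async () => {
        // The driver may retry this callback, so reset the counters each time.
        result.total = await collection.countDocuments(filter, { session });
        result.read = 0;
        const cursor = collection.find(filter, { session, batchSize: 10000 });
        for await (const document of cursor) {
          await handleDocument(document);
          result.read += 1;
        }
      },
      { readConcern: { level: "snapshot" } }
    );
  } finally {
    await session.endSession();
  }
  return result;
}
```

Because the callback can be retried on transient errors, `handleDocument` should be safe to call more than once for the same document.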

Fredrik_Fager1 · March 21, 2023, 9:18am

A bit of background: I’m working on a multi-tenant service, where we introduce a feature to export all data according to a set of permissions and rules. When exporting the data there are about 15 collections from which data is exported and then compressed into a gzip stream, document by document.

The simple task at hand is to provide progress information during this export operation. In the end I have about 15 queries which result in the final export, and I would like to know how many documents are matched by each query when the operation begins. I.e., I don’t want to count all documents in the collections, as that is not the number exported.
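
For reference, the two-query option looks roughly like the sketch below: count the matches for one query up front, then stream and report progress against that number. It assumes `collection` is a Node driver Collection; `exportWithProgress`, `writeDocument` and `onProgress` are hypothetical names, and the initial count is treated as an estimate because the data can change between the two queries.

```javascript
// Count first, then stream, reporting progress against the initial count.
// `collection` is assumed to be a Node driver Collection; `writeDocument`
// and `onProgress` are hypothetical callbacks supplied by the caller.
async function exportWithProgress(collection, filter, writeDocument, onProgress) {
  const expected = await collection.countDocuments(filter);
  let exported = 0;
  const cursor = collection.find(filter, { batchSize: 10000 });
  for await (const document of cursor) {
    await writeDocument(document);
    exported += 1;
    // The data may change between the two queries, so cap the fraction at 1.
    const fraction = expected > 0 ? Math.min(exported / expected, 1) : 1;
    onProgress({ exported, expected, fraction });
  }
  return { exported, expected };
}
```

With about 15 queries, the same function could be called once per collection and the `expected` values summed for an overall progress figure.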

That said, I can solve the issue, although I don’t like the options available, because I would assume that the cursor in MongoDB must know how many documents matched when it is returned. I don’t want to have the DB do unnecessary work, and I think this is something that could be used when solving several other issues, which is why I have spent some time trying to figure this out. Apparently there is something that I don’t understand about how MongoDB works internally with the query cursor.