What is the fastest way to filter documents with pagination in a collection with millions of records?

Hello! First, I know similar questions have been asked a lot, but most of the suggested answers are still painfully slow in practice. I have 10,000,000 documents in a collection and I'm having serious speed issues when filtering.

I've been using mongoose-paginate-v2, and while the speed is not absurdly painful, it is slow, taking around 27s to return the filtered, paginated documents. Recently, I learned that aggregate().skip().limit() is lightning fast with unfiltered data:

  const pageSize = +req.query.pagesize;
  const currentPage = +req.query.currentpage;

  let recordCount;

  // countDocuments() replaces the deprecated Query#count()
  ServiceClass.countDocuments().then((count) => {
    recordCount = count;
    // skip whole pages, not just `currentPage` documents (assumes 1-based pages)
    ServiceClass.aggregate()
      .skip(pageSize * (currentPage - 1))
      .limit(pageSize)
      .exec()
      .then((documents) => {
        res.status(200).json({
          message: msgGettingRecordsSuccess,
          serviceClasses: documents,
          count: recordCount,
        });
      })
      .catch((error) => {
        res.status(500).json({ message: msgGettingRecordsError });
      });
  }).catch((error) => {
    res.status(500).json({ message: "Error getting record count" });
  });

This function returns any page within 8 ms, which is extremely fast and amazing, but the issues start as soon as I filter. First, if I just filter like this:

ServiceClass.aggregate([
    { $match: { $or: [{ code: { $regex: regex } }, { description: { $regex: regex } }] } },
    { $skip: pageSize * (currentPage - 1) }, // skip whole pages, as above
    { $limit: pageSize }
])

the function takes around 14s to get the data, an improvement of almost 50% over the mongoose-paginate plugin. However, I need the total number of matching records before the limit, so I can't just append $count to that pipeline (it would count at most pageSize documents); therefore, I need to run another query first:

ServiceClass.aggregate([
    { $match: { $or: [{ code: { $regex: regex } }, { description: { $regex: regex } }] } },
    { $count: "count" }
])

This kills the performance: both queries add up, and each of them ends up touching all matching documents, taking about 34s in total.
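(For what it's worth, a single pipeline with $facet could at least merge the two round trips, running the $match once and branching into the page and the count. This is only a sketch I haven't benchmarked against this collection, it assumes 1-based page numbers, and the expensive $match itself is unchanged:)

ServiceClass.aggregate([
  { $match: { $or: [{ code: { $regex: regex } }, { description: { $regex: regex } }] } },
  {
    $facet: {
      // one page of results
      documents: [{ $skip: pageSize * (currentPage - 1) }, { $limit: pageSize }],
      // total number of matches before the limit
      total: [{ $count: "count" }],
    },
  },
]);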

So the question is, as the title says: how can I get the filtered results, along with the total count of documents before the limit, as fast as possible?

Both fields, code and description, have an index.

Hey @Mauricio_Ramirez,

Slow queries are typically not using any appropriate index. It's important to ensure that suitable indexes exist on the query fields to improve query performance. The MongoDB explain() method can be used to see whether an index is being used for a particular query and to understand its performance.
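For example, in mongosh (the collection name serviceclasses is an assumption, not taken from your post):

// single-field indexes on the two filtered fields
db.serviceclasses.createIndex({ code: 1 });
db.serviceclasses.createIndex({ description: 1 });

// inspect the winning plan and whether it is IXSCAN or COLLSCAN
db.serviceclasses
  .find({ code: { $regex: /^abc/ } })
  .explain("executionStats");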

I also noticed you are using the $or operator as well as $regex in your aggregation queries. This can cause performance issues when not designed carefully. With an unanchored $regex, MongoDB has to scan the entire index on the field (or, without an index, the entire collection) to find the matching documents, which can be slow for large collections. To check this, I used a sample collection of over 500,000 documents. All documents looked like this:

{
  _id: ObjectId("63db8c46002b226488ec38a3"),
  sku: 'abc123',
  description: 'First line\n' +
    'Second line'
}
{
  _id: ObjectId("63db8c46002b226488ec38a4"),
  sku: 'xyz789',
  description: 'Many spaces before     line'
}
{
  _id: ObjectId("63db8c46002b226488ec38a5"),
  sku: 'Abc789',
  description: 'Multiple\n' +
    'line description'
}
{
  _id: ObjectId("63db8c46002b226488ec38a6"),
  sku: 'abc123',
  description: 'Many spaces before     line'
}

Both sku and description have an index.

Then I tried a query similar to the one you provided.

db.sample.aggregate([
  { $match: { $or: [{ sku: { $regex: /789$/ } }, { description: { $regex: /^ABC/ } }] } },
  { $count: "count" }
]).explain('executionStats')

The output of explain("executionStats") looked like this:

executionStats: {
          executionSuccess: true,
          nReturned: 285670,
          executionTimeMillis: 628,
          totalKeysExamined: 500001,
          totalDocsExamined: 0,
  ...
}

Similarly, for this query (case-insensitive regex):

db.sample.aggregate([
  { $match: { $or: [{ sku: { $regex: /789$/ } }, { description: { $regex: /^ABC/i } }] } },
  { $count: "count" }
]).explain('executionStats')

we get:

executionStats: {
          executionSuccess: true,
          nReturned: 285670,
          executionTimeMillis: 1081,
          totalKeysExamined: 1000000,
          totalDocsExamined: 0,
   ...
}   

However, if we use an anchored, case-sensitive regex query, the number of index entries scanned (totalKeysExamined) drops radically:

db.sample.aggregate([
  { $match: { $or: [{ sku: { $regex: /^ABC$/ } }, { description: { $regex: /^Multiple\nline description/ } }] } },
  { $count: "count" }
]).explain('executionStats')

we get:

executionStats: {
          executionSuccess: true,
          nReturned: 125200,
          executionTimeMillis: 280,
          totalKeysExamined: 196700,
          totalDocsExamined: 0,
  ...
}

Even though all of these queries use IXSCAN, as you can see they scan the whole index, which in some cases is not much faster than a COLLSCAN. An unanchored regex is not the best for performance, and using $or only makes it worse: the second query examines 1 million index keys in total, even though the collection only contains half that number. In other words, it scans the whole index twice due to the use of {$or: [ {unanchored regex}, {case-insensitive regex} ] }.
You can read more about the behavior of $or and $regex from the documentation:
$or behavior
$regex and index use.

Additionally, although skip/limit is suitable for some applications, it suffers from two major problems:

  • Poor Performance
    As mentioned in the MongoDB documentation for skip behavior, the server has to scan all the preceding documents before reaching the desired ones, so the query becomes slower as the offset grows; response times increase with the page number. You can read more about this behavior in the documentation: skip behavior
  • Data Inconsistency
    When documents are actively being inserted into the collection, skip/limit pagination can shift the data underneath the query, so the result may contain duplicated documents or miss some documents entirely. A common alternative is range-based pagination; see the sketch after this list.
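For illustration, here is a minimal sketch of range-based ("keyset") pagination on _id, reusing the ServiceClass model from your question inside an async handler. The lastId parameter is hypothetical: it is whatever the client sends back between pages, not something from this thread.

// Instead of skipping over N documents, remember the last _id of the
// previous page and seek past it via the _id index.
const query = lastId ? { _id: { $gt: lastId } } : {}; // lastId is null on page 1

const page = await ServiceClass.find(query)
  .sort({ _id: 1 }) // stable order backed by the _id index
  .limit(pageSize)
  .exec();

// for the next page, the client sends back page[page.length - 1]._id as lastId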

I'm also linking some additional resources that you can refer to; they should certainly help you:

Hope this helps. Please let us know if any part of the answer is unclear.

Regards,
Satyam

