Poor indexed query performance in a large collection

Hey there,

I have a large collection of ~70m documents. I want to query the collection based on a time period. The field does have an index and according to the explain command the index is used.

db.requests.find({ createdAt: { $gte: new Date("2023-01-01"), $lt: new Date("2023-02-01") } }).explain("executionStats")

The execution stats are:

executionStats: {
  executionSuccess: true,
  nReturned: 4116735,
  executionTimeMillis: 10913,
  totalKeysExamined: 4116735,
  totalDocsExamined: 4116735,
  executionStages: {
    stage: 'FETCH',
    nReturned: 4116735,
    executionTimeMillisEstimate: 1634,
    works: 4116736,
    advanced: 4116735,
    needTime: 0,
    needYield: 0,
    saveState: 4116,
    restoreState: 4116,
    isEOF: 1,
    docsExamined: 4116735,
    alreadyHasObj: 0,
    inputStage: {
      stage: 'IXSCAN',
      nReturned: 4116735,
      executionTimeMillisEstimate: 711,
      works: 4116736,
      advanced: 4116735,
      needTime: 0,
      needYield: 0,
      saveState: 4116,
      restoreState: 4116,
      isEOF: 1,
      keyPattern: {
        createdAt: 1
      },
      indexName: 'createdAt_1',
      isMultiKey: false,
      multiKeyPaths: {
        createdAt: []
      },
      isUnique: false,
      isSparse: false,
      isPartial: false,
      indexVersion: 2,
      direction: 'forward',
      indexBounds: {
        createdAt: [
          '[new Date(1672531200000), new Date(1675209600000))'
        ]
      },
      keysExamined: 4116735,
      seeks: 1,
      dupsTested: 0,
      dupsDropped: 0
    }
  }
}

Any idea why the query takes 10 seconds and if there is anything I can do to improve the performance?

The index seems to perform correctly.

What I suspect is that the mongod server spends most of its time reading the documents from permanent storage. Do you have any disk usage and RAM usage numbers?

What you can do to improve the performance depends on what you want to do with the 4M documents returned by the query. It also depends on the size of the documents. For example, if you want to compute some values (an average, a sum, a max …) of a given field, then you may add that field to the index so the query becomes a covered query.
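For example, a sketch assuming a hypothetical durationMs field you want to sum or average (the field name is just a placeholder):

// extend the index with the field you aggregate on
db.requests.createIndex({ createdAt: 1, durationMs: 1 })

// project only indexed fields and exclude _id, so the query is covered
// and the FETCH stage disappears from the plan
db.requests.find(
  { createdAt: { $gte: new Date("2023-01-01"), $lt: new Date("2023-02-01") } },
  { _id: 0, durationMs: 1 }
)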

Thank you for your reply. I don’t see much of a spike in disk or RAM usage while running the query on my PC.
What I want to do is: I have this large collection of API requests. The documents are small, just some metadata about the request and the user who made the API call.
What I need is: given a large list of users and a timespan, determine for each user whether there is a request in that period.
Currently, I’m looping over each user and using findOne to check if they have requests in the timespan, because doing many (concurrent) smaller database calls seems to be faster.
It works, but I would like to find a solution where I only need to query the database once, without losing performance.
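Roughly, the loop currently looks like this (Node.js driver; userId stands in for whatever the metadata field is actually called):

// userIds, start and end come from the caller; db is a connected Db instance
const results = await Promise.all(
  userIds.map(async (userId) => {
    const doc = await db.collection("requests").findOne(
      { userId, createdAt: { $gte: start, $lt: end } },
      { projection: { _id: 1 } }
    );
    return { userId, active: doc !== null };
  })
);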

Sounds like a job for https://www.mongodb.com/docs/manual/aggregation/
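Something along these lines (assuming a userId field on the request documents):

// one round trip: which of the given users have at least one request in the period?
db.requests.aggregate([
  { $match: {
      userId: { $in: userIds },
      createdAt: { $gte: new Date("2023-01-01"), $lt: new Date("2023-02-01") }
  } },
  { $group: { _id: "$userId" } }   // one result per distinct active user
])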

Sure thing.
But with ~6k users and ~1.4m requests for the users in the given timespan, looping over each user and using findOne is way faster (~14x) than running an aggregation :confused:

We are more interested in the disk and RAM usage on the mongod server rather than the disk and RAM of the PC where you run your query.

If both mongod and the client code are running on the same machine, then it is possible that many requests are faster than one request because there is no network latency. You might experience a different behavior if the server is on a different network.

For some analytic use-cases, it might be better if they run slower in order to leave more bandwidth for other use-cases, unless you can schedule the use-case outside the high-usage periods.

Maybe the pipeline is not optimal.

In your findOne timing, do you consider the time to get the 6k users from the server to the client? If you only time the local iteration over the users and their findOne() calls, then you are not comparing the same thing.

The cache might also play a role. If you run the single aggregation first and the multiple findOne() calls after, then maybe the working set of the use-case is already in memory.

But without access to the exact code of both scenarios it is really hard to make an assessment.

Yes, these results are from my personal computer, running the server and query on the same machine. So I guess you’re right, in production there would be more network overhead, although the systems are hosted in the same data center.
I tried my loop variant again after a fresh restart, and the cache seems to be a huge factor: the first run takes a very long time. I guess I have to choose between a long cold start with snappy performance afterwards, or a more consistent but longer runtime with the aggregation.

The real problem is the data structure, of course, but I’m not quite sure yet how to change it. Since I don’t need the information about every single request for my described use case, but only whether a user was active on a given date, my ideas are:

  1. have a collection and store for each user the days they were active as { user, days[] }, using $addToSet to ensure no duplicates (rough sketch at the end of this post)
  2. have a collection and store for each day which users were active as { day, users[] }, using $addToSet to ensure no duplicates
  3. have a collection and store a user and a day as { user, day }, using a unique index to ensure no duplicates

I know this is diverging a bit from the original question, but any suggestions on which option to pick, or whether there is a better solution?
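For option 1, I’m imagining an upsert roughly like this (collection and field names are just placeholders):

// one document per user, with the array of days they were active on
db.userActivity.updateOne(
  { user: userId },
  { $addToSet: { days: day } },   // day truncated to midnight
  { upsert: true }
)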

About 1. and 2.

a) When you update an array the whole array is written back to disk. So these 2 might be slower than 3.

b) $addToSet is O(n) since the array needs to be scanned to prevent duplicates. It could be better to use $push with an appropriate query so that 0 documents are selected when the entry is already present.
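Something like this, assuming the { user, days[] } shape of option 1 (names are placeholders):

// the filter matches 0 documents when the day is already in the array,
// so $push never adds a duplicate
db.userActivity.updateOne(
  { user: userId, days: { $ne: day } },
  { $push: { days: day } }
)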

About 3.

a) My preferred option, without testing. Only testing can pinpoint the best one. The unique index user:1,day:1 might perform better for updates, while the unique index day:1,user:1 might be better for the query of the original post.
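A rough sketch of option 3, with placeholder names and the day truncated to a date:

// one small document per (user, day); the unique index rejects duplicates
db.userActivityByDay.createIndex({ user: 1, day: 1 }, { unique: true })

// record activity; repeating the same (user, day) just matches the existing document
db.userActivityByDay.updateOne(
  { user: userId, day: day },
  { $setOnInsert: { user: userId, day: day } },
  { upsert: true }
)

// "which of these users were active in the period?" then becomes:
db.userActivityByDay.distinct("user", {
  user: { $in: userIds },
  day: { $gte: new Date("2023-01-01"), $lt: new Date("2023-02-01") }
})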
