Hi There
I’ve noticed across many clusters that an hourly or daily extract of production data is always needed for reporting. These extracts are normally collected with range aggregations, matching on the date field and then using a filter like:

{ "$gte": 1729261089279, "$lt": 1729433589279 }
This pattern starts to impact the database heavily across the many different implementations of it. I ran an explain, and even with indexes on all the correct fields, keysExamined is higher than expected, sometimes because of this logic in the explain plan:
stage: 'FETCH',
filter: { 'field2.createdOn': { '$gte': 1729202401000 } }
indexBounds: {
field1: [
"[ObjectId('matched'), ObjectId('matched')]"
],
'field2.createdOn': [ '(1729288799000.0, -inf.0]' ] //($lt value)
}
Is there any way I can get the reporting teams to access the data differently for better performance?
Kindest,
Gareth Furnell
Sounds like you’re hitting performance issues with your range queries, especially where $lt and $gte comparisons are heavily used on fields like createdOn. Even with proper indexing, the FETCH stage can still be a bottleneck because part of the filtering happens after index access. Perhaps try the following to improve performance and to have a productive conversation with your reporting teams:
1. Use Compound Indexes More Effectively
- Compound Index: If you’re filtering by both field1 and field2.createdOn, ensure that you have a compound index covering both fields. The order of fields in the index matters: MongoDB can make better use of the index if the range predicates ($gte/$lt) are on a field that comes after the equality fields (the "equality, sort, range" guideline).
- Example:
db.collection.createIndex({ "field1": 1, "field2.createdOn": 1 })
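For illustration, a query shaped to follow that index (equality on field1 first, then the range on field2.createdOn) should resolve both predicates inside the index bounds; the ObjectId below is only a placeholder for the redacted value in your explain output:

// Sketch only: placeholder ObjectId and generic collection name.
db.collection.find({
  field1: ObjectId("66f0a1b2c3d4e5f6a7b8c9d0"),
  "field2.createdOn": { $gte: 1729261089279, $lt: 1729433589279 }
})
// With the compound index above, the IXSCAN bounds should cover both fields
// and the FETCH stage should no longer need a residual createdOn filter.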
2. Consider Time Bucketing for Large Date Ranges
If your queries span long time ranges, one strategy is to pre-aggregate data into larger time buckets (like daily or hourly aggregates), so your queries hit smaller, precomputed datasets instead of the raw data. You can store these buckets in a separate collection or even in a data warehouse depending on your architecture.
Encourage your reporting teams to pull from these aggregated datasets rather than querying the entire raw data set every time.
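A minimal sketch of that pre-aggregation, assuming field2.createdOn is stored as epoch milliseconds (as in your filter), MongoDB 5.0+ for $dateTrunc, and a hypothetical hourly_rollups target collection:

// Roll raw documents up into hourly buckets; run on a schedule
// (e.g. a cron job or an Atlas scheduled trigger). Names are placeholders.
db.collection.aggregate([
  { $match: { "field2.createdOn": { $gte: 1729261089279, $lt: 1729433589279 } } },
  { $group: {
      _id: {
        field1: "$field1",
        hour: { $dateTrunc: { date: { $toDate: "$field2.createdOn" }, unit: "hour" } }
      },
      count: { $sum: 1 }   // plus whatever sums/averages the reports need
  } },
  { $merge: { into: "hourly_rollups", on: "_id", whenMatched: "replace", whenNotMatched: "insert" } }
])

Reports can then run their hourly or daily range queries against hourly_rollups instead of the raw collection, touching far fewer documents.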
3. Materialized Views for Common Queries
- If a few of these queries are repeatedly performed (e.g., daily summaries), consider creating materialized views or precomputed results for those use cases. You could use MongoDB’s $out or $merge stage in an aggregation pipeline to store frequent query results.
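As a sketch of the $out variant (a full rebuild each run, versus $merge, which updates in place), with daily_report_summary as a hypothetical target name:

// Rebuild a daily summary collection from scratch; $out replaces the
// target collection's contents on every run.
db.collection.aggregate([
  { $group: {
      _id: {
        field1: "$field1",
        day: { $dateTrunc: { date: { $toDate: "$field2.createdOn" }, unit: "day" } }
      },
      count: { $sum: 1 }
  } },
  { $out: "daily_report_summary" }
])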
4. Explain Plan Review and Adjustments
You already noticed that even with indexes, FETCH operations still occur. You might want to review the cardinality of your indexed fields: fields with low cardinality might not provide much benefit from indexing.
Also, perhaps have your reporting team refine their queries to be as specific as possible, reducing the need for wide range scans, e.g. by narrowing their time ranges or adding filters that shrink the result set early in the pipeline.
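One way to quantify that during plan review (a sketch, reusing the placeholder names from earlier):

// Compare how many keys/documents were examined against how many were
// returned; a large gap usually means the index bounds are not selective.
const stats = db.collection.find({
  field1: ObjectId("66f0a1b2c3d4e5f6a7b8c9d0"),   // placeholder value
  "field2.createdOn": { $gte: 1729261089279, $lt: 1729433589279 }
}).explain("executionStats").executionStats

printjson({
  nReturned: stats.nReturned,
  totalKeysExamined: stats.totalKeysExamined,
  totalDocsExamined: stats.totalDocsExamined
})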
5. Shard by Time if Clusters are Sharded
If your clusters are sharded, you can shard the collection by a time-based shard key. This can ensure that queries targeting specific time ranges only hit certain shards, reducing the load across the entire cluster.
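A hedged sketch of what that could look like (namespace and key are placeholders; keeping an equality field such as field1 ahead of the timestamp avoids the insert hot-spotting a purely monotonic time key can cause):

// Hypothetical: range-shard on field1 plus the timestamp so time-range
// queries that also filter on field1 are routed to a subset of shards.
sh.shardCollection("reportingdb.collection", { field1: 1, "field2.createdOn": 1 })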
6. Performance Monitoring Tools
You could also encourage the team to use the MongoDB Atlas Performance Advisor or the Query Profiler to monitor the impact of these queries. If you are not using MongoDB Atlas, the built-in database profiler gives you similar visibility on self-managed deployments.
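For self-managed deployments, a minimal sketch of enabling the database profiler to capture these slow range queries (the 100 ms threshold is just an example):

// Record operations slower than 100 ms into the system.profile collection.
db.setProfilingLevel(1, { slowms: 100 })

// Later, inspect the slowest captured operations.
db.system.profile.find().sort({ millis: -1 }).limit(5)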