Non-Indexed match query should be before or after project?

I have a compound index with {ts : 1, field1 : 1}

I have an aggregation pipeline that starts as below

{
    "$match":{
        "ts":{
            "$gte":"startDate",
            "$lt":"endDate"
        },
        "field1":"myvalue"
    }
},
{
    "$sort":{
        "ts":1
    }
},
{
    "$project":{
        "ts":true,
        "field1":true,
        "field2":true
    }
}
//other stages

I dont want to read the entire document, so I am using $project to take only required fields.

Before I go to the other stages I want to filter out the docs which have { field2 : { $gt : 0 } }.

field2 is not indexed. So should I put the above condition in the first $match stage or should I have a separate $match after $project.

Will mongodb have to read the entire doc from the disk if I put it in the first stage? Or will mongodb do that anyways and filter out the fields after $project?

Hi @Dushyant_Bangal,

Welcome to MongoDB community!

First if you have a set of “and” expressions that could be placed in first stage you should do that to pass as minimum data to the next stage as possible. Now if this query is often reoccurring why not to have a compound index on all 3 fields? This will speedup the query regardless of doc access or not.

Additionally, the order of the fields in the index matter. We call it Equility Sort and Range order.

So in your case n optimal index will be {field1: 1, ts : 1, field2 : 1}.

Additionally if an index can cover all return data it can be a covered query and speed performance so having all fields can help avoid document scans.

I suggest to read the following material:

Thanks
Pavel

2 Likes

The ts field has highest cardinality for me. That used along with field1 narrows down number of docs to about 1450.

I have such multiple queries, each with different field in place of field2. So I cannot add compound index for all of them.

I am fine with the extra time it might need not being in index.

What I mainly want to know is, where should I add the {$match:{field2:{$gt:0}}}?
Will merging it with first stage cause the entire document to be read for the filtering process or will mongodb just read that single field?
If it reads the entire document from disk for filtering, then I feel it would be better to put the {$match:{field2:{$gt:0}}} after $project stage.

Hi @Dushyant_Bangal,

The optimisation guide states that all possible filtering needs to be done in earliest stage possible.

MongoDB can’t read single fields if they are not indexed. All documents of the first stage are read into memory and passed to the the next stage in memory (excluding sort stage which can use index) .

The cardinality within an index can play a smaller role if the index cannot support the sort. Having ts as the first field will result in a blocking in memory sort.

You can index your documents how you want of course but I would recommend { field1: 1, ts : 1}

Thanks
Pavel

1 Like

Thanks @Pavel_Duchovny. I’ll be going with adding the field in the first stage itself.

Regarding the indexing, I did have { field1: 1, ts : 1} in the beginning, but then faced few issues like this Match regex is not utilizing index correctly - Working with Data - MongoDB Developer Community Forums

Also, with this index, I didnt have to create a separate index on field1, query explain showed mongodb was reusing the existing index with min and max value on ts.

Hi @Dushyant_Bangal,

Sometimes an index will be chosen but its performance will be suboptimal.

Running from min to max is usually undesirable as it means you are doing a full index scan where your purpose was to use index as a filter to reduce amount of docs to be accessed.

A good metrics to look into is numKeysScanned or numDocsScanned vs nreturned.

Thanks
Pavel

1 Like

@Pavel_Duchovny I was trying to keep as less indexes as possible.

My initial indexes were { field1: 1, ts:1 } and { ts:1 }

When I read about cardinality it made a lot of sense to put { ts:1, field1: 1 } instead of { field1: 1, ts:1 }.

My queries are mainly ts range and specific string id of source in field1 OR just ts range. So I figured just { ts:1, field1: 1 } index will do for both.
But now it looks like I should rethink.

I know this is off topic from this thread, but if I go back to { field1: 1, ts:1 } do you think I should add { ts:1 } as well? Or should I first look at the metrics and then decide?

Hi @Dushyant_Bangal

{ field1: 1, ts:1 } and { ts:1 } make sense to cover the 2 described queries.

Its true that we should avoid having redundant indexes and to many indexes to avoid write overhead they cause as well as memory and disk consumption.

However, you should definitely keeps a few indexes that cover your main queries even if they have similar fields.

Thanks
Pavel

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.