Why compound wildcard index still do fetch?

Luke_Thompson · October 13, 2024, 11:58pm

I should preface this with saying I have no inside knowledge of how all this works, so I’m making a few guesses here.

I think one problem is that a compound index is a b+ tree which includes all the fields, in order and you are omitting the hour, appid and os from the $match.

Another could be the size of the compound index.

The query planner never decides to add for you a match on {hour: {$exists: true}} in order to walk the tree down every branch of hour because:
a) this may be more expensive than a document fetch and applying the filters there
b) in a large compound index with many hundreds of thousands of documents, and potentially millions of index entries this may take up a huge amount of memory

In my experience, indexes will be utilized only when every field in the index (or a partial start set of fields in index order) is included in the match.

I don’t know the full internal workings of the query planner and search code, but I believe that in the case of a very large compound index like the one you are building, MongoDB is trying to manage its memory and figuring out if all or part of the index can be loaded into RAM. It’s also looking at the query and whether the index will serve the query well.

A couple of questions I would ask are:
a) how big is this index?
b) how much memory does the mongod server have?
c) does the match statement require all parts of the index, in the order of the compound fields?

The answers to these might cause different behaviour from the query planner.

When it comes to (c), there is perhaps a benefit to duplicating the country field outside the attributes array to:

simplify the query
reduce the memory required for an index on country
simplify the job the query planner has

So if you duplicated the country field outside the attributes array so the document had:
{
year: 2024,
month: 10,
day: 1,
hour: 0,
country: ‘AU’,
attributes: [
{k: ‘country’, v: ‘AU’},
…
]
}

And added the index:
{
year: 1, month: 1, day: 1, country: 1
}

Then I think you’d get a nice covered query with no fetch, as well as a greater chance the index would be fully loaded into RAM and stay hot in RAM ongoing.

I noticed in the past that the queries involving nested object array queries were:

complex
often resulted in document fetch stages where I thought it would have scanned the index
e.g. if you try to match on two or more attributes, the query planner will only ever look at one

Of course if you have many different queries with different combinations of these fields this might be problematic. In that case I’d consider switching to Atlas Search. It can perform multiple index intersection over simple field indexes - so you could have an index over year, another over month etc.