Suspect that $exists is causing a memory leak

Let’s focus on this query: `{'meta': {'$exists': false}, 'POS': {'$nin': [10438, 10440, 10441]}}`.
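
For reference, here is a minimal pymongo sketch of that query; the connection string and the `test.docs` namespace are assumptions on my side:

```python
from pymongo import MongoClient

# Assumed local deployment and namespace, purely for illustration.
client = MongoClient("mongodb://localhost:27017")
coll = client["test"]["docs"]

query = {"meta": {"$exists": False}, "POS": {"$nin": [10438, 10440, 10441]}}
for doc in coll.find(query):
    print(doc["_id"])
```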

Let’s imagine that you have 1 billion (1B) documents in your collection, exactly one of which has a `meta` field.
80% of the docs have a `POS` value in [10438, 10440, 10441].

Given the distribution of the data, an index on `meta` is either the most useful or the most useless index on the planet, depending on the query. If you are looking for docs where `meta` exists, it is super efficient. If you are looking for docs where `meta` does not exist, it is useless: you basically read ALL the index entries but one, so not having an index at all would probably be faster.
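
You can see both cases with `explain()`. A minimal sketch, assuming the same local `test.docs` namespace as above and a `{meta: 1}` index in place:

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["docs"]
coll.create_index([("meta", 1)])

# $exists: true -> the index narrows down to the single matching entry.
plan = coll.find({"meta": {"$exists": True}}).explain()
print(plan["queryPlanner"]["winningPlan"])

# $exists: false -> nearly every index entry has to be walked, so a plain
# collection scan would likely be just as fast, or faster.
plan = coll.find({"meta": {"$exists": False}}).explain()
print(plan["queryPlanner"]["winningPlan"])
```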

Now with an index on `POS`. You want the values NOT in [10438, 10440, 10441]. The index on `POS` lets you discard 80% of the collection right away. At this point, mongod fetches the remaining docs from disk into the cache (RAM), where it has access to all the fields in the docs, and resolves the rest of the query: “does this doc have a `meta` field? If yes, drop it from the results”. In the end you return 20% * 1B docs = 200M docs, plus or minus one document (the `meta` one).
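
A hedged sketch of inspecting that plan (same assumed namespace as above): expect an IXSCAN over the ~20% of keys outside the `$nin` list feeding a FETCH stage that pulls each doc into RAM and applies the residual `meta` filter. The exact plan shape varies by server version.

```python
from pymongo import MongoClient

coll = MongoClient("mongodb://localhost:27017")["test"]["docs"]
coll.create_index([("POS", 1)])

plan = coll.find(
    {"meta": {"$exists": False}, "POS": {"$nin": [10438, 10440, 10441]}}
).explain()
# Expected: a FETCH stage (carrying the meta filter) over an IXSCAN on {POS: 1}.
print(plan["queryPlanner"]["winningPlan"])
```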

Using a compound index `{POS: 1, meta: 1}` would avoid fetching that one document only to discard it afterwards, but it would roughly double the size of the index, so it is clearly not worth it. If 50% of the docs had a `meta` field, it would be another story; given the extremely uneven distribution we have here (1 in 1B), though, it is not.
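
If you want to check that trade-off on your real data rather than take my estimate, one sketch (same assumed namespace) is to build both indexes and compare their on-disk sizes via `collStats`:

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["test"]
coll = db["docs"]

coll.create_index([("POS", 1)])
coll.create_index([("POS", 1), ("meta", 1)])

# indexSizes reports the size of each index in bytes; the estimate above is
# that the compound index costs roughly 2x {POS: 1} for almost no gain here.
stats = db.command("collStats", "docs")
for name, size in stats["indexSizes"].items():
    print(name, size)
```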

So the correct and optimised index for this query, given that data distribution, is `{POS: 1}`.

Note that in this example we are still returning about 20% of 1B docs. If the collection is about 20GB in total, that is roughly 4GB of docs sent from the DB to the client, and all of those docs are moved from disk to RAM in the process (the working set).
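
To put a number on that working set for your own collection, a back-of-the-envelope check could multiply the matching doc count by `avgObjSize` from `collStats` (same assumed namespace; the full count itself is slow on 200M docs, this is just a one-off sizing check):

```python
from pymongo import MongoClient

db = MongoClient("mongodb://localhost:27017")["test"]
coll = db["docs"]

n_matching = coll.count_documents(
    {"meta": {"$exists": False}, "POS": {"$nin": [10438, 10440, 10441]}}
)
avg_obj_size = db.command("collStats", "docs")["avgObjSize"]  # bytes per doc
print(f"~{n_matching * avg_obj_size / 1e9:.1f} GB returned / pulled into RAM")
```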

Cheers,
Maxime.
