Index is slower when using $nin

Sagume_Kishin · June 25, 2021, 2:47pm

I have a collection with ~67k documents each has a tags fields which is an array of integers (both int and long), I want to query docs excluding a certain tag say 370, so naturally I did {‘tags’: {’$nin’: [370]}}
About 75% of documents contains tag 370.
I am using pymongo.
When index is NOT created on tags field, the query limiting to top 20 documents took ~200ms to complete with the first stage doing COLLSCAN took 107ms (67k doc examined, 17k returned) and second LIMIT stage took 0ms.
When index is created on tags field, the query limiting to top 20 documents took ~4s to complete with the first stage doing FETCH took 1962ms (67k doc examined, 17k returned, 3m keys examined) and second LIMIT stage took 3ms.
I know queries like $not and $nin are not selective so they are slow and creating index may not help, but I didn’t expect creating index actually degrades performance.
Currently I have to leave it unindexed since most of user’s queries use $nin.
Why in this case index hurts performance? Is there a plan to fix this?

Edit:
MongoDB server version: 4.4.6

Takis · June 26, 2021, 1:55am

Hello and welcome @Sagume_Kishin

Array indexes are big because they store all members of the array in the index.
Looks like your index has 3 million keys.

If low selectivity and we have an index also,we will pay the cost of searching in the
index(here a big index) + the cost of fetching those documents.

That can be bigger from the COLLSCAN

Here its possible that its not only the index size,but how the information is used.
Because even if a key in the index is not equal with 370,this doesn’t mean that the document
fits our needs.We need to know that none of the keys that points to the document is 370,to say
that we have a match.So maybe the combination of $nin with your array field makes it much worse.

Maybe someone else can help more,on how the index is searched with the $nin and why it takes so long.

Sagume_Kishin · June 26, 2021, 10:04pm

I replaced find, sort and limit in pymongo with an aggregation pipeline, now it takes 1.8s to complete a query down from 4s if index is created, however it is still longer than no index, which takes 180ms. I wonder if I can tell it to use COLLSCAN instead of FETCH. Or can the query planner be smart enough to figure out that using COLLSACN is actually faster than FETCH.

Edit:
I solve this by using {hint: {’$natural’: 1}} to force COLLSACN when only $nin query is involved.