Mongodb $in implementation and complexity

Louis_Christopher · April 9, 2021, 12:28pm

Originally posted on StackOverflow, thought I’d try my luck here as well.
Here is our document schema

{
  name: String
}

Here is our query

{
  name: {$in: ["Jack", "Tom"]}
}

I believe even if there isn’t an index on name, the query engine will turn the array in the $in into a hashset and then check for presence as it scans through each record or a COLSCAN. It will never do a naive O(m*n) search, right?

I’m trying to find supporting documentation online but I’ve come up short. I’ve tried searching in the source code but I can’t seem to find the exact section responsible for this either.

If the index exists I believe that it will use it directly instead and be faster. If I’m not wrong I think it will be O(m*log(n)) as it gets the result set in log(n) time from the b-tree for every element in the $in array and returns the union of them all. Though big Oh wise for large m it seems slower than the O(n) hashset approach, its faster in practice as the disk reads are much more expensive.

Is this line of thinking correct when there is an index?

And if there isn’t an index does it do the COLSCAN with a naive search or will it use a hashset to fasten the process.

steevej · April 9, 2021, 4:27pm

If there is no index on the queried field there will be a collection scan which is a slow O(n) if the collection is not in RAM. The O(n^2) complexity is for the most naive sort, not for a search, You might mean O(mn) where m is the size of the $in array and n the size of the collection.

Louis_Christopher · April 9, 2021, 4:37pm

Yes, thanks, I mean the O(mn) is the naive search, you can convert the array into a hashmap/hashset and then check if it’s present in the hashmap or not as it performs the COLSCAN which will be O(n).
Does it do this though when there is no index? Updated my question to clarify.