Best practice on using limit() with $in over indexed field

Jean-Paul_Saade · August 19, 2022, 10:15am

Hello community!
We’re having a debate on which approach is more performant regarding querying over a unique indexed field using $in operator.
Which is more performant?

db.collection.find({_id: {$in: [...arrayOf1000000Items]}})

or

db.collection.find({_id: {$in: [...arrayOf1000000Items]}}).limit(1000000)

The explain plan displays a new LIMIT stage (besides IXSCAN FETCH), but performance difference was very nullable as we tried the comparison over a smaller set of documents.

Another question would be:
What would happen if we introduce sorting to the query? Would limit be recommended here?

steevej · August 19, 2022, 4:19pm

The limit will not make a difference because _id has a unique index so you cannot have more elements than what you have in your arrayOf1000000Items. If he limit would be smaller that the array then it is another story.

If you sort on _id, it should not make a difference because the field is indexed. If you sort on another field that is not indexed you will experiment much more latency as the sort will be in memory.

Passing an array of 1_000_000 elements is probably very slow by it self, I would encourage you to find another way of specifying your query. While it is not advised to do multiple round-trip to the database to implement a use-case, may be this one is an exception. One query to find the 1_000_000’th id and then query with $lt and $gt to get the 1_000_000 documents.

The question is what criteria is used to determine the documents to put in the array at first. May be this criteria can be the query.

The other question is what ever you do with your 1_000_000 documents, can you do it with the aggregation framework instead?

Jean-Paul_Saade · August 20, 2022, 3:20pm

For sure 1_000_000 elements in an array is very bad in terms of performance, what I wanted to know is that using limit(1_000_000) along this filter would cause in slower performance than without it.
My given example might be irrealistic, usually our filter arrays would contain a maximum of 20 elements.

So long story short, since adding limit() would add an extra step “LIMIT” to the EXPLAIN plan, it nonetheless doesn’t affect performance, but better not use it, right?

Doug_Duncan · August 20, 2022, 3:50pm

In this case, you don’t need to add a limit because the _id field is guaranteed to be unique. This means you will never have more results than you have in the $in array of elements.

Jean-Paul_Saade · August 22, 2022, 1:40pm

Very much appreciated, thanks everyone!

system · August 27, 2022, 1:40pm

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.