Clustered collections dont use index

Hello everyone!

Testing clustered collections, I’ve cheked that the find queries using $in operator dont scan the clustered index, scan the entire collection. Reading the documentation seems correct:

Faster queries on clustered collections without needing a secondary index, such as queries with range scans and equality comparisons on the clustered index key

Someone knows if this improvement is coming in the next releases? I think that this functionalty would be fantastic!

greetings!!

Hi @juan_lluesma welcome to the community!

I believe you have read the main documentation page of Clustered Collections, and I think the answer to your questions are somewhat mentioned in that page.

This is actually alluded to in the paragraph you have quoted: clustered collection is best when you have equality or range queries. A $in query is not really either (more like multiple equality instead of a single equality), so it doesn’t benefit from a clustered index.

This behaviour makes sense when you consider the following point from the doc page:

A non-clustered collection stores the _id index separately from the documents.

This is the “normal” _id index, and:

A clustered collection stores the index and the documents together in _id value order.

So a clustered collection stores the _id index together with the document.

Thus when you have a query on _id that is a single equality, it’s relatively straightforward for a clustered index to zoom into the target document’s location. It doesn’t need to scan a separate _id index then fetch the related document, it can scan and fetch at the same time since the key and the document are stored together. Similar situation will be observed for a range query. However the keyword here is “scan”. Since the index and the document are stored together, scanning is pretty much needed.

This is mentioned in the doc page as well:

When a query uses a clustered index, it will perform a bounded collection scan.

What “bounded collection scan” means is that it’s basically a scan, but it’s typically not the whole collection since we know where the relevant _id are.

However, note that you can create secondary indexes that sort of make a clustered collection behave somewhat like a normal collection, so if you have a freeform query using other fields, presumably you can use a clustered collection as well.

I think in short a clustered collection is beneficial if your queries are linear for the _id field. If you need a more freeform queries, then a regular collection may offer more flexibility. Which collection type is best depends on your use case; you might want to test clustered vs. non-clustered collections under a test workload and see which one comes out on top.

Hope that make sense!

Best regards
Kevin

2 Likes

I have a query on a clustered index like this:

{ _id: { $in: [6660057786, 6557989767] } }

Why would this not result in two equality checks, but a full collection scan?

Hi @hk86 welcome to the community

This is because a $in query is an equality query, as I have mentioned in the above post:

Hope this clears things out.

Best regards
Kevin