Hi @juan_lluesma welcome to the community!
I believe you have read the main documentation page of Clustered Collections, and I think the answer to your questions are somewhat mentioned in that page.
This is actually alluded to in the paragraph you have quoted: clustered collection is best when you have equality or range queries. A $in
query is not really either (more like multiple equality instead of a single equality), so it doesn’t benefit from a clustered index.
This behaviour makes sense when you consider the following point from the doc page:
A non-clustered collection stores the _id
index separately from the documents.
This is the “normal” _id
index, and:
A clustered collection stores the index and the documents together in _id
value order.
So a clustered collection stores the _id
index together with the document.
Thus when you have a query on _id
that is a single equality, it’s relatively straightforward for a clustered index to zoom into the target document’s location. It doesn’t need to scan a separate _id
index then fetch the related document, it can scan and fetch at the same time since the key and the document are stored together. Similar situation will be observed for a range query. However the keyword here is “scan”. Since the index and the document are stored together, scanning is pretty much needed.
This is mentioned in the doc page as well:
When a query uses a clustered index, it will perform a bounded collection scan.
What “bounded collection scan” means is that it’s basically a scan, but it’s typically not the whole collection since we know where the relevant _id
are.
However, note that you can create secondary indexes that sort of make a clustered collection behave somewhat like a normal collection, so if you have a freeform query using other fields, presumably you can use a clustered collection as well.
I think in short a clustered collection is beneficial if your queries are linear for the _id
field. If you need a more freeform queries, then a regular collection may offer more flexibility. Which collection type is best depends on your use case; you might want to test clustered vs. non-clustered collections under a test workload and see which one comes out on top.
Hope that make sense!
Best regards
Kevin