Why do two fields in the same collection with the same data type but different cardinality have different index sizes?

Monika_Shah · December 26, 2023, 5:25pm

Documents in a collection have two mandatory fields, F1 and F2, of the same data type but different cardinality (number of distinct values). A default B-tree index is created on each of F1 and F2. Index sizes are read from db.collection.stats().indexSizes.
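
For reference, the sizes being compared can be read roughly as follows. This is a minimal sketch, assuming it runs in mongosh where `db` is the current database; the helper name `indexSizeReport` and the collection name `coll` are hypothetical placeholders, not names from this thread.

```javascript
// Hypothetical helper: list index sizes from largest to smallest.
// `db` is the mongosh database handle; `collName` is a placeholder name.
function indexSizeReport(db, collName) {
  // stats().indexSizes maps each index name to its size in bytes.
  const sizes = db.getCollection(collName).stats().indexSizes;
  return Object.entries(sizes)
    .map(([name, bytes]) => ({ name, bytes }))
    .sort((a, b) => b.bytes - a.bytes);
}

// Possible use in mongosh, after creating the two indexes:
//   db.coll.createIndex({ F1: 1 })
//   db.coll.createIndex({ F2: 1 })
//   indexSizeReport(db, "coll")
```

Sorting by size makes it easy to see at a glance which of the two single-field indexes is the larger one.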

The question is: why do two fields in the same collection with the same data type have different index sizes?

steevej · December 26, 2023, 7:28pm

Because they have different cardinality.

Grosso modo, one value is one node in the tree. A higher cardinality implies more nodes.

Monika_Shah · January 4, 2024, 5:12am

I also thought higher cardinality causes more intermediate nodes and hence a larger size. But some examples do not bear this out. For example, F1 has lower cardinality and F2 has higher cardinality in the same collection, yet the index on F1 is larger than the index on F2.

On the other hand, two different collections, coll1 and coll2, have the same number of documents, and every document in both collections has the same field (Rid) with unique values (same cardinality) and the same values, but the document size differs between coll1 and coll2. Even with the same cardinality and the same values, the two collections show different index sizes. Why?

steevej · January 4, 2024, 3:29pm

For case 1, I can imagine some cases where a low cardinality field index would be bigger than a high cardinality field. An example would be

{ f1: 'this is low cardinality but long key value', f2: 2 }

Each index node will require more space with f1 since the values of f1 are way larger than those of f2. So it is possible that, despite having fewer nodes, f1’s index takes more space.
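
The long-key case can be illustrated with a back-of-the-envelope calculation. This is a toy model only, not the database's real on-disk format: it assumes one index entry per document, an 8-byte encoding for numbers, and an 8-byte document reference per entry, and it ignores page overhead and any compression. The names `rawIndexBytes` and `RECORD_POINTER_BYTES` are made up for this sketch.

```javascript
// Toy model: raw key bytes for every index entry.
// Assumptions (not the real storage format): one entry per document,
// 8 bytes per numeric key, 8 bytes per document reference.
const RECORD_POINTER_BYTES = 8;

function rawIndexBytes(docs, field) {
  let total = 0;
  for (const doc of docs) {
    const key = doc[field];
    // Strings count their UTF-8 length; numbers are assumed to take 8 bytes.
    const keyBytes = typeof key === "string" ? Buffer.byteLength(key, "utf8") : 8;
    total += keyBytes + RECORD_POINTER_BYTES;
  }
  return total;
}

// 1000 documents: f1 has one distinct (long) value, f2 has 1000 distinct values.
const docs = Array.from({ length: 1000 }, (_, i) => ({
  f1: "this is low cardinality but long key value",
  f2: i,
}));

console.log("f1 raw bytes:", rawIndexBytes(docs, "f1"));
console.log("f2 raw bytes:", rawIndexBytes(docs, "f2"));
```

Under these assumptions the low cardinality field needs more raw bytes simply because each key is longer. Real index sizes will differ from this estimate.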

As for case 2, I am surprised. Same cardinality, same key type, same index definition and same update pattern should give identical index sizes regardless of document size. What I suspect is that one collection went through a lot of inserts and deletes, and that index space is not necessarily released after a delete, while the other collection did not go through the same cycle of inserts and deletes. I could create such a scenario by playing around with c1 and then aggregating c1 with an $out to c2: exactly the same documents, but the index of c1 is bigger than that of c2.
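
The c1/c2 experiment could be sketched along these lines. This is a hypothetical reconstruction for mongosh, assuming `c1` and `c2` are scratch collections that can be overwritten; the function name `churnThenCopy` and the field `f` are invented for illustration. Whether c1's index actually ends up larger depends on the storage engine and version, so treat the result as something to observe rather than a guarantee.

```javascript
// Hypothetical experiment; c1 and c2 are scratch collections.
function churnThenCopy(db, n) {
  const c1 = db.getCollection("c1");
  // Insert n documents, then delete the upper half to create index churn in c1.
  c1.insertMany(Array.from({ length: n }, (_, i) => ({ _id: i, f: i })));
  c1.deleteMany({ _id: { $gte: n / 2 } });
  // Copy the surviving documents into a fresh collection c2.
  // toArray() forces the pipeline to execute.
  c1.aggregate([{ $match: {} }, { $out: "c2" }]).toArray();
  // Same documents in both collections; compare the reported index sizes.
  return {
    c1: c1.stats().indexSizes,
    c2: db.getCollection("c2").stats().indexSizes,
  };
}

// Possible use in mongosh:
//   churnThenCopy(db, 100000)
```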

Monika_Shah · January 5, 2024, 8:02am

For case 1, the low cardinality field has values from 1 to 10^5, which are obviously smaller than those of the high cardinality field, with values from 1 to 10^7. Even so, the index on the low cardinality field is larger than the index on the high cardinality field.

In the second case, no delete operations were performed. Insert operations were performed, but as far as I know those do not require any space to be released.

May I know the index structure used by the database? Does it use a B+ tree index, where each leaf node contains the key value plus a document pointer for every record, whether the value is duplicate or unique, and each non-leaf node contains key values for the unique keys plus pointers to other nodes? Does db.collection.stats().indexSizes represent compressed index sizes?

steevej · January 5, 2024, 2:30pm

I am sure you may find documentation about the implementation details. I don’t know the exact details and I do not worry that much. I cannot help further on this matter.

Monika_Shah · January 7, 2024, 6:32am

Thank you, steeve, for your help so far.
Just to let you know, the exceptional scenario of two fields in different collections with the same cardinality, same key type, same definition, and same update pattern but different index sizes is the default _id field of two collections with the same number of documents but different document sizes.

Does MongoDB use a B+ tree index structure by default for single-field indexes, for both primary and secondary keys?

steevej · January 7, 2024, 2:16pm

With Google’s help, it looks like it does use a B-tree.
