Hello,
we have a MongoDB replica set (version 5.0.22) and we’ve recently run into a peculiar situation: a 5-field compound index is half the size of a 3-field compound index (I’ll call them IDX_5 and IDX_3). The first three fields of IDX_5 are the same as those of IDX_3, in the same order, so IDX_3 is essentially a prefix of IDX_5.
I’ll give some more context (I’ll be writing quite a lot, so bear with me):
The collection has 1.4 billion documents, a size of 3200 GB, and a storageSize of 800 GB. We first created IDX_3, consisting of the following fields:
- field_1: string field with low cardinality, used to identify a user’s documents
- field_2: int field with very low cardinality
- field_3: date field. It has fairly high cardinality, though not as high as you might expect, because a lot of documents are modified in batches: in a sample of 100 million documents there were “only” 270 thousand unique date values (a rough sketch of this check follows the list).
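For reference, the date-cardinality check can be reproduced with an aggregation along these lines (the collection name "events" is a placeholder, and with a sample this large $sample effectively scans the collection):

```javascript
// Count distinct field_3 values in a 100M-document sample.
// "events" is a hypothetical collection name.
db.events.aggregate(
  [
    { $sample: { size: 100000000 } },
    { $group: { _id: "$field_3" } },
    { $count: "uniqueDates" }
  ],
  { allowDiskUse: true }
);
```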
We then created IDX_5, which consists of the same three fields as IDX_3 (in the same order) plus two other fields (both indexes are sketched after this list):
- field_4: int field with extremely low cardinality; the distinct values are 0, 1, or the field is missing entirely
- field_5: int field with very low cardinality
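For clarity, the two index definitions look roughly like this (again, "events" is a placeholder for the real collection name):

```javascript
// 3-field index
db.events.createIndex(
  { field_1: 1, field_2: 1, field_3: 1 },
  { name: "IDX_3" }
);

// 5-field index: the same prefix plus two low-cardinality int fields
db.events.createIndex(
  { field_1: 1, field_2: 1, field_3: 1, field_4: 1, field_5: 1 },
  { name: "IDX_5" }
);
```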
I’ll spare you the details of why we created it like this. Not super optimized, I know, but good enough for our particular case.
I should also add that the number of unique key combinations does not change with the addition of the two extra fields.
We expected IDX_5 to be, if not bigger, then at least the same size as IDX_3; instead it’s half the size (7.6 GB vs 14.7 GB).
Both are regular indexes, and IDX_5 is not partial. I compared the creationString of each index using collection.stats() with indexDetails, and they are identical (see the snippet below).
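This is roughly how we pulled the sizes and creation strings ("events" is a placeholder):

```javascript
const stats = db.events.stats({ indexDetails: true });

// On-disk size of each index, in bytes
printjson(stats.indexSizes);

// WiredTiger creation strings (block/prefix compression, page sizes, etc.)
print(stats.indexDetails["IDX_3"].creationString);
print(stats.indexDetails["IDX_5"].creationString);
```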
We also considered the possibility that it was a matter of fragmentation, or that there were more writes during the creation of IDX_3 and it was therefore built in a less optimized manner. But I can say with confidence that this is not the case: we performed an initial sync on our hidden node, and when it finished the sizes were the same. We also created two other indexes to see if the pattern repeated, and it did: a 3-field index and a 5-field index, differing from the originals only in the first field (instead of field_1, another string field similar to it, also with low cardinality, was chosen).
We also ran another test to see how much space each index occupies in the cache. We scanned the entirety of both indexes with two queries and looked at the ‘bytes currently in the cache’ value of each index before and after the queries (a sketch of the measurement follows the results).
IDX_3:

```
"storage": {
  "data": {
    "bytesRead": 14261006197,
    "timeReadingMicros": 22577675
  }
},
"durationMillis": 1342402
```

26 GB in the cache.
IDX_5:

```
"storage": {
  "data": {
    "bytesRead": 7369959116,
    "timeReadingMicros": 11922711
  }
},
"durationMillis": 1555271
```

20 GB in the cache.
(We did this on our hidden node, restarting it before each query and checking that there were no evictions.)
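For completeness, the per-index cache figure comes from the WiredTiger stats exposed in indexDetails; the scan-and-measure step looked roughly like this (the covered query and the "events" name are illustrative):

```javascript
// Read WiredTiger's "bytes currently in the cache" for a given index.
function bytesInCache(indexName) {
  const details = db.events.stats({ indexDetails: true }).indexDetails[indexName];
  return details.cache["bytes currently in the cache"];
}

const before = bytesInCache("IDX_5");

// Walk the entire index with a hinted, covered query so every key is read.
db.events
  .find({}, { _id: 0, field_1: 1, field_2: 1, field_3: 1, field_4: 1, field_5: 1 })
  .hint("IDX_5")
  .itcount();

const after = bytesInCache("IDX_5");
print("IDX_5 cache delta (GiB): " + (after - before) / 1024 ** 3);
```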
Could someone with deeper knowledge of WiredTiger satisfy our curiosity and explain what is going on under the hood?
Thank you very much.