Hello, I’m experimenting with MongoDB 6.0 on Atlas, and more specifically with the new time series collection type.
I started with an M10 cluster tier and, using a custom-made Java application, I filled a collection named insight
with 45M (million) sample documents. A typical document looks like this:
{
    "timestamp" : ISODate("2018-02-01T00:00:00.000+0000"),
    "meta" : {
        "customerId" : ObjectId("437573746f6d657232343639"),
        "insightId" : ObjectId("636f6e73756d61626c650000"),
        "itemDefinitionId" : ObjectId("4974656d4465663200000000"),
        "itemId" : ObjectId("4974656d3100000000000000"),
        "locationId" : ObjectId("4c6f636174696f6e37333200"),
        "month" : NumberInt(2),
        "tenantId" : ObjectId("64656d6f3100000000000000"),
        "year" : NumberInt(2018)
    },
    "_id" : ObjectId("63da7cd53e3dae5a51c21b4c"),
    "data" : {
        "value" : {
            "quantity" : 100.83965259603328,
            "cost" : 116.6917902887385
        }
    }
}
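Given the document shape above, "timestamp" is the timeField and "meta" is the metaField of the time series collection. The creation command was something along these lines (the granularity value here is illustrative, not necessarily what I actually used):

db.createCollection("insight", {
    timeseries: {
        timeField: "timestamp",
        metaField: "meta",
        granularity: "hours"    // illustrative; not necessarily the value I used
    }
})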
The meta fields are not completely random: counting the distinct combinations with the following aggregation returns about 1.5M:
db.insight.aggregate([
    {
        $group: {
            _id: {
                itemId: "$meta.itemId",
                itemDefinitionId: "$meta.itemDefinitionId",
                customerId: "$meta.customerId",
                locationId: "$meta.locationId",
                tenantId: "$meta.tenantId",
                insightId: "$meta.insightId",
                year: "$meta.year",
                month: "$meta.month"
            }
        }
    }
]).toArray().length
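As a side note, since the grouping key covers every field of meta, the same cardinality can be computed entirely server-side (avoiding pulling 1.5M group ids to the client), for example:

db.insight.aggregate([
    // group on the whole meta subdocument; this relies on the field order
    // being consistent across documents, which it is in my generated data
    { $group: { _id: "$meta" } },
    { $count: "distinctSeries" }
])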
My bulk load procedure on M10 took 10+ hours; I had to restart it twice (each time skipping the time ranges already loaded) because my PC went into suspension. The load also triggered two automatic storage resizes on Atlas (which worked perfectly). At the end I created a secondary index with this definition:
{
    "meta.tenantId" : 1,
    "meta.insightId" : 1,
    "meta.customerId" : 1,
    "timestamp" : -1
}
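That is, created with:

db.insight.createIndex({
    "meta.tenantId" : 1,
    "meta.insightId" : 1,
    "meta.customerId" : 1,
    "timestamp" : -1
})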
The secondary index is crucial for my experiment and I was really surprised to discover its size: 482.7 MB.
I then checked the collection stats and got the following:
{
    "ns" : "test.insight",
    "size" : NumberLong(21479128473),
    "timeseries" : {
        "bucketsNs" : "test.system.buckets.insight",
        "bucketCount" : NumberInt(35610000),
        "avgBucketSize" : NumberInt(603),
        ...
    },
    "storageSize" : NumberLong(3451998208),
    ...
    "indexSizes" : {
        "meta.tenantId_1_meta.insightId_1_meta.customerId_1_timestamp_-1" : NumberInt(506097664)
    },
    ...
}
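For reference, the figures above come from the collection stats, retrieved with something like:

db.insight.stats()
// or, equivalently, via the collStats command:
db.runCommand({ collStats: "insight" })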
In summary: the storage size is 3.2 GB, whereas the uncompressed size is 20 GB. And again, the index is 482.7 MB. Plus, the bucket count is 35.6 million, which works out to roughly 1.3 documents per bucket on average (45M documents / 35.6M buckets).
I repeated my procedure, this time on an M40, with no storage resize required. After 3 hours, without interruption, the same data was in the collection, but with completely different (and much better) internals.
Since I wasn't sure about the repeatability of my load procedure, I then tried a very different approach: using Studio 3T I selected Duplicate Collection and got a copy of the very first collection (the one with storage size 3.2 GB, size 20 GB, and index 482.7 MB). The stats of the copy confirmed what I had seen on my second attempt:
{
    "ns" : "test.copy_of_insight",
    "size" : NumberLong(2691175332),
    "timeseries" : {
        "bucketsNs" : "test.system.buckets.copy_of_insight",
        "bucketCount" : NumberInt(2376681),
        "avgBucketSize" : NumberInt(1132),
        ...
    },
    "storageSize" : NumberInt(1077383168),
    ...
    "indexSizes" : {
        "meta.tenantId_1_meta.insightId_1_meta.customerId_1_timestamp_-1" : NumberInt(34258944)
    },
    ...
}
The rewritten collection has size 2.5 GB, storage size 1 GB, and index size 32.7 MB. And the bucket count is about 2.4 million, which works out to roughly 19 documents per bucket on average.
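As far as I can tell, the Studio 3T Duplicate Collection is essentially a full read-and-rewrite of the data. A rough mongosh equivalent would be something along these lines (a sketch; the batch size is arbitrary and the time series options are assumed to match the original collection):

db.createCollection("copy_of_insight", {
    timeseries: { timeField: "timestamp", metaField: "meta" }
})

let batch = [];
db.insight.find().forEach(doc => {
    batch.push(doc);
    if (batch.length === 1000) {    // arbitrary batch size
        db.copy_of_insight.insertMany(batch);
        batch = [];
    }
});
if (batch.length > 0) {
    db.copy_of_insight.insertMany(batch);
}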
I can provide a snapshot of the database with both collections (they are in my Atlas account) for further examination; there is no sensitive data in it.
I wonder if, during my first load, I hit some kind of non-optimized (or buggy) code path related to bucket creation. My suspicion is that 35 million buckets led to the 400+ MB index size, and that the Studio 3T Duplicate Collection, by fully rewriting the data, optimized the bucket distribution, lowering the count to 2.4 million and thus the index to 32 MB.
If you are aware that such an “unoptimized buckets” situation can arise under certain conditions (such as a storage resize in Atlas), would it be possible to add, in a future version of MongoDB, an admin command to rebalance the buckets?
Please take into account that a 400+ MB index (vs an optimal 32 MB) has a serious impact on the required Atlas cluster tier if you plan to keep the index in memory. Not to mention how differently the two collections performed on the same aggregation (using the secondary index), due to how the data is distributed across 35 million buckets vs 2.4 million.
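For illustration, a query that would use this secondary index looks something like the following (the ObjectIds are placeholders taken from the sample document above, not the actual query I ran):

db.insight.aggregate([
    {
        $match: {
            "meta.tenantId" : ObjectId("64656d6f3100000000000000"),
            "meta.insightId" : ObjectId("636f6e73756d61626c650000"),
            "meta.customerId" : ObjectId("437573746f6d657232343639"),
            timestamp : { $gte: ISODate("2018-02-01"), $lt: ISODate("2018-03-01") }
        }
    },
    {
        $group: {
            _id: null,
            totalQuantity: { $sum: "$data.value.quantity" },
            totalCost: { $sum: "$data.value.cost" }
        }
    }
])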