Why does database size increase with the number of shards?

I have observed that a collection's size increases as shards are added to the cluster. For example, the same collection data takes up more space on a 3-shard cluster than on a 2-shard cluster. How can this be explained?

I do not know the magnitude of what you have observed, but some of it might simply be explained by basic overhead. From

we see that (bold text is from the link above and italics are my comments):

Each collection has a certain minimum overhead of a few kilobytes.

On a 2-shard cluster you have this overhead twice per sharded collection, and 3 times on a 3-shard cluster.

Each index, including the index on _id, requires at least 8 kB of data space.

Same as above. The overhead is 2x with 2 shards and 3x with 3 shards.

I am sure there are other reasons, but I see this as a start toward explaining what you observed. To conclude, more shards implies more overhead.
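As a rough sketch of that arithmetic, the snippet below multiplies a fixed per-shard footprint by the number of shards. The 8 kB per index comes from the quote above. The 4 kB per-collection figure is only my assumption standing in for "a few kilobytes", and the function name is made up for illustration. It also assumes the collection has data on every shard.

```javascript
// Minimum per-index footprint quoted above (8 kB), in bytes.
const MIN_INDEX_BYTES = 8 * 1024;

// ASSUMPTION: "a few kilobytes" of per-collection overhead modeled as 4 kB.
// Measure the real value on your own deployment.
const ASSUMED_COLLECTION_OVERHEAD_BYTES = 4 * 1024;

// Hypothetical helper: fixed overhead for ONE collection across all shards,
// assuming every shard holds its own instance of the collection and indexes.
function estimateFixedOverheadBytes(
  shardCount,
  indexCount,
  collectionOverheadBytes = ASSUMED_COLLECTION_OVERHEAD_BYTES
) {
  const perShard = collectionOverheadBytes + indexCount * MIN_INDEX_BYTES;
  return shardCount * perShard;
}

// One collection with only the _id index:
console.log(estimateFixedOverheadBytes(2, 1)); // 24576
console.log(estimateFixedOverheadBytes(3, 1)); // 36864
```

With these assumed numbers the fixed part is small (tens of kilobytes), so it only explains the difference if the observed growth is of a similar magnitude.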

Hi @Prof_Monika_Shah,

What command are you using to measure collection size? Are you referring to the size on disk or the size of data in the collection? How much of a difference are you observing?

The overall size of data in the collection should be the same, but distributed across multiple shards (where collections are sharded).

Disk usage on donor shards may increase temporarily as data is rebalanced to the new shards. It is expected that there will be more space available for reuse, which will be used as new data is added to the sharded cluster.


I think db.collection.stats().size is the size of the data in the collection, and db.collection.stats().storageSize is the size of the data on disk. Am I correct?

The web reference mentioned in your previous reply says each collection has a minimum overhead, which includes the index size for the _id field.

But sir, I have checked db.coll.stats().storageSize, which excludes indexSize.
And this storageSize becomes larger with the number of shards.

I do not understand which overhead makes the collection size increase.

Per documentation,

If you have a simple replica set, you have that overhead once per collection.

If you have 2 shards, then you have 2 times that overhead per collection.

When you have 3 shards, then you have 3 times that overhead per collection.

Create an empty collection and look at storageSize; this is the overhead per collection. You get that overhead per shard because each shard has one instance of the given collection.
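To see where the storageSize goes, it can help to break the stats output down per shard. Below is a minimal sketch, assuming the stats document of a sharded collection carries a `shards` sub-document with `size` and `storageSize` for each shard. The helper name and the sample numbers are hypothetical and only there so the snippet runs outside the shell.

```javascript
// Summarize data size and on-disk size per shard from a collStats-style document.
function storageByShard(stats) {
  // A sharded collection reports one sub-document per shard under `shards`;
  // an unsharded one does not, so treat it as a single entry.
  const perShard = stats.shards ? stats.shards : { "(unsharded)": stats };

  const rows = Object.keys(perShard).map(function (name) {
    return {
      shard: name,
      size: perShard[name].size,
      storageSize: perShard[name].storageSize,
    };
  });

  // Add up both columns so the totals can be compared between clusters.
  const totals = rows.reduce(
    function (acc, row) {
      acc.size += row.size;
      acc.storageSize += row.storageSize;
      return acc;
    },
    { size: 0, storageSize: 0 }
  );

  return { rows: rows, totals: totals };
}

// HYPOTHETICAL sample numbers, not taken from a real cluster.
const sampleStats = {
  sharded: true,
  shards: {
    shardA: { size: 600000, storageSize: 250000 },
    shardB: { size: 400000, storageSize: 200000 },
  },
};

const summary = storageByShard(sampleStats);
console.log(summary.rows);
console.log(summary.totals);
```

In the shell you would pass the real output instead, for example `storageByShard(db.coll.stats())`, and run it against an empty collection first to read off the per-shard baseline.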

Hi @Prof_Monika_Shah,

Did you check the amount of space available for reuse per my earlier comment?
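One way to check this, sketched under the assumption that the shards use WiredTiger and that each shard's stats include a `wiredTiger["block-manager"]` section with a "file bytes available for reuse" entry. The helper name and sample values are hypothetical.

```javascript
// WiredTiger statistic that reports free space inside the collection file.
const REUSE_KEY = "file bytes available for reuse";

// Hypothetical helper: map each shard name to its reusable bytes,
// or null when the statistic is not present in the stats document.
function reusableBytesByShard(stats) {
  const perShard = stats.shards ? stats.shards : { "(unsharded)": stats };
  const result = {};

  Object.keys(perShard).forEach(function (name) {
    const wt = perShard[name].wiredTiger;
    const blockManager = wt ? wt["block-manager"] : undefined;
    result[name] =
      blockManager && REUSE_KEY in blockManager ? blockManager[REUSE_KEY] : null;
  });

  return result;
}

// HYPOTHETICAL sample numbers, not taken from a real cluster.
const sampleReuseStats = {
  shards: {
    shardA: {
      storageSize: 250000,
      wiredTiger: { "block-manager": { "file bytes available for reuse": 120000 } },
    },
    shardB: {
      storageSize: 200000,
      wiredTiger: { "block-manager": { "file bytes available for reuse": 4096 } },
    },
  },
};

console.log(reusableBytesByShard(sampleReuseStats));
```

In the shell, `reusableBytesByShard(db.coll.stats())` would show whether a donor shard's storageSize is mostly space waiting to be reused after chunks were moved away.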