Shard sizes on disk are imbalanced

Hello,
We've been using MongoDB 6.0.4 with ranged sharding.
We noticed that the storage attached to the shards is imbalanced, which has become a concern for storage planning.

In this case, we have a 3-shard cluster that looks balanced in terms of shard data size, but the actual size of the data on disk is very different, be it because of compression/dedup. When our system sees a disk is about to fill up, it automatically starts a new shard with a new replica set, disks and everything, even though the other shards may only be at 50% of their capacity.

I tried setting the chunkSize to 128MB on a different system but didn't see any effect; the chunks seem to go beyond that (right now past 1GB), so I'm not sure how I can use that to solve the issue. I also saw this thread about it: Chunk size many times bigger than configure chunksize (128 MB)
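In case it helps, the chunk size is typically set either globally via config.settings or per collection with configureCollectionBalancing; a sketch of the global approach (the value below is just the 128MB example):

  use config
  db.settings.updateOne(
    { _id: "chunksize" },
    { $set: { value: 128 } },   // value is in MB
    { upsert: true }
  )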

This is the first system, with 3 shards (and no chunkSize restriction); each volume here is 512GB:

shard1:

  • node1: volume11: 317GB
  • node2: volume12: 326GB
  • node3: volume13: 325GB

shard2:

  • node1: volume21: 258GB
  • node2: volume22: 258GB
  • node3: volume23: 258GB

shard3:

  • node1: volume31: 186GB
  • node2: volume32: 194GB
  • node3: volume33: 194GB
[direct: mongos] zios> db.collection_name.getShardDistribution()
Shard shard3 at shard3/***IPs***
{
  data: '428.79GiB',
  docs: 522351318,
  chunks: 4752,
  'estimated data per chunk': '92.4MiB',
  'estimated docs per chunk': 109922
}

Shard shard1 at shard1/***IPs***
{
  data: '429.1GiB',
  docs: 300330555,
  chunks: 975,
  'estimated data per chunk': '450.67MiB',
  'estimated docs per chunk': 308031
}

Shard shard2 at shard2/***IPs***
{
  data: '428.68GiB',
  docs: 290760604,
  chunks: 2720,
  'estimated data per chunk': '161.38MiB',
  'estimated docs per chunk': 106897
}

Totals
{
  data: '1286.58GiB',
  docs: 1113442477,
  chunks: 8447,
  'Shard shard3': [
    '33.32 % data',
    '46.91 % docs in cluster',
    '881B avg obj size on shard'
  ],
  'Shard shard1': [
    '33.35 % data',
    '26.97 % docs in cluster',
    '1KiB avg obj size on shard'
  ],
  'Shard shard2': [
    '33.31 % data',
    '26.11 % docs in cluster',
    '1KiB avg obj size on shard'
  ]
}

As you can see, the distribution between the shards looks ok, but the final disk usage is not balanced at all.
Is there any suggestion we can follow to balance the disk capacities better?

Thanks for your support.

Hi @Oded_Raiches and welcome to the MongoDB community forum!!

As mentioned in the documentation, starting from MongoDB version 6.0.3, shards are balanced based on data size rather than the number of chunks.
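As a quick sanity check, the balancer can also report whether it considers a collection balanced under this policy (a sketch; the namespace is a placeholder):

  sh.balancerCollectionStatus("<db_name>.<collection_name>")
  // balancerCompliant: true means no further migrations are currently needed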

To understand the issue further, could you confirm whether the documents in the collection have similar sizes?
I tried to reproduce the above in a local sharded deployment with 3 shards and 5GB of data, and I see a similar distribution between the shards because the documents in the collection have inconsistent sizes (i.e. some are much larger than others).

Please provide the details below for a clearer understanding:

  1. A sample document and the output of db.<collection name>.stats() for the collection (example commands below).
  2. The shard key selected and the reason for choosing it.
  3. The type of shard key that was created.
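For example, most of the above can be gathered in mongosh like this (collection_name is a placeholder):

  db.collection_name.findOne()   // a sample document
  db.collection_name.stats()     // collection stats, including per-shard storage details
  sh.status()                    // shows the shard key and chunk distribution per collection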

Let us know if you have any further queries.

Best Regards
Aasawari


Hi @Aasawari, thanks for the reply!

We do have different document sizes, but in general they are small.
The shard key is the _id, and it has to stay this way since we use the prefixes to list certain groups of documents (with a regex on those prefixes).
The 2 main document examples:

{
	_id: 'key1,prefix1/prefix2/folder/5b46247f-f797-45cf-9a61-35a62542c52d/b70a9a7c-1597-4d8b-a7ba-eb05ba9849ed/blocks/98f769b32035439d4345c8916c7a1d56/240075.b9875d2a27e9344a6031f6378af91951.00000000000000000000000000000000.blk',
	about 20 additional keys, document length about 2K bytes.
	these "large" documents are less frequent.
}

{
	_id: '+prefix0+key1,+prefix1/prefix2/folder/5b46247f-f797-45cf-9a61-35a62542c52d/0d52826b-a804-4699-89bc-79d935f0af8e/blocks/c7cee0f0df504aa031b9300f5fa2f93d/24788.f7124aa6c880f4be9c037a759df46736.00000000000000000000000000000000.blk+8323786472.99638',
	about 15 additional keys, document length about 700 bytes.
	these documents are more frequent.
}
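For context, the prefix listings mentioned above look roughly like this (a sketch; the prefix value is just an example):

  // anchored regex on _id, so it maps to a contiguous range of the ranged shard key
  db.collection_name.find({ _id: /^key1,prefix1\/prefix2\/folder\// })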


db.<collection name>.stats() output (too long for the reply, had to add it as a file):

collection_stats.txt (65.8 KB)

Thanks!

Hi @Aasawari!
Is there anything new regarding the info provided?

Hi @Oded_Raiches

I believe this is the relevant snippet from the collection stats. Note that I rearranged it a little to put the shards in sequence, and I also added annotations to the sizes to make it easier to read:

  shards: {
    shard1: {
      ns: '<db_name>.<collection_name>',
      size: Long("429786275912"), ~429.79 GB
      count: 258306027,
      avgObjSize: 1663,
      numOrphanDocs: 0,
      storageSize: Long("218921512960"), ~218.92 GB
      freeStorageSize: Long("122932756480"), ~122.93 GB
      ...
    shard2: {
      ns: '<db_name>.<collection_name>',
      size: Long("429519998496"), ~429.52 GB
      count: 293937414,
      avgObjSize: 1461,
      numOrphanDocs: 0,
      storageSize: Long("143384731648"), ~143.38 GB
      freeStorageSize: Long("59074117632"), ~59.074 GB 
      ...
    shard3: {
      ns: '<db_name>.<collection_name>',
      size: Long("429381962949"), ~429.38 GB
      count: 486677674,
      avgObjSize: 882,
      numOrphanDocs: 0,
      storageSize: Long("103618985984"), ~103.62 GB
      freeStorageSize: Long("6423724032"), ~6.42 GB 
      ...

What I observed is that the size (uncompressed size) is quite similar across the three shards.

However the avgObjSize and the storageSize are telling a different story here:

  • shard1 has the largest avgObjSize, about 2x as large as shard3's, and it has half as many documents as shard3
  • shard2 also has large objects, much larger than shard3's but not as large as shard1's
  • shard3 has the smallest objects, and it has twice as many documents as shard1
  • There is a large amount of freeStorageSize on each shard, I'm guessing because the workload also involves deleting/updating a large number of documents

Although the uncompressed size is balanced, the compressed size is not balanced across the shards. You mentioned that you have a mix of larger and smaller documents in general, but they seem to be concentrated on certain shards instead of being spread evenly across all the shards. This leads to unbalanced disk use overall.

I think it may be caused by the shard key. The best shard key should spread the workload evenly across the shards, but that doesn’t seem to be the case here. See Uneven Load Distribution for more details.

To mitigate this, I would consider picking a shard key that allows for a more balanced distribution of large and small documents. Note that from MongoDB 5.0, you can reshard a collection.
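If you go that route, a resharding run would look roughly like this (a sketch; the namespace and the new key are placeholders, and choosing a suitable key is the important part):

  db.adminCommand({
    reshardCollection: "<db_name>.<collection_name>",
    key: { <new_shard_key_field>: 1 }
  })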

Best regards
Kevin


Hi @kevinadi, I don't think the shard key is the issue.
I tried out compact, and a disk at ~80% used capacity went down to ~40%.
It seems the DB is leaving the disk fragmented. Why isn't a defrag run from time to time when needed? Is there a way to know how much of the data is fragmented and whether a compact run is needed?

This is sort of alluded to in my earlier reply:

What I forgot to explain is that WiredTiger is a no-overwrite storage engine. When a document is deleted, it is marked as deleted and its space is marked as reusable (freeStorageSize). This is because:

  • It’s assumed that your database will have more data in the future, not less. So the disk space will eventually be reused.
  • Defrag (compact) is an expensive and disruptive command. Its main purpose is to defrag the collection file, and release unused space to the OS (no guarantees though). If your database will have more data in the future, that means that it will need to reallocate that space again in the future, rendering the time & effort used by compact to be moot.

Of course this is a general assumption and may not be true in all cases. If you find that you won’t need to reuse that space, then compacting it is the right thing to do.
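On the question of knowing when a compact run is worthwhile: the freeStorageSize vs storageSize numbers in the stats output above are the indicator. A rough per-shard check, and the compact command itself, could look like this (a sketch; collection_name is a placeholder, and compact has to be run against each shard's members directly, not through mongos):

  // per-shard reusable space, from the sharded collection stats (run via mongos)
  const s = db.collection_name.stats()
  Object.entries(s.shards).forEach(([shard, st]) => {
    const pct = (100 * st.freeStorageSize / st.storageSize).toFixed(1)
    print(`${shard}: storageSize=${st.storageSize}, freeStorageSize=${st.freeStorageSize} (${pct}% reusable)`)
  })

  // on each shard member, if you decide the space won't be reused:
  db.runCommand({ compact: "collection_name" })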

Hope this clears things up.

Best regards
Kevin


Thanks! The info helped me build my solution :pray: