Sharded Cluster - Why is my data not split among nodes?

Hello there. I created a cluster with two shards. Or so I thought. In my design, the first server hosts mongos_router, shard1 (3 nodes), and the config servers. The second server hosts only shard2 (3 nodes as well). After I configured the shards, I enabled sharding for every database. When I look at the output of sh.status(), I see the output below:

...
  {
    database: {
      _id: 'wins_emission',
      primary: 'shard1rs',
      partitioned: false,
      version: {
        uuid: UUID('aebf94cf-6069-41ba-9a91-f91a944071b1'),
        timestamp: Timestamp({ t: 1711952615, i: 3000 }),
        lastMod: 1
      }
    },
    collections: {}
  },
  {
    database: {
      _id: 'wins_healthcheck',
      primary: 'shard2rs',
      partitioned: false,
      version: {
        uuid: UUID('663cb5f7-b7b3-4f40-9f52-2c3d1969fb65'),
        timestamp: Timestamp({ t: 1711952305, i: 4 }),
        lastMod: 1
      }
    },
...

I understood this to mean the databases would be distributed among the shards, and I was expecting the data on the nodes not to be the same. For example, the notifications collection has 17.7k docs, and I'm expecting these docs to be spread across the nodes, like shard1's first node having 4k, shard1's second node having 4k, etc. But it isn't working like that: every node in every shard has the same amount, 17.7k.

I may have misunderstood this. I tried sharding at the collection level for the notifications collection: I created a hashed shard key and then executed the sh.shardCollection(xxx) command. Now my first shard has 4.7k docs across its nodes, whereas shard2 has 12.9k. This made me think of the following questions.

1- Do I need to shard every collection to make use of a sharded cluster?
2- Should I shard every collection, or only the ones that will hold big data, like logs?
3- Why do all the nodes in a shard have the same number of docs? Aren't they supposed to distribute the data among themselves?

Any help is appreciated.

The uneven distribution (4.7k vs. 12.9k) after sharding the notifications collection with a hashed shard key likely points to skew in the data itself. Even with hashing, every distinct value of the shard key maps to a single hash, so if the key field has low cardinality (only a few distinct values), documents can still end up heavily concentrated on one shard. It may also simply be that the balancer hasn't finished migrating chunks yet; distribution usually evens out over time for a well-chosen key.
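
If you want to confirm where the documents and chunks actually sit, you can check the per-shard breakdown from mongos. A minimal sketch, assuming the database is wins_emission and yourShardKeyField stands in for whatever field you hashed (substitute your real names):

use wins_emission

// Per-shard docs, data size and chunk count for the sharded collection
db.notifications.getShardDistribution()

// Rough cardinality check on the hashed field - with only a few distinct
// values, hashing cannot spread the data evenly across shards
db.notifications.aggregate([
  { $group: { _id: "$yourShardKeyField" } },
  { $count: "distinctValues" }
])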

You're correct that enabling sharding on a database doesn't distribute anything by itself; it only allows collections in that database to be sharded. Distribution happens per collection, driven by the shard key you define: that key determines how documents are partitioned into chunks and assigned to specific shards. Without a shard key, all documents of an unsharded collection stay on that database's primary shard (shard1rs for wins_emission, shard2rs for wins_healthcheck in your output). As for your third question: the nodes inside a shard are members of a replica set, and every member holds a full copy of that shard's data, so identical document counts within a shard are exactly what you should see. Sharding distributes data between shards, not between the nodes of one shard.
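
For completeness, the usual sequence is: enable sharding on the database, make sure a supporting index exists, then shard the collection. This is just a sketch with assumed names (wins_emission.notifications and a hashed _id key); swap in your real namespace and a high-cardinality field of your own:

// Enable sharding for the database (no-op if already enabled)
sh.enableSharding("wins_emission")

// A non-empty collection needs the supporting hashed index before sharding
db.getSiblingDB("wins_emission").notifications.createIndex({ _id: "hashed" })

// Shard the collection on the hashed key
sh.shardCollection("wins_emission.notifications", { _id: "hashed" })

// Anything you leave unsharded stays whole on its database's primary shard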

You don't necessarily need to shard every collection. It's most beneficial for collections whose datasets are too large for a single shard, or whose frequent read/write traffic targets specific subsets of data that a shard key can isolate.
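
If it helps with that decision, you can check a collection's size and document count before committing to a shard key. A small sketch (the database and collection names are only examples):

// Data size, storage size and document count for a single collection
db.getSiblingDB("wins_emission").notifications.stats()

// Or print name, document count and data size for every collection in a database
const target = db.getSiblingDB("wins_emission")
target.getCollectionNames().forEach(function (name) {
  const s = target.getCollection(name).stats()
  print(name, s.count, s.size)
})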

I would recommend regularly monitoring the shard distribution with sh.status() and adjusting shard keys or adding more shards if needed.
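
A few helpers for that monitoring, run from mongos (the namespace below is assumed, not taken from your cluster):

// Overall cluster layout: shards, databases, chunk distribution
sh.status()

// Is the balancer enabled, and is it currently moving chunks?
sh.getBalancerState()
sh.isBalancerRunning()

// Per-collection balance report (available on MongoDB 4.4+)
sh.balancerCollectionStatus("wins_emission.notifications")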

Hope this helps. Happy sharding!
