One shard still has all the data after equal balancing

Hi,
I have a single-instance database with around 550GB+ of data. Most of it comes from one collection, reports, which is around 220GB (the size value from db.reports.stats()). There were no replicas or shards previously.

Mongo Version: 4.4

The collection contains data for each account for each date.
Fields:

"account_id": ObjectId
"dt": ISODate
...

with this index:

"account_id" : "hashed",
"dt" : 1,

I sharded this collection with the below shard-key:

{ "account_id": "hashed", dt: 1 }

across two shards, rs0 and rs1, with a chunk size of 128MB.

The original standalone instance became shard rs0, and a new instance was created for rs1.
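
Roughly, the setup looked like this from mongos (a sketch, not the exact commands I ran):

mongos> sh.addShard("rs0/ip_1:27017")
mongos> sh.addShard("rs1/ip_2:27017")
mongos> use config
mongos> db.settings.updateOne({ _id: "chunksize" }, { $set: { value: 128 } }, { upsert: true })  // 128MB chunks
mongos> sh.enableSharding("my_db")
mongos> sh.shardCollection("my_db.reports", { account_id: "hashed", dt: 1 })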

After balancing, this is the result of sh.status():

mongos> sh.status()
--- Sharding Status ---
  sharding version: {
  	"_id" : 1,
  	"minCompatibleVersion" : 5,
  	"currentVersion" : 6,
  	"clusterId" : ObjectId("6809d3fbb150ac217e17cdb5")
  }
  shards:
        {  "_id" : "rs0",  "host" : "rs0/ip_1:27017",  "state" : 1 }
        {  "_id" : "rs1",  "host" : "rs1/ip_2:27017",  "state" : 1 }
  active mongoses:
        "4.4.29" : 1
  autosplit:
        Currently enabled: yes
  balancer:
        Currently enabled:  yes
        Currently running:  no
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours:
                No recent migrations
  databases:
        {  "_id" : "my_db",  "primary" : "rs0",  "partitioned" : true,  "version" : {  "uuid" : UUID("678340ef-b4e5-41cd-89cd-3db149934150"),  "lastMod" : 1 } }
                my_db.reports
                        shard key: { "account_id" : "hashed", "dt" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                rs0	15118
                                rs1	15118
                        too many chunks to print, use verbose if you want to force print
        {  "_id" : "config",  "primary" : "config",  "partitioned" : true }
                config.system.sessions
                        shard key: { "_id" : 1 }
                        unique: false
                        balancing: true
                        chunks:
                                rs0	512
                                rs1	512
                        too many chunks to print, use verbose if you want to force print

and the response of db.reports.getShardDistribution():

mongos> db.reports.getShardDistribution()

Shard rs1 at rs1/ip_1:27017
 data : 1037.5GiB docs : 723551016 chunks : 15118
 estimated data per chunk : 70.27MiB
 estimated docs per chunk : 47860

Shard rs0 at rs0/ip_2:27017
 data : 1090.36GiB docs : 755327560 chunks : 15118
 estimated data per chunk : 73.85MiB
 estimated docs per chunk : 49962

Totals
 data : 2127.87GiB docs : 1478878576 chunks : 30236
 Shard rs1 contains 48.75% data, 48.92% docs in cluster, avg obj size on shard : 1KiB
 Shard rs0 contains 51.24% data, 51.07% docs in cluster, avg obj size on shard : 1KiB

:exclamation: Not sure why the data size displayed is so much larger than the disk space it actually occupies.

:information_source: Note that the stats say the data is almost perfectly balanced between the two shards.

This is the volume info before sharding:

> df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  516K   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  750G  553G  198G  74% /
tmpfs           3.2G     0  3.2G   0% /run/user/1000

Volume info after sharding -
rs0:

> df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  516K   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  750G  553G  198G  74% /
tmpfs           3.2G     0  3.2G   0% /run/user/1000

rs1:

> df -h

Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  520K   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  750G  205G  546G  28% /
tmpfs           3.2G     0  3.2G   0% /run/user/1000

:information_source: Note that rs0 still has all the data. I did a lookup for an account and it was present in the collection on both shards for all dates.
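
(For the lookup I connected to each shard's primary directly, roughly like this; the ObjectId is a placeholder:)

rs0:PRIMARY> use my_db
rs0:PRIMARY> db.reports.countDocuments({ account_id: ObjectId("000000000000000000000000") })
// same query on rs1:PRIMARY; reads that bypass mongos are not filtered by shard
// ownership, so orphaned documents show up too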

Collection Stats:

{
	"sharded" : true,
	"capped" : false,
	"ns" : "my_db.reports",
	"count" : 1478878576,
	"size" : NumberLong("2284788914827"),
	"storageSize" : 430626484224,
	"totalIndexSize" : 115241549824,
	"totalSize" : 545868034048,
	"avgObjSize" : 1544.6181780538552,
	"maxSize" : NumberLong(0),
	"nindexes" : 5,
	"nchunks" : 30236,
	"shards" : {
		"rs1" : {
			"ns" : "my_db.reports",
			"size" : NumberLong("1114017622641"),
			"count" : 723551016,
			"avgObjSize" : 1539,
			"storageSize" : 140832759808,
			"freeStorageSize" : 380928,
			"capped" : false,
			"totalIndexSize" : 65197338624,
			"totalSize" : 206030098432,
		},
		"rs0" : {
			"ns" : "my_db.reports",
			"size" : NumberLong("1170771292186"),
			"count" : 755327560,
			"avgObjSize" : 1550,
			"storageSize" : 289793724416,
			"freeStorageSize" : 134577819648,
			"capped" : false,
			"totalIndexSize" : 50044211200,
			"totalSize" : 339837935616,
		}
	},
	"ok" : 1
}

Later I searched around and found that it could be due to orphaned data, so I ran this on rs0:

rs0:PRIMARY> db.runCommand({ cleanupOrphaned: "my_db.reports" })

This had only a minor effect on the volume occupied by rs0:

> df -h
Filesystem      Size  Used Avail Use% Mounted on
devtmpfs         16G     0   16G   0% /dev
tmpfs            16G     0   16G   0% /dev/shm
tmpfs            16G  532K   16G   1% /run
tmpfs            16G     0   16G   0% /sys/fs/cgroup
/dev/nvme0n1p1  750G  512G  239G  69% /
tmpfs           3.2G     0  3.2G   0% /run/user/1000
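
For reference, pending orphan cleanup can apparently also be inspected on the shard primary via the shard-local config.rangeDeletions collection (a sketch; 4.4+, field name assumed):

rs0:PRIMARY> use config
rs0:PRIMARY> db.rangeDeletions.countDocuments({ nss: "my_db.reports" })  // pending range-deletion tasks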

:information_source: Note that this is not a production database; there are no queries running on either of the shards. This system is completely isolated.

My queries:

  1. The stats say the data is balanced almost perfectly, so why is the disk space occupied by the database on rs0 the same before and after sharding?
  2. Why does it say my database has 2127.87GiB of data collectively across both shards, even though the volume occupied is only ~500GB?
  3. The balancer was supposed to delete the orphaned records. It has been 3-4 days since sharding completed and the size still hasn’t changed. What went wrong here?
  4. What can I do to make MongoDB remove the duplicate (orphaned) records from the collection on rs0?

Happy to share more info if needed.
Thanks.

Hi @Sumit_Bhardwaj

Please be aware that MongoDB 4.4 has reached its end of life (EOL). I highly recommend upgrading to our latest stable release, MongoDB 8.0, in order to benefit from significant performance enhancements and new features introduced over the past few years.

Regarding the behavior you’re observing:

WiredTiger (MongoDB's storage engine) can return space to the operating system, but only if the free space is located at the end of the file, allowing for truncation. If the free space is in the middle of a file, it cannot be released to the OS. However, this space is not wasted; it can be reused by the collection itself. According to the statistics you provided, the rs0 instance has a free storage size (freeStorageSize) of approximately 135GB. This indicates that 135GB of space is allocated to the collection and available for reuse.
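
A quick way to see this on the shard itself (a sketch, run on the rs0 primary):

rs0:PRIMARY> var s = db.getSiblingDB("my_db").reports.stats()
rs0:PRIMARY> print("storageSize:     " + (s.storageSize / Math.pow(1024, 3)).toFixed(1) + " GiB on disk")
rs0:PRIMARY> print("freeStorageSize: " + (s.freeStorageSize / Math.pow(1024, 3)).toFixed(1) + " GiB allocated but reusable")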

In order to release this space to the OS, you can perform a logical initial sync of the rs0 replica set.

Additionally, in newer versions (6.0.3+) of MongoDB, you can monitor the number of orphaned documents using the aggregation stage $shardedDataDistribution.
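
For example (a sketch; 6.0.3+ only, run through mongos on the admin database):

db.getSiblingDB("admin").aggregate([
  { $shardedDataDistribution: {} },
  { $match: { ns: "my_db.reports" } }
])
// each shard entry includes numOrphanedDocs alongside numOwnedDocuments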


In your case, @Sumit_Bhardwaj, as the shards are not replica sets with additional members, an initial sync will not be available. You can try compact on the collection to reclaim the free blocks.

ref: (version specific) https://www.mongodb.com/docs/v4.4/reference/command/compact/
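
A minimal sketch of that (run directly against the rs0 primary, not through mongos):

rs0:PRIMARY> use my_db
rs0:PRIMARY> db.runCommand({ compact: "reports" })  // may take a while on a collection this size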


Thanks @Renato_Riccio and @chris, your answers helped.
However, I still don’t understand why the shard distribution says rs0 has 1037.5GiB of data when the volume is only 750GB (698.5GiB).

Hi @Sumit_Bhardwaj

MongoDB uses snappy compression by default with the WiredTiger storage engine.

size is the uncompressed data size
storageSize is the compressed size of the data on disk

ref: (version specific) https://www.mongodb.com/docs/v4.4/core/wiredtiger/#compression
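
For example, from the collection stats you posted: size is 2,284,788,914,827 bytes ≈ 2,128GiB of uncompressed data, while storageSize is 430,626,484,224 bytes ≈ 401GiB actually written to disk, roughly a 5.3x compression ratio. getShardDistribution reports the uncompressed size figure, which is why it shows far more data than the disk holds.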
