Why is disk usage so much higher after migrating from 4.2 to 4.4?

I am trying to move collection data from a replica set to a sharded cluster.
In the sharded cluster, we sharded empty collections that were created with the same indexes.

After that, I loaded the data from JSON files using the mongoimport command.
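For reference, the command looked roughly like this (the host, database, collection, and file names below are placeholders, not our real ones):

    mongoimport --host <mongos-host>:27017 --db mydb --collection mycollection \
      --file mycollection.json --numInsertionWorkers 4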

The following are our disk usage figures before and after the migration:
Storage size: 162G → 225G
Index sizes:
  _id index: 53G → 94G
  Single-field indexes: 22G, 24G, 23G → 67G, 68G, 65G

Disk usage seems to have roughly doubled after the migration.
What could be the cause of this?

Hi @111115 and welcome to the MongoDB Community :muscle: !

Could you please explain the architecture of your replica set before the migration and the new architecture of the sharded cluster?
Usually, sharding is overkill below 1TB unless there is another good reason for it (planning to reach 1TB+ soon, geo-distribution, data sovereignty, etc.).
Also, how did you distribute your nodes on your physical machines?

Cheers,
Maxime.

Hi!
Thank you for your prompt reply.

Before the migration, the architecture consisted of one primary, one secondary, and one arbiter, each on its own physical server.

As you said, our existing DB server was using more than 1TB, so we decided to build a new sharded cluster.

The architecture of the new sharded cluster consists of three servers.

Currently, only a limited number of servers are available, so we have configured it like this.
In the future, we are planning to increase the number of config, mongos, and shard servers.

Of course, I thought it would be too much to migrate all the data from the current servers at once.
So, until new servers are added, only a few DBs were going to be migrated.

However, as I explained in the first question,
disk usage increased sharply while migrating collection by collection into the sharded cluster,
so I checked the collection on the mongos with the stats command.

We expected to have more storage capacity because of the hashed _id index, and that was right.
But unfortunately, on top of that, the storage size and index size seem to be more than double
compared to the same number of documents per collection.

Hey,

So… I don’t know where to begin because there are so many things to say here and this raises so many new questions.

I don’t have a clear explanation for the size increase you noticed, to be honest, but several things could explain part of it…

  • Did you use the same MongoDB versions in the former RS and in the sharded cluster?
  • Did you use the same configuration? Not using a different WiredTiger compression algorithm by any chance? (See the snippet after this list for one way to check.)
  • Are these stats just about a single collection?
  • Does that collection contain exactly the same data?
  • How did you gather these stats?
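
One way to check the compression (just a sketch, and “mycollection” is a placeholder name) is to compare the WiredTiger creation string of the collection on both deployments:

    // Compare the "block_compressor=..." part (snappy by default; zstd, zlib or none otherwise).
    // On the sharded cluster, run this directly on a shard member, or look at the
    // per-shard entries under db.mycollection.stats().shards.
    db.mycollection.stats().wiredTiger.creationString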

Also one thing that I don’t understand in what you are saying:

One collection in a RS == one collection in a sharded cluster. If the collection isn’t sharded, it will live in its associated primary shard. If it’s sharded, it will be distributed across the different shards according to the shard key.

Another thing that might not be obvious: sharded clusters come with an overhead. They require more computing power and more machines. If you only have 3 machines available, then it’s more efficient to run a RS on them rather than deploying a sharded cluster and squeezing multiple nodes onto the same machines. It’s also an anti-pattern, because sharded clusters are designed to be safe from single points of failure, and here every single one of your 3 machines is a SPOF that will take the entire cluster down with it if it fails. This architecture is OK for a small dev environment, but definitely not ready for a production environment.

Regarding your expectation of “more storage capacity” because of the hashed _id index: this doesn’t make sense to me. Why would an index give you more storage capacity? It’s the opposite. An index costs RAM & storage.

Also choosing the right shard key is SUPER important.

A hashed index on the _id is the VERY LAST trick in the book you can use to create a shard key. 99% of the time, there is a better shard key that will improve read performance dramatically and still distribute the write operations evenly across the different shards.
A good shard key should optimize your read operations as much as possible by limiting reads to a single shard, while distributing writes evenly across all the shards. The shard key must not increase (or decrease) monotonically, and the rule “data that is accessed together must be stored together” applies more than ever. Ideally, data that you query “together” should be in the same chunk as much as possible. Using a hashed _id does exactly the opposite: it basically spreads the documents randomly across the entire cluster’s chunks, and consequently across its shards. Unless your use case falls in the 1% of special cases, most of your queries will be scatter-gather queries, and those aren’t efficient at all.
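
Just to illustrate the difference (the namespace and field names below are made up, since I don’t know your schema):

    // Option A: a ranged compound shard key that keeps documents queried together
    // in the same chunks and allows targeted reads on customerId (+ date range).
    sh.shardCollection("mydb.events", { customerId: 1, eventDate: 1 })

    // Option B: a hashed _id shard key spreads documents randomly, so any query
    // that doesn't filter on _id becomes a scatter-gather query.
    sh.shardCollection("mydb.events", { _id: "hashed" })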

Also, I see you are using Arbiter nodes and they are usually not recommended in production.
MongoDB Atlas doesn’t use them at all for instance… Anywhere.

There are already a few topics in the community forum covering why arbiters are generally a bad idea.

Regarding the initial question about the sizes… it doesn’t make sense to me, unless the answers to the questions I asked above help us find something suspicious.

Don’t forget that a RS needs to be on 3 different machines to enforce HA, and sharding == scaling, but it shouldn’t cost you your HA status. Each mongod should be on its own dedicated machine with the correct amount of resources to work correctly.

If your 3-node RS is already at about 1TB of data, you probably have about 150GB of RAM on each node, so splitting your dataset in 2 => 500GB and squeezing 2 mongod onto the same node (2*500GB) with still 150GB of RAM is exactly the same problem, but now with an extra layer of complexity and overhead. A sharded cluster needs a bunch of admin collections to work, but they shouldn’t be big enough to be noticeable and create the big difference in sizes that you noticed. So I guess there is something else here.

I hope this helps a bit :smile:.
Cheers,
Maxime.

Hi.
Thank you for explaining in as much detail as possible.

  • Did you use the same MongoDB versions in the former RS and in the sharded cluster?
    The former RS uses version 4.2.3 and the DBs in the sharded cluster are running version 4.4.5.

  • Did you use the same configuration? Not using a different WiredTiger compression algorithm by any chance?
    Yes, the DB version is different, but the configuration is the same. The storage engine is WiredTiger.

  • Are these stats just about a single collection?
    I haven’t migrated all the DB collections yet, so I showed the stats for one of the largest collections.
    Command: db.collection.stats({ scale: 1024*1024*1024 })
    I referred to the storageSize and indexSizes fields.

  • Does that collection contain exactly the same data?
    It’s exactly the same data.
    However, the document count in the pre-migration DB is now higher,
    because the former RS is still in operation and that collection’s data is still growing.

  • How did you gather these stats?
    As I said above, these are the stats results for one collection.
    Other collections also use much more disk space than they did in the former DB.

We set up three physical servers: one (config + mongos) server and two shard servers, and configured a RS.
These two shard servers are to be combined into one shard and used as a RS.
In the future, six new machines will be added.
Is it okay to configure one primary and one secondary per shard without an arbiter server?

As for the shard key, I think that was my mistake.
That collection is mostly queried with aggregations filtered by date, so it would be better to use a date field in the shard key.
(I’m sorry I can’t explain exactly which fields are there, because it’s confidential information.)
Thanks to your advice, I will be careful when choosing the shard key, not only for that collection but also for the other collections.

For some collections, queries are only performed on the _id. In that case, is it okay to use a hashed index on _id as the shard key if the collection is large?

I look forward to your reply.
Thank you.

Absolutely not. If you do that, both nodes are a SPOF, because you need to be able to reach a majority of the voting members of a RS to elect a primary… With 2 nodes, the majority is 2. So if one of them fails, you won’t have a primary anymore (either because it is the node that failed, or because the primary will step down to secondary as it cannot reach a majority of the nodes).
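
For reference, a minimal per-shard replica set would have three data-bearing, voting members, something like this (the RS name, hostnames and ports below are placeholders):

    // Three voting, data-bearing members: a primary can still be elected
    // (majority = 2) if any single machine fails.
    rs.initiate({
      _id: "shard1rs",
      members: [
        { _id: 0, host: "machine1.example.net:27018" },
        { _id: 1, host: "machine2.example.net:27018" },
        { _id: 2, host: "machine3.example.net:27018" }
      ]
    })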

Arbiters are also a bad idea in general, because they do not help to move the majority-commit point forward. If your secondary is down (for a maintenance operation or to perform a backup using the file system), only your primary can perform write operations (arbiters don’t hold data, so no writes), so only a single node gets the data. That’s less than 2, so the data isn’t committed on a majority of the nodes, meaning it can potentially be rolled back if something happens to the primary before the secondary can catch up, and it can’t be read by change streams, for example.

It also creates cache pressure on the primary, which has to keep track of what is majority-committed and what isn’t in order to correctly answer queries using the different readConcern options.

Yes, but do you really only have queries on the _id and no other fields, no aggregations using something else? Where do you get that _id from, then? If it comes from another collection, that looks like a join that could be avoided by embedding the document.
If using this shard key distributes the writes evenly across all the chunks (so shards, by extension) and allows targeted reads (no scatter-gather), then it’s technically perfect.
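
If that is really the case, then something along these lines (placeholder names again) gives you evenly distributed writes and single-shard reads:

    // Hashed shard key on _id: writes are spread evenly across the shards,
    // and a query by _id is routed to exactly one shard.
    sh.shardCollection("mydb.bigcollection", { _id: "hashed" })
    db.bigcollection.find({ _id: someId })   // someId = an _id you already have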

Just to reiterate on your architecture: you want to avoid SPOFs at all costs, because they can make your entire cluster unstable. For example, at the moment you have your 3 config nodes on the same physical machine… Losing this machine will instantly make the entire cluster unavailable until you can restore it. If you lose that hard drive… you will have to restore your entire cluster from a backup, because your chunk definitions won’t be aligned anymore with the data in your shards, which is a terrible situation to be in…
Also, the point of having a sharded cluster is to scale read and write operations, but you have only a single mongos, which will bottleneck your performance and also make the entire cluster unavailable if it dies.
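
For what it’s worth, additional mongos routers are cheap to add once you have more machines: each one is just a process pointed at the same config server replica set (the RS name, hostnames and port below are placeholders):

    # Start another mongos on a second machine, pointing at the same CSRS.
    mongos --configdb configRS/machine1.example.net:27019,machine2.example.net:27019,machine3.example.net:27019 --port 27017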

I recommend the free courses on MongoDB University to get more details.

Regarding the storage and index sizes, I really can’t tell without more investigation. There must be a reason, but it’s out of my reach; the snippet below lists a few fields worth comparing on both sides.
Also, don’t forget to size your RAM correctly so there is enough room for the indexes + the working set, plus some spare RAM for queries, in-memory sorts, aggregations, etc.
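
Just a sketch (the collection name is a placeholder), to be run on both the old RS and on a mongos of the new cluster, then compared:

    // Via mongos the values are aggregated across shards
    // (see the "shards" field of the output for the per-shard breakdown).
    var s = db.mycollection.stats()
    printjson({
      count: s.count,             // same number of documents?
      avgObjSize: s.avgObjSize,   // same average document size?
      size: s.size,               // logical (uncompressed) data size
      storageSize: s.storageSize, // physical size on disk
      indexSizes: s.indexSizes    // per-index sizes on disk
    })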

Cheers,
Maxime.