Can I delete the _id index after adding a hashed index?

I have a replica set on version 4.4 with an index on the _id field. As the data grows, we have decided to shard the collection.
We need to create a hashed index to support hash-based sharding for our collection.
With sharding, would mongo require both the _id index and the _id_hashed index?
After sharding, can we delete the _id index? Otherwise it would keep consuming extra space.

Thanks in advance for the help

Hi @Ishrat_Jahan

No, you cannot delete the default _id index.

With sharding, would mongo require both the _id index and the _id_hashed index?

Yes. They are used differently: the _id index prevents documents with the same primary key from being inserted (this is universal in MongoDB, sharded cluster or otherwise), and the _id_hashed index supports the hashed shard key (this is specific to a sharded cluster).
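To make the two steps concrete, here is a minimal sketch that only builds the command documents for creating the hashed index and sharding the collection on it. Nothing is sent to a server. The database name `shop`, the collection name `orders`, and the helper `hashed_sharding_commands` are assumptions for illustration, not names from this thread.

```python
def hashed_sharding_commands(db_name, coll_name, field="_id"):
    """Build the command documents for hashed sharding on one field.

    Returns (create_index_cmd, shard_cmd). Nothing is sent to a server.
    """
    # createIndexes runs against the database that owns the collection.
    create_index_cmd = {
        "createIndexes": coll_name,
        "indexes": [{"key": {field: "hashed"}, "name": f"{field}_hashed"}],
    }
    # shardCollection runs against the admin database through mongos.
    shard_cmd = {
        "shardCollection": f"{db_name}.{coll_name}",
        "key": {field: "hashed"},
    }
    return create_index_cmd, shard_cmd


# Hypothetical names: database "shop", collection "orders".
create_index_cmd, shard_cmd = hashed_sharding_commands("shop", "orders")
print(create_index_cmd)
print(shard_cmd)
```

With PyMongo these could probably be run as `client["shop"].command(create_index_cmd)` and `client.admin.command(shard_cmd)` against a mongos. On 4.4, sharding also has to be enabled on the database first. Note that neither command touches the default _id index, which stays in place.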

However I’m curious: what is the content of your _id field? Is it the auto-generated ObjectId, or something custom? Why do you need it to be hashed to serve as the shard key? Is it because it’s monotonically increasing?

Best regards
Kevin


We are storing orders in the mongo store. The _id is the orderId. Each orderId has a timestamp component + UUID + some suffix. It's not monotonically increasing; it will be random. The sharding technique that we intend to use is hashed, and mongo requires a hashed index to support this: https://www.mongodb.com/docs/manual/core/hashed-sharding/#hashed-sharding

It's not monotonically increasing; it will be random

This may be a little late in your development timeline, but typically a hashed shard key is used to solve the “hot shard” or “hot chunk” problem caused by a monotonically increasing shard key, where all inserts basically go into one shard/chunk and limit the parallelization offered by sharding.
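The hot chunk effect can be sketched with a toy model. This is an illustration under simplifying assumptions: four fixed ranges, no chunk splitting or balancing, and md5 used as a stand-in hash rather than MongoDB's actual hashing function. The names `range_shard` and `hashed_shard` are made up for this example.

```python
import hashlib
from collections import Counter

NUM_SHARDS = 4


def range_shard(key, boundaries):
    """Return the index of the first range whose upper bound exceeds key."""
    for shard, upper in enumerate(boundaries):
        if key < upper:
            return shard
    return len(boundaries)  # the last range is open-ended


def hashed_shard(key, num_shards=NUM_SHARDS):
    """Place a key by hashing it (md5 is a stand-in, not MongoDB's function)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Existing data covers keys 0..999, split evenly into 4 ranges.
boundaries = [250, 500, 750]

# New inserts use an ever-increasing key, e.g. a counter or a timestamp.
new_keys = range(1000, 2000)

range_counts = Counter(range_shard(k, boundaries) for k in new_keys)
hashed_counts = Counter(hashed_shard(k) for k in new_keys)

print("range :", dict(range_counts))
print("hashed:", dict(hashed_counts))
```

In this model every new key is larger than the last boundary, so all 1000 range-placed inserts land in the final range, while the hashed placement spreads them across all four.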

Since you have an almost-random _id, you should not have this issue. I’m curious, have you tried experimenting with sharding using just your _id as the shard key, and found hashed sharding is a better solution?

Best regards
Kevin


If we just use the _id, without hashed, it would be range-based sharding, right?
And range-based sharding is suitable for scenarios where queries involve contiguous values. Also, it is said that hash-based sharding should be used when a random distribution of the data is to be achieved.

without hashed, it would be range-based sharding, right?

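Yes, it’s called range sharding. As far as I know, it means the data is partitioned by contiguous ranges of shard key values, so a range query on the shard key can be routed to only the shards that hold that range. In contrast, with hashed sharding a range query on the shard key still works, but it generally has to be broadcast to all shards, because adjacent values hash to unrelated chunks.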
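As a rough illustration of how a range query on the shard key behaves under each scheme, the sketch below uses a toy model with four fixed ranges. The helper names are hypothetical, and md5 is only a stand-in for MongoDB's hash function.

```python
import hashlib
import math


def shards_for_range_query(lo, hi, boundaries):
    """Shards whose key range overlaps the half-open query [lo, hi)."""
    lowers = [-math.inf] + boundaries
    uppers = boundaries + [math.inf]
    return [
        shard
        for shard, (lower, upper) in enumerate(zip(lowers, uppers))
        if lo < upper and hi > lower
    ]


def hashed_shard(key, num_shards=4):
    """Place a key by hashing it (md5 is a stand-in, not MongoDB's function)."""
    digest = hashlib.md5(str(key).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


boundaries = [250, 500, 750]

# Range sharding: only the shards that overlap the queried range are needed.
targeted = shards_for_range_query(300, 400, boundaries)
spanning = shards_for_range_query(200, 600, boundaries)

# Hashed placement: the same 100 adjacent keys are scattered across shards.
hashed_hit = sorted({hashed_shard(k) for k in range(300, 400)})

print("range, 300..399:", targeted)
print("range, 200..599:", spanning)
print("hashed, 300..399:", hashed_hit)
```

In this model the narrow range query needs a single shard under range sharding, while the same keys are spread over every shard once they are hashed, which is why the router cannot narrow the query down.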

range-based sharding is suitable for scenarios where queries involve contiguous values.

Yes, but I don’t think it’s limited to this application (queries for contiguous values). You can definitely use it for non-range queries as well.

hash-based sharding should be used when a random distribution of the data is to be achieved

This is true. However, from your description of your _id field, it appears to be semi-random already (at least I think so, due to the use of UUID). Still, I can’t say for sure that it’s truly non-monotonic, because the key also contains a timestamp. One way to find out is to simulate how the sharded cluster will behave over time using simulated workloads. If the _id field does not create any hotspots in a shard/chunk after an extended simulation, then I think it’s a valid alternative to a hashed key.
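Before running a real workload against a test cluster, a quick offline sketch can hint at what to expect. The one below is a toy model, not a substitute for that test: the two id layouts are guesses (the exact orderId format was not given), timestamps are small integers, the ranges are fixed with no splitting or balancing, and all function names are made up for illustration.

```python
import random
from collections import Counter

rng = random.Random(42)  # fixed seed so the run is repeatable


def make_id(ts, layout):
    """Build a made-up order id from a timestamp and a random 128-bit hex string."""
    rand_hex = f"{rng.getrandbits(128):032x}"
    if layout == "ts_first":
        return f"{ts:010d}-{rand_hex}"
    return f"{rand_hex}-{ts:010d}"


def split_points(sorted_keys, num_chunks):
    """Pick boundaries that divide the existing keys into equal-sized ranges."""
    step = len(sorted_keys) // num_chunks
    return [sorted_keys[i * step] for i in range(1, num_chunks)]


def chunk_of(key, boundaries):
    """Index of the range that contains key."""
    return sum(key >= b for b in boundaries)


def simulate(layout, num_chunks=4, existing=2000, incoming=2000):
    """Count which range each new insert lands in for the given id layout."""
    old = sorted(make_id(ts, layout) for ts in range(existing))
    boundaries = split_points(old, num_chunks)
    new = (make_id(ts, layout) for ts in range(existing, existing + incoming))
    return Counter(chunk_of(k, boundaries) for k in new)


ts_first = simulate("ts_first")
uuid_first = simulate("uuid_first")
print("timestamp first:", dict(ts_first))
print("uuid first     :", dict(uuid_first))
```

In this model, ids that start with the timestamp send every new insert to the last range, while ids that start with the random part spread roughly evenly. So whether a plain _id shard key avoids hotspots probably depends on which component leads the key.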

Sorry, I know this is not what you’re asking and this has gone off-topic. I’m just trying to provide alternative thoughts :)

Best regards
Kevin

I will give this a try.
But thanks for your input, it was really helpful.

Regards,
Ishrat
