Hi Deepak,
You can find the server implementation on GitHub in the mongodb/mongo repository: src/mongo/db/hasher.cpp.
If you just want to see what the hashed values look like, there is a convertShardKeyToHashed() helper in the mongo shell (4.0+).
For example, two ObjectIds generated in close succession will be roughly ascending:
> o = new ObjectId()
ObjectId("5ea77f07fcb94966883f5d5e")
> o2 = new ObjectId()
ObjectId("5ea77f0afcb94966883f5d5f")
However, the hashed values will be very different:
> convertShardKeyToHashed(o)
NumberLong("329356589482501449")
> convertShardKeyToHashed(o2)
NumberLong("-3285311604932298140")
Since data in a sharded collection is distributed based on contiguous chunk ranges, naive sharding on { _id: 1 } with default ObjectIds would result in new inserts always targeting the single shard that currently holds the chunk with the maximum _id value. This creates a "hot shard" scenario: inserts target a single shard, with the additional overhead of frequently rebalancing documents to other shards. Adding more shards will not help scale inserts with this poor choice of monotonically increasing shard key.
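To see why, here is a toy model of range-based chunk routing (hypothetical shard names and chunk boundaries, not MongoDB code): every key above the last chunk boundary is routed to the same shard.

```javascript
// Toy model of range-based chunk routing. Shard names and boundaries are
// hypothetical; real chunk metadata lives on the config servers.
const chunks = [
  { min: -Infinity, max: 1000,     shard: 'shardA' },
  { min: 1000,      max: 2000,     shard: 'shardB' },
  { min: 2000,      max: Infinity, shard: 'shardC' }, // holds the max key
];

function routeToShard(key) {
  return chunks.find(c => key >= c.min && key < c.max).shard;
}

// Monotonically increasing keys all land on the chunk with the maximum
// range, i.e. a single "hot" shard:
console.log([2001, 2002, 2003, 2004].map(routeToShard));
// -> [ 'shardC', 'shardC', 'shardC', 'shardC' ]
```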
If you instead shard based on hashed ObjectId values (using { _id: 'hashed' }), consecutive inserts will land in different chunk ranges and, on average, on different shards. Since inserts should be well distributed across all shards, rebalancing activity should be infrequent and only necessary when you add new shards for scaling (or remove existing shards).
As mentioned earlier, you should definitely consider whether hashed sharding is the best approach for your use case. If there is a candidate shard key based on one or more fields present in all of your documents, you will have more options for data locality using the default range-based sharding approach; see Choosing a good shard key in the MongoDB documentation.
Regards,
Stennie