Choosing a shard key

Coder_codaira · February 27, 2020, 10:56am

Thinking to manage clustering of mongodb thinking use shards.

I am confuse if we use _id as shard key is it a best practice can we later change shard key?

Natac13 · February 27, 2020, 11:11am

Also from my learning a Mongo University the _id field is not a good selection for the shard key as it is based off time which is a monotonically increasing value. You could make a compound shard key with the _id field though.

Prasad_Saya · February 27, 2020, 11:56am

Sharding is about horizontal scaling; i.e., distributing a collection’s data across shards in a cluster. There are two types of sharding: Ranged and Hashed sharding.

Yes, you can use _id as shard key or part of the shard key. Before that, you have to find if it makes a good shard key for your application. A good shard key will allow the data to be distributed evenly among the shards. And, the important queries that access the data must use the shard key as part of the query filter criteria. Note that without the shard key as part of the query criteria the queries will be very slow and inefficient.

The shard key field is used to shard a collection. Once a collection is sharded you cannot change the shard key field (e.g., if a collection is sharded using the “product_id” field, later you cannot change the shard key to “product_name”).

Also, see documentaion about Choosing a Shard Key and Monotonically Changing Shard Keys.

Note, the _id field is of the category of “Monotonically Changing Shard Keys”.

chris · February 27, 2020, 2:19pm

The _id can be an absolutely fine shard key as it is not required to be an objectID. It can be any BSON except an array. https://docs.mongodb.com/manual/core/document/#the-id-field

Natac13 · February 27, 2020, 3:11pm

The default _id is monotonically increasing based off time, as @Prasad_Saya stated. So if you use a hashed shard key with _id then it is a good selection, but not as the default value.

Coder_codaira · February 27, 2020, 3:16pm

Thank you for your replies actually supporting lagacy code and queries to mongo are not optimized I want to horizontally scale the server what are the options?

Prasad_Saya · February 27, 2020, 4:34pm

Applications with not optimized queries are hard on the users (and their work). This should be a definite and immediate concern. MongoDB has tools and techniques to optimize the existing queries and make them perform better.

I think having little more information about the data and code (the application) will help to suggest and discuss. The amount of data and the kind of application are important factors. Sharding is recommended for certain amounts of capacities where vertical scaling becomes expensive.

Sharding also means determining the shard key, analyzing and building the queries, and of course the new hardware configuration. The hardware and the software to put together a sharded cluster is more complex. So is its building and maintainance. And then the budget.

These are just some initial thoughts.

Also see:

errythroidd · February 27, 2020, 6:16pm

you can shard a collection on _id using hashed sharding. In my honest opinion, that is the easiest way to shard a collection and it will distribute your data across shards and get you additional performance and storage. However, as @Prasad_Saya stated, you need to build the sharded cluster which involves additional hardware and configuration.

Keep in my mind the below two caveats with sharding:-

Once you shard a collection, the selection of the shard key is immutable; i.e. you cannot select a different shard key for that collection.
Starting in MongoDB 4.2, you can update a document’s shard key value unless the shard key field is the immutable _id field. For details on updating the shard key, see Change a Document’s Shard Key Value.Before MongoDB 4.2, a document’s shard key field value is immutable.

More details about shard keys can be found here.

Asya_Kamsky · February 27, 2020, 8:46pm

you can shard a collection on _id using hashed sharding. In my honest opinion, that is the easiest way to shard a collection and it will distribute your data across shards and get you additional performance and storage.

Not necessarily. If majority of your queries are not by _id then you will run into a scaling problem because all of those queries will be “scatter gather” - meaning they will be sent to every shard.

Now your system is always as slow as the slowest of all the shards…

errythroidd · February 28, 2020, 5:26am

Scatter gather queries are not necessarily always bad. I guess it really depends on the data sets, but in some cases, if you shard the collection then the data set is smaller on each shard and easily fits in the memory making the queries faster despite being scatter gather. The goal here should be to fit the working data set in memory.

Below are my assumptions behind my theory:-
a. All the shards in the cluster are identical (equally sized with similar IOPS and Memory).
b. the data is more or less evenly distributed across shards. (in most cases, hashed sharding does give even distribution, although there are cases when it does not happen).
c. unable to choose a shard key which satisfies all the queries .

Amalik_Amriou · February 28, 2020, 2:38pm

Welcome aboard @errythroidd :

Asya_Kamsky · February 29, 2020, 4:00am

I’m afraid that’s not really correct. It seems like having less data on each shard will help, but think about linear scaling - if you double the number of shards, you want things to get twice as fast. And while the amount of data on each shard will be half what it was before, a scatter gather query will actually generate double the number of queries you had before.

The “magic” of sharding as a method for horizontal scaling lies in targeted queries, not scatter-gather queries.

Asya