Sharding Pitfalls: Part I



By Adam Comerford, Senior Solutions Engineer

Sharding is a popular feature in MongoDB, primarily used for distributing data across clusters for horizontal scaling. The scalability benefits of sharding are well known, and they are often one of the major factors in choosing MongoDB in the first place. But as you add complexity to a distributed system, you increase the chances of hitting a problem.

The good news is that many of the common issues people encounter when moving to a sharded environment are avoidable, and most of them can be mitigated if you have already hit them.

Forewarned is forearmed, so we want you to be aware of best practices, and of situations to avoid, when introducing sharding into your environment. In this three-part blog series we will discuss several pitfalls and gotchas that we see with some regularity among MongoDB users. For each one, we’ll give an overview of the problem, explain how it occurs and how to avoid it, and then discuss some possible mitigation strategies to employ if you have already run into it.

Some of these topics are worthy of full technical articles in their own right, which is beyond the scope of a relatively short blog post. Think of these posts as a good starting point and, if you have not yet hit any of these problems, as an informative cautionary tale for anyone running a sharded MongoDB cluster. For additional details, please view the Sharding section in the MongoDB Manual.

Many of these topics are also covered as part of the M102 (MongoDB for DBAs) and M202 (Advanced Deployment and Operations) classes that are available for free on MongoDB University.

For our first set of cautionary tales we will focus on shard keys.

1. Using a monotonically increasing shard key (like ObjectId)

Although this is one of the most commonly covered topics in blogs, training material, MongoDB Days and more, the selection of a shard key remains a formidable exercise for the novice MongoDB DBA or developer.

The most common mistake we see is the selection of a monotonically increasing shard key when using range-based sharding rather than hashed sharding. "Monotonically increasing" is a fancy way of saying that the shard key value for new documents only ever increases. Examples include a timestamp (naturally) or anything that has a time component as its most significant part, such as an ObjectId (whose first 4 bytes are a timestamp).
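To make the ObjectId case concrete, here is a minimal sketch that reads the timestamp out of the first 4 bytes of an ObjectId written as a 24-character hex string. The function name and the two sample values are made up for illustration. It only shows why ObjectIds created later sort after ObjectIds created earlier.

```python
from datetime import datetime, timezone


def objectid_timestamp(oid_hex: str) -> datetime:
    """Return the creation time stored in the first 4 bytes of an ObjectId."""
    if len(oid_hex) != 24:
        raise ValueError("an ObjectId is 12 bytes, i.e. 24 hex characters")
    # The leading 4 bytes (8 hex characters) are big-endian seconds
    # since the Unix epoch.
    seconds = int(oid_hex[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


# Two hypothetical ObjectIds whose leading bytes differ by one second.
older = "5f000000aaaaaaaaaaaaaaaa"
newer = "5f000001aaaaaaaaaaaaaaaa"

print(objectid_timestamp(older), objectid_timestamp(newer))
print(older < newer)  # later creation time means a larger key
```

Because the time component sits in the most significant position, every new value compares greater than the values created before it, which is exactly the property that causes trouble with range-based sharding.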

Why is it a bad idea?

The short answer is insert scalability. If you select such a shard key, all inserts (new documents) will go to a single chunk, the one covering the highest range, and that will never change. Hence, regardless of how many shards you add, your maximum write capacity will never increase: you will only ever write new documents to a single chunk, and that chunk will only ever live on a single shard.
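The effect can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of the real router: the split points, the chunk-to-shard layout and the `shard_for` helper are all hypothetical, and chunk splitting and balancing are ignored.

```python
import bisect
from collections import Counter

# Hypothetical chunk boundaries for a range-sharded collection.
# Chunk 0 holds keys below 1000, chunk 1 holds [1000, 2000), and so on;
# the last chunk is open-ended and covers everything from 3000 upward.
SPLIT_POINTS = [1000, 2000, 3000]
CHUNK_TO_SHARD = ["shardA", "shardB", "shardC", "shardA"]  # four chunks


def shard_for(key: int) -> str:
    """Route a key to the shard owning the chunk whose range contains it."""
    chunk = bisect.bisect_right(SPLIT_POINTS, key)
    return CHUNK_TO_SHARD[chunk]


# New documents always receive keys larger than anything stored so far.
new_keys = range(3001, 4001)
writes_per_shard = Counter(shard_for(k) for k in new_keys)
print(writes_per_shard)
```

In this toy layout every one of the 1,000 new keys lands in the open-ended top chunk, so a single shard absorbs all of the insert load while the others sit idle.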

Occasionally, this type of shard key can be the correct choice, but if you choose it, you must accept that you won’t be able to scale for write capacity.

Possible Mitigation Strategies

  • Change the shard key - this is problematic with large collections, because the data essentially has to be dumped out and re-imported

  • More specifically, use a hashed shard key, which will allow the use of the same field while providing good write scalability.
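The second option can be sketched with another simulation: hash the same monotonically increasing values before routing them, and the writes spread across chunks. Everything here is an assumption made for illustration. MD5 truncated to 64 bits simply stands in for whatever hash function the server applies, and the chunk layout and helper names are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

NUM_CHUNKS = 4
HASH_SPACE = 2 ** 64
# Hypothetical layout: the hash space is cut into four equal ranges.
SPLIT_POINTS = [HASH_SPACE * i // NUM_CHUNKS for i in range(1, NUM_CHUNKS)]


def hashed_key(value: int) -> int:
    """Map a shard key value to a 64-bit hash (MD5 is a stand-in only)."""
    digest = hashlib.md5(str(value).encode()).digest()
    return int.from_bytes(digest[:8], "big")


def chunk_for(value: int) -> int:
    """Return the index of the chunk whose hash range contains the value."""
    return bisect.bisect_right(SPLIT_POINTS, hashed_key(value))


# The same ever-increasing keys that would hit one chunk under range sharding.
writes_per_chunk = Counter(chunk_for(k) for k in range(3001, 4001))
print(sorted(writes_per_chunk.items()))
```

Consecutive key values produce unrelated hashes, so inserts are scattered over all chunks. The trade-off is that range queries on the original field can no longer be targeted at a single chunk.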

2. Trying to Change the Value of the Shard Key

Shard keys are immutable (cannot be changed) for an existing document. This issue usually only crops up when sharding a previously unsharded collection. Prior to sharding, certain updates will be possible that are no longer possible after the collection has been sharded.

Attempting to update the shard key for an existing document will fail with the following error:

cannot modify shard key's value fieldid for collection:

Possible Mitigation Strategies

  • Delete and re-insert the document to alter the shard key rather than attempting to update it in-place. Note that this is not an atomic operation, so it must be done with caution.
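A minimal sketch of the delete and re-insert approach follows. The helper name `change_shard_key` is hypothetical, and `collection` is assumed to be any object exposing `find_one`, `delete_one` and `insert_one` in the style of a PyMongo collection. The comments mark where the lack of atomicity bites.

```python
def change_shard_key(collection, doc_filter: dict, new_key_fields: dict) -> dict:
    """Replace a document so that it carries new shard key values.

    The delete and the insert are two separate writes and are NOT atomic.
    """
    original = collection.find_one(doc_filter)
    if original is None:
        raise LookupError("no document matches the filter")

    # Copy the document and overwrite only the shard key field(s).
    replacement = {**original, **new_key_fields}

    # From here until the insert succeeds, the document does not exist:
    # concurrent readers will not find it, and a crash loses it.
    collection.delete_one({"_id": original["_id"]})
    try:
        collection.insert_one(replacement)
    except Exception:
        # Best-effort restore of the original; this can fail as well.
        collection.insert_one(original)
        raise
    return replacement
```

A call might look like `change_shard_key(coll, {"_id": some_id}, {"region": "us"})`, where `region` is an assumed shard key field. Running this during a quiet period, or keeping a copy of the original document elsewhere first, reduces the risk from the window between the two writes.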

Now you have a better understanding of how to choose and change your shard key if needed. In our next post, we will go through some potential obstacles you will face when scaling your sharded environment.

If you want more insight on scaling techniques for MongoDB, view the slides and video from our recent webinar, which reviews three different ways to achieve scale with MongoDB.

Read Part II in the series to see what to look out for when running a sharded cluster.