Performance Best Practices: Sharding

Mat Keep and Henrik Ingo

#sharding

Welcome to the fourth in a series of blog posts covering performance best practices for MongoDB.

In this series, we are covering key considerations for achieving performance at scale across a number of important dimensions. This post focuses on sharding.

Sharding for Horizontal Scale Out

Through sharding, you can automatically scale your MongoDB database out across multiple nodes and regions to handle write-intensive workloads, growing data sizes, and data residency requirements.

Sharding with MongoDB allows you to seamlessly scale the database as your applications grow beyond the hardware limits of a single server, and it does so without adding complexity to the application.

To respond to changing workload demand, documents can be migrated between shards, and nodes can be added to or removed from the cluster at any time. MongoDB automatically rebalances the data as needed, without manual intervention.

Sharding Strategies in MongoDB

By simply hashing a primary key value, most distributed databases randomly spray data across a cluster of nodes. This imposes performance penalties when data is queried across nodes and adds application complexity when data needs to be localized to a specific region.

By exposing multiple sharding policies, MongoDB offers you a better approach. Data can be distributed according to query patterns or data placement requirements, giving you much higher scalability across a diverse set of workloads:

  • Ranged Sharding. Documents are partitioned across shards according to the shard key value. Documents with shard key values close to one another are likely to be co-located on the same shard. This approach is well suited for applications that need to optimize range-based queries, such as co-locating data for all customers in a specific region on a specific shard.
  • Hashed Sharding. Documents are distributed according to an MD5 hash of the shard key value. This approach guarantees a uniform distribution of writes across shards, which is often optimal for ingesting streams of time-series and event data.
  • Zoned Sharding. Provides the ability for developers to define specific rules governing data placement in a sharded cluster.
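To make the difference between ranged and hashed placement concrete, here is a minimal sketch in Python. It is a toy model, not MongoDB's actual chunk or routing logic: the four shard names, the split points, and the modulo placement of the MD5 digest are all assumptions chosen for illustration.

```python
import bisect
import hashlib

# Hypothetical 4-shard cluster. The split points are illustrative only.
SHARDS = ["shard0", "shard1", "shard2", "shard3"]
SPLIT_POINTS = [250, 500, 750]  # keys < 250 -> shard0, 250..499 -> shard1, ...


def route_ranged(key: int) -> str:
    """Pick the shard whose key range contains `key`."""
    return SHARDS[bisect.bisect_right(SPLIT_POINTS, key)]


def route_hashed(key: int) -> str:
    """Pick a shard from an MD5 digest of the key (toy modulo placement)."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


if __name__ == "__main__":
    neighbours = range(100, 110)
    # Ranged: adjacent keys fall in the same range, so they share a shard.
    print({route_ranged(k) for k in neighbours})
    # Hashed: adjacent keys are typically scattered across shards.
    print({route_hashed(k) for k in neighbours})
```

In this model, neighbouring key values stay together under ranged placement, which is what makes range queries cheap, while hashed placement trades that locality for an even spread.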

Global Clusters in MongoDB Atlas – the fully managed cloud database service – allow you to quickly implement zoned sharding using a visual UI or the Atlas API. You can easily create distributed databases to support geographically distributed apps, with policies enforcing data residency within specific regions.

Figure 1: Serving always-on, globally distributed, write-everywhere apps with MongoDB Atlas Global Clusters

To ensure you get the full benefit of sharding, there is a set of best practices you should observe.

Ensure a Uniform Distribution of Shard Keys

When shard keys are not uniformly distributed for reads and writes, operations may be limited by the capacity of a single shard. When shard keys are uniformly distributed, no single shard will limit the capacity of the system.

You can learn more about shard keys from the documentation.
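One rough way to reason about a candidate shard key is to estimate what fraction of operations would land on the busiest shard. The sketch below does this in Python under loudly stated assumptions: placement is modelled as an MD5 digest modulo the shard count, which is a simplification and not MongoDB's real chunk assignment, and the sample key values are invented.

```python
import hashlib
from collections import Counter


def shard_for(value, num_shards):
    """Toy placement: MD5 digest of the shard key value, modulo the shard count."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def busiest_shard_share(shard_key_values, num_shards):
    """Fraction of operations that land on the most heavily loaded shard."""
    values = list(shard_key_values)
    if not values:
        raise ValueError("need at least one shard key value")
    load = Counter(shard_for(v, num_shards) for v in values)
    return max(load.values()) / len(values)


if __name__ == "__main__":
    # Hypothetical workloads: one dominated by a single key value, one with unique keys.
    skewed = ["US"] * 9000 + [f"customer-{i}" for i in range(1000)]
    unique = [f"customer-{i}" for i in range(10000)]
    print(busiest_shard_share(skewed, 4))  # at least 0.9: one value owns 90% of operations
    print(busiest_shard_share(unique, 4))  # expected to be close to 1 / 4
```

A share far above 1 / num_shards suggests the key would let a single shard cap the throughput of the whole cluster.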

Avoid Scatter-Gather Queries for Operational Workloads

In sharded systems, queries that cannot be routed based on the shard key must be broadcast to all shards for evaluation. Because these queries involve multiple shards for each request, they do not scale linearly as more shards are added, and extra overhead is incurred to combine results from multiple shards. You should include the shard key in your query to avoid scatter-gather queries.

The exception to this rule is for large aggregation queries. In these cases, scatter-gather can be a useful approach as it allows the query to run in parallel on all shards.
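The routing decision can be sketched as a small Python function. This is a toy model of a query router, not the behaviour of mongos itself: the shard names, the `customer_id` shard key, and the hash-based placement are assumptions made for the example.

```python
import hashlib

# Hypothetical cluster and shard key, for illustration only.
SHARDS = ["shard0", "shard1", "shard2"]
SHARD_KEY = "customer_id"


def owning_shard(value):
    """Toy hash-based placement of a shard key value."""
    digest = hashlib.md5(str(value).encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def target_shards(query):
    """Return the shards that must evaluate `query` in this toy model."""
    value = query.get(SHARD_KEY)
    # An equality match on the shard key can be routed to a single shard.
    if SHARD_KEY in query and not isinstance(value, dict):
        return [owning_shard(value)]
    # No shard key (or an operator expression such as {"$gt": 5}): broadcast.
    return list(SHARDS)


if __name__ == "__main__":
    print(target_shards({"customer_id": 42, "status": "open"}))  # targeted: one shard
    print(target_shards({"status": "open"}))                     # scatter-gather: all shards
```

Adding shards leaves the targeted query touching one shard, while the broadcast query touches every new shard as well, which is why it does not scale linearly.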

Use Hash-Based Sharding When Appropriate

For applications that issue range-based queries, ranged sharding is beneficial because operations can be routed to the fewest shards necessary, usually a single shard. However, ranged sharding requires a good understanding of your data and query patterns, which in some cases may not be practical.

Hashed sharding ensures a uniform distribution of reads and writes, but it does not provide efficient range-based operations.
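The trade-off shows up when you count how many shards a range query has to touch. The following Python sketch compares the two strategies for integer keys. It is a simplified model under assumed values (four shards, split points at 250, 500, and 750, MD5-modulo placement for the hashed case) and does not reproduce MongoDB's internal chunk map.

```python
import bisect
import hashlib

# Hypothetical layout: shards are numbered 0..3, split points are illustrative.
NUM_SHARDS = 4
SPLIT_POINTS = [250, 500, 750]


def ranged_shards_for_range(low, high):
    """Shards whose key ranges overlap [low, high] under ranged sharding."""
    first = bisect.bisect_right(SPLIT_POINTS, low)
    last = bisect.bisect_right(SPLIT_POINTS, high)
    return set(range(first, last + 1))


def hashed_shard(key):
    """Toy hashed placement: MD5 digest modulo the shard count."""
    digest = hashlib.md5(str(key).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


def hashed_shards_for_range(low, high):
    """Shards holding at least one integer key in [low, high] under hashed placement."""
    return {hashed_shard(k) for k in range(low, high + 1)}


if __name__ == "__main__":
    print(ranged_shards_for_range(300, 399))  # {1}: the whole range lives on one shard
    print(hashed_shards_for_range(300, 399))  # typically every shard holds some of these keys
```

Because a router cannot know in advance which shards hold the hashed values in a range, a range query on a hashed key generally has to be sent to all of them.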

Pre-Split and Distribute Chunks

When creating a new sharded collection to load data into, first create empty chunks and distribute them evenly across all shards, and then load the data. For hash-based sharding, you can use the numInitialChunks option to do this automatically.
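As a rough illustration, the Python sketch below builds the command document for sharding a collection on a hashed key with a chosen number of initial chunks. The helper name, the `telemetry.events` namespace, the `device_id` field, and the chunk count are all hypothetical, and the availability and limits of numInitialChunks depend on your server version, so check the documentation before relying on it.

```python
def build_shard_collection_command(namespace, field, num_initial_chunks):
    """Build a shardCollection command document for a hashed shard key.

    `namespace` is "<database>.<collection>". This helper is illustrative
    and only constructs the document; it does not talk to a cluster.
    """
    if "." not in namespace:
        raise ValueError("namespace must look like '<database>.<collection>'")
    if num_initial_chunks < 1:
        raise ValueError("num_initial_chunks must be a positive integer")
    return {
        "shardCollection": namespace,   # command name must be the first key
        "key": {field: "hashed"},       # hashed shard key on a single field
        "numInitialChunks": num_initial_chunks,
    }


if __name__ == "__main__":
    # Hypothetical namespace and field names.
    cmd = build_shard_collection_command("telemetry.events", "device_id", 8)
    print(cmd)
    # With PyMongo and a connection to a mongos, a document like this could be
    # sent as an admin command, for example: client.admin.command(cmd)
```

Creating the chunks on an empty collection before the load means the initial writes are spread across shards from the start, rather than waiting for the balancer to catch up.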

What’s Next

That wraps up this installment of the performance best practices series. Next up: transactions.