
Sharded Cluster Balancer

On this page

  • Balancer Internals
  • Range Migration Procedure
  • Shard Size
  • Chunk Size and Balancing

The MongoDB balancer is a background process that monitors the amount of data on each shard for each sharded collection. When the amount of data for a sharded collection on a given shard reaches specific migration thresholds, the balancer attempts to automatically migrate data between shards and reach an even amount of data per shard while respecting the zones. By default, the balancer process is always enabled.

The balancing procedure for sharded clusters is entirely transparent to the user and application layer, though there may be some performance impact while the procedure takes place.

[Diagram] A collection distributed across three shards. For this collection, the difference in the number of chunks between the shards reaches the migration threshold (in this case, 2) and triggers migration.

The balancer runs on the primary of the config server replica set (CSRS).

To configure collection balancing for a single collection, see configureCollectionBalancing.

To manage the sharded cluster balancer, see Manage Sharded Cluster Balancer.
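
For example, you can check the balancer state from mongosh and set a per-collection range size with configureCollectionBalancing. This is a minimal sketch; the records.pets namespace and the 256 MB value are illustrative:

    // Run against a mongos
    sh.getBalancerState()      // is the balancer enabled?
    sh.isBalancerRunning()     // is a balancing round in progress?

    // Configure balancing for a single collection (chunkSize is in MB)
    db.adminCommand( {
       configureCollectionBalancing: "records.pets",
       chunkSize: 256
    } )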

Range migrations carry some overhead in terms of bandwidth and workload, both of which can impact database performance. The balancer attempts to minimize the impact by:

  • Restricting a shard to at most one migration at any given time. Specifically, a shard cannot participate in multiple data migrations at the same time. The balancer migrates ranges one at a time.

    MongoDB can perform parallel data migrations, but a shard can participate in at most one migration at a time. For a sharded cluster with n shards, MongoDB can perform at most n/2 (rounded down) simultaneous migrations.

    See also Asynchronous Range Migration Cleanup.

  • Starting a balancing round only when the difference in the amount of data between the shard with the most data for a sharded collection and the shard with the least data for that collection reaches the migration threshold.

You can disable the balancer temporarily for maintenance, but leaving the balancer disabled for extended periods of time can degrade cluster performance. For more information, see Disable the Balancer.
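
A minimal mongosh sketch of that maintenance workflow, run against a mongos; sh.stopBalancer() waits for any in-progress migration to complete before disabling the balancer:

    // Disable the balancer for maintenance
    sh.stopBalancer()

    // ... perform maintenance ...

    // Re-enable the balancer when maintenance is complete
    sh.startBalancer()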

You can also limit the window during which the balancer runs to prevent it from impacting production traffic. See Schedule the Balancing Window for details.

Note

The specification of the balancing window is relative to the local time zone of the primary of the config server replica set.
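
The balancing window lives in the balancer document of the config.settings collection. A hedged sketch that restricts balancing to a nightly window; the start and stop times are illustrative and are interpreted in the config server primary's local time zone, as noted above:

    // Run against a mongos
    db.getSiblingDB("config").settings.updateOne(
       { _id: "balancer" },
       { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
       { upsert: true }
    )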


Adding a shard to a cluster creates an imbalance, since the new shard has no data. While MongoDB begins migrating data to the new shard immediately, it can take some time before the cluster balances. See the Add Shards to a Cluster tutorial for instructions on adding a shard to a cluster.

Tip

If your application meets the requirements in Before you Begin, you can use the reshardCollection command to redistribute data across the cluster to include the new shards. This process is much faster than the alternative Range Migration Procedure.

For an example, see Redistribute Data to New Shards.
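
A hedged sketch of such a redistribution; the sales.orders namespace and order_id key are illustrative, and the forceRedistribution option assumes a MongoDB version that supports resharding a collection onto its existing shard key:

    // Re-shard onto the same key purely to redistribute data across
    // all shards, including newly added ones
    db.adminCommand( {
       reshardCollection: "sales.orders",
       key: { order_id: 1 },
       forceRedistribution: true
    } )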

Removing a shard from a cluster creates a similar imbalance, since data residing on that shard must be redistributed throughout the cluster. While MongoDB begins draining a removed shard immediately, it can take some time before the cluster balances. Do not shut down the servers associated with the removed shard during this process.

When you remove a shard in a cluster with an uneven chunk distribution, the balancer first removes the chunks from the draining shard and then balances the remaining uneven chunk distribution.

See the Remove Shards from a Cluster tutorial for instructions on safely removing a shard from a cluster.
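
For example, draining is initiated with the removeShard command, and re-running the command reports progress. The shard name below is illustrative:

    // Start draining the shard
    db.adminCommand( { removeShard: "shard0003" } )

    // Run the same command again to check progress: the result reports
    // state "ongoing" with a remaining count, then "completed" when done
    db.adminCommand( { removeShard: "shard0003" } )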

All range migrations use the following procedure:

  1. The balancer process sends the moveRange command to the source shard.

  2. The source starts the move when it receives an internal moveRange command. During the migration process, operations to the range are sent to the source shard. The source shard is responsible for incoming write operations for the range.

  3. The destination shard builds any indexes required by the source that do not exist on the destination.

  4. The destination shard begins requesting documents in the range and starts receiving copies of the data. See also Range Migration and Replication.

  5. After receiving the final document in the range, the destination shard starts a synchronization process to ensure that it has the changes to the migrated documents that occurred during the migration.

  6. When fully synchronized, the source shard connects to the config database and updates the cluster metadata with the new location for the range.

  7. After the source shard completes the update of the metadata, and once there are no open cursors on the range, the source shard deletes its copy of the documents.

    Note

    If the balancer needs to perform additional chunk migrations from the source shard, the balancer can start the next chunk migration without waiting for the current migration process to finish this deletion step. See Asynchronous Range Migration Cleanup.
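
The steps above are driven by the moveRange command, which the balancer issues automatically but which can also be run manually against a mongos. A minimal sketch; the namespace, bound, and shard name are illustrative:

    // Request migration of the range with lower bound { user_id: 100 }
    // to the shard named "shard0001"
    db.adminCommand( {
       moveRange: "accounts.users",
       min: { user_id: 100 },
       toShard: "shard0001"
    } )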

Warning

Secondary Reads in a Sharded Cluster with Migrations Can Miss Documents

Long-running secondary reads in a sharded cluster can miss documents if migrations are occurring.

Before deleting a chunk during chunk migration, MongoDB waits for orphanCleanupDelaySecs, or for in-progress queries involving the chunk to complete on the shard primary, whichever is longer. Queries that were initially run on a node that was primary, but continue after the node has stepped down to a secondary, are treated as if they were initially executed on a secondary. That is, the server only waits for orphanCleanupDelaySecs if there are no queries targeting the chunk on the current primary.

Queries that target the chunk and are run on secondaries may miss documents if these queries take longer than orphanCleanupDelaySecs.
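
orphanCleanupDelaySecs is a server parameter on the members of each shard replica set. A hedged sketch that raises it to give long-running secondary reads more headroom; the 1800-second value is illustrative:

    // Run against each shard replica set member
    // (or set at startup with --setParameter)
    db.adminCommand( { setParameter: 1, orphanCleanupDelaySecs: 1800 } )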

To minimize the impact of balancing on the cluster, the balancer only begins balancing after the distribution of data for a sharded collection has reached certain thresholds.

A collection is considered balanced if the difference in data between shards (for that collection) is less than three times the configured range size for the collection. For the default range size of 128 MB, the data for a given collection on two shards must differ by at least 384 MB for a migration to occur.
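
The configured range size is stored (in MB) in the chunksize document of the config.settings collection. A small sketch that reads it and computes the corresponding migration threshold; if the document is absent, the 128 MB default applies:

    // Run against a mongos
    const doc = db.getSiblingDB("config").settings.findOne( { _id: "chunksize" } )
    const rangeSizeMB = doc ? doc.value : 128    // default range size in MB
    const thresholdMB = 3 * rangeSizeMB          // e.g. 3 * 128 MB = 384 MB
    print(`Migration threshold for a collection: ${thresholdMB} MB`)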

To migrate data from a shard, the balancer migrates the data one range at a time. However, the balancer does not wait for the current migration's delete phase to complete before starting the next range migration. See Range Migration for the range migration process and the delete phase.

This queuing behavior allows shards to unload data more quickly in cases of a heavily imbalanced cluster, such as when performing initial data loads without pre-splitting and when adding new shards.

This behavior also affects the moveRange command, and migration scripts that use the moveRange command may proceed more quickly.

In some cases, the delete phase may persist longer. Range migrations are resilient to a failover during the delete phase: orphaned documents are cleaned up even if a replica set's primary crashes or restarts during this phase.

The _waitForDelete balancer setting can alter this behavior so that the delete phase of the current migration blocks the start of the next chunk migration. _waitForDelete is generally intended for internal testing purposes. For more information, see Wait for Delete.
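
_waitForDelete is stored in the balancer document of config.settings. A hedged sketch enabling it, which, as noted, is generally only appropriate for testing:

    db.getSiblingDB("config").settings.updateOne(
       { _id: "balancer" },
       { $set: { _waitForDelete: true } },
       { upsert: true }
    )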

During range migration, the _secondaryThrottle value determines when the migration proceeds with the next document in the range.

In the config.settings collection:

  • If the _secondaryThrottle setting for the balancer is set to a write concern, each document moved during range migration must receive the requested acknowledgment before proceeding with the next document.

  • If the _secondaryThrottle setting is unset, the migration process does not wait for replication to a secondary and instead continues with the next document.

To update the _secondaryThrottle parameter for the balancer, see Secondary Throttle for an example.
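
A minimal sketch of setting _secondaryThrottle to a majority write concern in the config.settings collection, run against a mongos:

    db.getSiblingDB("config").settings.updateOne(
       { _id: "balancer" },
       { $set: { _secondaryThrottle: { w: "majority" } } },
       { upsert: true }
    )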

Independent of any _secondaryThrottle setting, certain phases of the range migration have the following replication policy:

  • MongoDB briefly pauses all application reads and writes on the source shard to the collection being migrated before updating the config servers with the new range location. MongoDB resumes application reads and writes after the update. The range move requires all writes to be acknowledged by a majority of the members of the replica set both before and after committing the range move to the config servers.

  • When an outgoing migration finishes and cleanup occurs, all writes must be replicated to a majority of servers before further cleanup (from other outgoing migrations) or new incoming migrations can proceed.

To update the _secondaryThrottle setting in the config.settings collection, see Secondary Throttle for an example.

By default, MongoDB cannot move a range if the number of documents in the range is greater than 2 times the result of dividing the configured range size by the average document size. If MongoDB can move a sub-range of a chunk and reduce the size to less than that, the balancer does so by migrating a range. db.collection.stats() includes the avgObjSize field, which represents the average document size in the collection.
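
A rough sketch of that calculation in mongosh; the sales.orders namespace is illustrative, and the 128 MB figure assumes the default range size:

    // Estimate the largest range (by document count) the balancer can move
    const stats = db.getSiblingDB("sales").orders.stats()
    const rangeSizeBytes = 128 * 1024 * 1024             // configured range size
    const maxDocs = 2 * ( rangeSizeBytes / stats.avgObjSize )
    print(`Ranges with more than ~${Math.floor(maxDocs)} documents cannot be moved`)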

For chunks that are too large to migrate:

  • The balancer setting attemptToBalanceJumboChunks allows the balancer to migrate chunks too large to move as long as the chunks are not labeled jumbo. See Balance Ranges that Exceed Size Limit for details.

    When issuing moveRange and moveChunk commands, it's possible to specify the forceJumbo option to allow for the migration of ranges that are too large to move. The ranges may or may not be labeled jumbo.
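
Hedged sketches of both options above; the balancer setting lives in config.settings, and the namespace, bound, and shard name in the moveRange call are illustrative:

    // Allow the balancer to migrate oversized (but not jumbo-labeled) chunks
    db.getSiblingDB("config").settings.updateOne(
       { _id: "balancer" },
       { $set: { attemptToBalanceJumboChunks: true } },
       { upsert: true }
    )

    // Or force a single oversized range to move
    db.adminCommand( {
       moveRange: "accounts.users",
       min: { user_id: 100 },
       toShard: "shard0001",
       forceJumbo: true
    } )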

You can tune the performance impact of range deletions with rangeDeleterBatchSize, rangeDeleterBatchDelayMS, and rangeDeleterHighPriority parameters. For example:

  • To limit the number of documents deleted per batch, you can set rangeDeleterBatchSize to a small value such as 32.

  • To add an additional delay between batch deletions, you can set rangeDeleterBatchDelayMS above the current default of 20 milliseconds.

  • To prioritize range deletions, you can set rangeDeleterHighPriority to true. Range deletions are potentially long-running background tasks that might negatively impact the throughput of user operations when the system is under heavy load.

Note

If there are ongoing read operations or open cursors on the collection targeted for deletes, range deletion processes may not proceed.
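
These are server parameters on each shard's mongod processes. A hedged sketch applying the example values mentioned above at runtime; rangeDeleterHighPriority assumes a MongoDB version that supports it:

    // Run against each shard mongod (or set at startup with --setParameter)
    db.adminCommand( { setParameter: 1, rangeDeleterBatchSize: 32 } )
    db.adminCommand( { setParameter: 1, rangeDeleterBatchDelayMS: 100 } )
    db.adminCommand( { setParameter: 1, rangeDeleterHighPriority: true } )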

Starting in MongoDB 5.3, during range migration, change stream events are not generated for updates to orphaned documents.

By default, MongoDB attempts to fill all available disk space with data on every shard as the data set grows. To ensure that the cluster always has the capacity to handle data growth, monitor disk usage as well as other performance metrics.

See the Change the Maximum Storage Size for a Given Shard tutorial for instructions on setting the maximum size for a shard.

For an introduction to chunkSize, see Modify Range Size in a Sharded Cluster.

The following describes how chunkSize affects defragmentation and balancer operations in different MongoDB versions:

  • MongoDB 6.0 and later: When the amount of collection data on two shards differs by three or more times the configured chunkSize setting, the balancer migrates chunks between the shards. For example, if chunkSize is 128 MB and the collection data on the two shards differs by 384 MB or more, the balancer migrates chunks between the shards.

  • Earlier than MongoDB 6.0: When a chunk grows larger than chunkSize, the chunk is split.

When chunks are moved, split, or merged, the shard metadata is updated after the chunk operation is committed by a config server. Shards not involved in the chunk operation are also updated with new metadata.

The time for the shard metadata update is proportional to the size of the routing table. CRUD operations on the collection are temporarily blocked while the shard metadata is updated, and a smaller routing table means shorter CRUD operation delays.

Defragmenting a collection reduces the number of chunks and the time to update the chunk metadata.

To reduce the system workload, configure the balancer to run only at a specific time using a shard balancing window. Defragmentation runs during the balancing window time period.

You can use the chunkDefragmentationThrottlingMS parameter to limit the rate of split and merge commands run by the balancer.

You can start and stop defragmentation at any time.
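
A hedged sketch that throttles defragmentation, starts it for one collection, and checks progress; the sales.orders namespace and the throttle value are illustrative:

    // Limit the rate of split and merge commands
    // (run against the config server replica set primary)
    db.adminCommand( { setParameter: 1, chunkDefragmentationThrottlingMS: 2000 } )

    // Start defragmentation for a collection (run against a mongos)
    db.adminCommand( {
       configureCollectionBalancing: "sales.orders",
       defragmentCollection: true
    } )

    // Check balancing and defragmentation status for the collection
    db.adminCommand( { balancerCollectionStatus: "sales.orders" } )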

You can also set a shard zone. A shard zone is based on the shard key, and you can associate each zone with one or more shards in a cluster.
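
A minimal sketch associating a shard with a zone and assigning a shard key range to it; the shard name, zone name, namespace, and key bounds are illustrative and assume a { country: 1, order_id: 1 } shard key:

    // Associate a shard with a zone
    sh.addShardToZone("shard0000", "EU")

    // Assign a range of the shard key to that zone
    sh.updateZoneKeyRange(
       "sales.orders",
       { country: "DE", order_id: MinKey },
       { country: "DE", order_id: MaxKey },
       "EU"
    )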

Starting in MongoDB 6.0, a sharded cluster only splits chunks when chunks must be migrated. This means the chunk size may exceed chunkSize. Larger chunks reduce the number of chunks on a shard and improve performance because the time to update the shard metadata is reduced. For example, you might see a 1 TB chunk on a shard even though you have set chunkSize to 256 MB.

chunkSize affects the following:

  • Maximum amount of data the balancer attempts to migrate between two shards in a single chunk migration operation.

  • Amount of data migrated during defragmentation.

For details about defragmenting sharded collections, see Defragment Sharded Collections.
