- Frequently Asked Questions >
- FAQ: Sharding with MongoDB
FAQ: Sharding with MongoDB¶
On this page
- Is sharding appropriate for a new deployment?
- How does sharding work with replication?
- Can I change the shard key after sharding a collection?
- What happens to unsharded collections in sharded databases?
- How does MongoDB distribute data across shards?
- What happens if a client updates a document in a chunk during a migration?
- What happens to queries if a shard is inaccessible or slow?
- How does MongoDB distribute queries among shards?
- How does MongoDB sort queries in sharded environments?
- How does MongoDB ensure unique
_id
field values when using a shard key other than_id
? - I’ve enabled sharding and added a second shard, but all the data is still on one server. Why?
- Is it safe to remove old files in the
moveChunk
directory? - How does
mongos
use connections? - Why does
mongos
hold connections open? - Where does MongoDB report on connections used by
mongos
? - What does
writebacklisten
in the log mean? - How should administrators deal with failed migrations?
- What is the process for moving, renaming, or changing the number of config servers?
- When do the
mongos
servers detect config server changes? - Is it possible to quickly update
mongos
servers after updating a replica set configuration? - What does the
maxConns
setting onmongos
do? - How do indexes impact queries in sharded systems?
- Can shard keys be randomly generated?
- Can shard keys have a non-uniform distribution of values?
- Can you shard on the
_id
field? - Can shard key be in ascending order, like dates or timestamps?
- What do
moveChunk commit failed
errors mean? - How does draining a shard affect the balancing of uneven chunk distribution?
This document answers common questions about horizontal scaling using MongoDB’s sharding.
If you don’t find the answer you’re looking for, check the complete list of FAQs or post your question to the MongoDB User Mailing List.
Frequently Asked Questions:
- Is sharding appropriate for a new deployment?
- How does sharding work with replication?
- Can I change the shard key after sharding a collection?
- What happens to unsharded collections in sharded databases?
- How does MongoDB distribute data across shards?
- What happens if a client updates a document in a chunk during a migration?
- What happens to queries if a shard is inaccessible or slow?
- How does MongoDB distribute queries among shards?
- How does MongoDB sort queries in sharded environments?
- How does MongoDB ensure unique
_id
field values when using a shard key other than_id
? - I’ve enabled sharding and added a second shard, but all the data is still on one server. Why?
- Is it safe to remove old files in the
moveChunk
directory? - How does
mongos
use connections? - Why does
mongos
hold connections open? - Where does MongoDB report on connections used by
mongos
? - What does
writebacklisten
in the log mean? - How should administrators deal with failed migrations?
- What is the process for moving, renaming, or changing the number of config servers?
- When do the
mongos
servers detect config server changes? - Is it possible to quickly update
mongos
servers after updating a replica set configuration? - What does the
maxConns
setting onmongos
do? - How do indexes impact queries in sharded systems?
- Can shard keys be randomly generated?
- Can shard keys have a non-uniform distribution of values?
- Can you shard on the
_id
field? - Can shard key be in ascending order, like dates or timestamps?
- What do
moveChunk commit failed
errors mean? - How does draining a shard affect the balancing of uneven chunk distribution?
Is sharding appropriate for a new deployment?¶
Sometimes.
If your data set fits on a single server, you should begin with an unsharded deployment.
Converting an unsharded database to a sharded cluster is easy and seamless, so there is little advantage in configuring sharding while your data set is small.
Still, all production deployments should use replica sets to provide high availability and disaster recovery.
How does sharding work with replication?¶
To use replication with sharding, deploy each shard as a replica set.
Can I change the shard key after sharding a collection?¶
No.
There is no automatic support in MongoDB for changing a shard key after sharding a collection. This reality underscores the important of choosing a good shard key. If you must change a shard key after sharding a collection, the best option is to:
- dump all data from MongoDB into an external format.
- drop the original sharded collection.
- configure sharding using a more ideal shard key.
- pre-split the shard key range to ensure initial even distribution.
- restore the dumped data into MongoDB.
See shardCollection
, sh.shardCollection()
,
Sharded Cluster Administration, the Shard Key section in the
Sharded Cluster Internals and Behaviors document,
Deploy a Sharded Cluster, and SERVER-4000 for
more information.
What happens to unsharded collections in sharded databases?¶
In the current implementation, all databases in a sharded cluster have a “primary shard.” All unsharded collection within that database will reside on the same shard.
How does MongoDB distribute data across shards?¶
Sharding must be specifically enabled on a collection. After enabling sharding on the collection, MongoDB will assign various ranges of collection data to the different shards in the cluster. The cluster automatically corrects imbalances between shards by migrating ranges of data from one shard to another.
What happens if a client updates a document in a chunk during a migration?¶
The mongos
routes the operation to the “old” shard, where
it will succeed immediately. Then the shard mongod
instances will replicate the modification to the “new” shard before
the sharded cluster updates that chunk’s “ownership,” which
effectively finalizes the migration process.
What happens to queries if a shard is inaccessible or slow?¶
If a shard is inaccessible or unavailable, queries will return with an error.
However, a client may set the partial
query bit, which will then
return results from all available shards, regardless of whether a
given shard is unavailable.
If a shard is responding slowly, mongos
will merely wait
for the shard to return results.
How does MongoDB distribute queries among shards?¶
Changed in version 2.0.
The exact method for distributing queries to shards in a
cluster depends on the nature of the query and the configuration of
the sharded cluster. Consider a sharded collection, using the
shard key user_id
, that has last_login
and
email
attributes:
For a query that selects one or more values for the
user_id
key:mongos
determines which shard or shards contains the relevant data, based on the cluster metadata, and directs a query to the required shard or shards, and returns those results to the client.For a query that selects
user_id
and also performs a sort:mongos
can make a straightforward translation of this operation into a number of queries against the relevant shards, ordered byuser_id
. When the sorted queries return from all shards, themongos
merges the sorted results and returns the complete result to the client.For queries that select on
last_login
:These queries must run on all shards:
mongos
must parallelize the query over the shards and perform a merge-sort on theemail
of the documents found.
How does MongoDB sort queries in sharded environments?¶
If you call the cursor.sort()
method on a query in a sharded
environment, the mongod
for each shard will sort its
results, and the mongos
merges each shard’s results before returning
them to the client.
How does MongoDB ensure unique _id
field values when using a shard key other than _id
?¶
If you do not use _id
as the shard key, then your
application/client layer must be responsible for keeping the _id
field unique. It is problematic for collections to have duplicate
_id
values.
If you’re not sharding your collection by the _id
field, then you
should be sure to store a globally unique identifier in that
field. The default BSON ObjectId works well in
this case.
I’ve enabled sharding and added a second shard, but all the data is still on one server. Why?¶
First, ensure that you’ve declared a shard key for your collection. Until you have configured the shard key, MongoDB will not create chunks, and sharding will not occur.
Next, keep in mind that the default chunk size is 64 MB. As a result, in most situations, the collection needs to have at least 64 MB of data before a migration will occur.
Additionally, the system which balances chunks among the servers attempts to avoid superfluous migrations. Depending on the number of shards, your shard key, and the amount of data, systems often require at least 10 chunks of data to trigger migrations.
You can run db.printShardingStatus()
to see all the chunks
present in your cluster.
Is it safe to remove old files in the moveChunk
directory?¶
Yes. mongod
creates these files as backups during normal
shard balancing operations.
Once these migrations are complete, you may delete these files.
You can set noMoveParanoia
to true
to disable this
behavior.
How does mongos
use connections?¶
Each client maintains a connection to a mongos
instance.
Each mongos
instance maintains a pool of connections to the
members of a replica set supporting the sharded cluster. Clients use
connections between mongos
and mongod
instances
one at a time. Requests are not multiplexed or pipelined. When client
requests complete, the mongos
returns the connection to the
pool.
See the System Resource Utilization section of the UNIX ulimit Settings document.
Why does mongos
hold connections open?¶
mongos
uses a set of connection pools to communicate with
each shard. These pools do not shrink when the number of
clients decreases.
This can lead to an unused mongos
with a large number
of open connections. If the mongos
is no longer in use,
it is safe to restart the process to close existing connections.
Where does MongoDB report on connections used by mongos
?¶
Connect to the mongos
with the mongo
shell, and
run the following command:
What does writebacklisten
in the log mean?¶
The writeback listener is a process that opens a long poll to relay
writes back from a mongod
or mongos
after
migrations to make sure they have not gone to the wrong server. The
writeback listener sends writes back to the correct server if
necessary.
These messages are a key part of the sharding infrastructure and should not cause concern.
How should administrators deal with failed migrations?¶
Failed migrations require no administrative intervention. Chunk moves are consistent and deterministic.
If a migration fails to complete for some reason, the cluster will retry the operation. When the migration completes successfully, the data will reside only on the new shard.
What is the process for moving, renaming, or changing the number of config servers?¶
See
Manage the Config Servers which describes this process.
When do the mongos
servers detect config server changes?¶
mongos
instances maintain a cache of the config
database that holds the metadata for the sharded cluster. This
metadata includes the mapping of chunks to
shards.
mongos
updates its cache lazily by issuing a request to a
shard and discovering that its metadata is out of date. There is no
way to control this behavior from the client, but you can run the
flushRouterConfig
command against any mongos
to force it to refresh its cache.
Is it possible to quickly update mongos
servers after updating a replica set configuration?¶
The mongos
instances will detect these changes without
intervention over time. However, if you want to force the
mongos
to reload its configuration, run the
flushRouterConfig
command against to each
mongos
directly.
What does the maxConns
setting on mongos
do?¶
The maxConns
option limits the number of connections
accepted by mongos
.
If your client driver or application creates a large number of
connections but allows them to time out rather than closing them
explicitly, then it might make sense to limit the number of
connections at the mongos
layer.
Set maxConns
to a value slightly higher than the
maximum number of connections that the client creates, or the maximum
size of the connection pool. This setting prevents the
mongos
from causing connection spikes on the individual
shards. Spikes like these may disrupt the operation
and memory allocation of the sharded cluster.
How do indexes impact queries in sharded systems?¶
If the query does not include the shard key, the
mongos
must send the query to all shards as a
“scatter/gather” operation. Each shard will, in turn, use either the
shard key index or another more efficient index to fulfill the query.
If the query includes multiple sub-expressions that reference the
fields indexed by the shard key and the secondary index, the
mongos
can route the queries to a specific shard and the
shard will use the index that will allow it to fulfill most
efficiently. See this presentation
for more information.
Can shard keys be randomly generated?¶
Shard keys can be random. Random keys ensure optimal distribution of data across the cluster.
Sharded clusters, attempt to route queries to specific shards when queries include the shard key as a parameter, because these directed queries are more efficient. In many cases, random keys can make it difficult to direct queries to specific shards.
Can shard keys have a non-uniform distribution of values?¶
Yes. There is no requirement that documents be evenly distributed by the shard key.
However, documents that have the same shard key must reside in the same chunk and therefore on the same server. If your sharded data set has too many documents with the exact same shard key you will not be able to distribute those documents across your sharded cluster.
Can you shard on the _id
field?¶
You can use any field for the shard key. The _id
field is a common
shard key.
Be aware that ObjectId()
values, which are the default value of
the _id
field, increment as a timestamp. As a result, when used as
a shard key, all new documents inserted into the collection will
initially belong to the same chunk on a single shard. Although the
system will eventually divide this chunk and migrate its contents to
distribute data more evenly, at any moment the cluster can only direct
insert operations at a single shard. This can limit the throughput of
inserts. If most of your write operations are updates or read
operations rather than inserts, this limitation should not impact
your performance. However, if you have a high insert volume, this may
be a limitation.
Can shard key be in ascending order, like dates or timestamps?¶
If you insert documents with monotonically increasing shard keys, all inserts will initially belong to the same chunk on a single shard. Although the system will eventually divide this chunk and migrate its contents to distribute data more evenly, at any moment the cluster can only direct insert operations at a single shard. This can limit the throughput of inserts.
If most of your write operations are updates or read operations rather than inserts, this limitation should not impact your performance. However, if you have a high insert volume, a monotonically increasing shard key may be a limitation.
To address this issue, you can use a field with a value that stores the hash of a key with an ascending value. While you can compute a hashed value in your application and include this value in your documents for use as a shard key, the SERVER-2001 issue will implement this capability within MongoDB.
What do moveChunk commit failed
errors mean?¶
Consider the following error message:
mongod
issues this message if, during a chunk
migration, the shard could not
connect to the config database to update chunk information at
the end of the migration process. If
the shard cannot update the config database after
moveChunk
, the cluster will have an inconsistent
view of all chunks. In these situations, the primary member of
the shard will terminate
itself to prevent data inconsistency. If the secondary member
can access the config database, the shard’s data will be
accessible after an election. Administrators will need to resolve the
chunk migration failure independently.
If you encounter this issue, contact the MongoDB User Group or MongoDB support to address this issue.
How does draining a shard affect the balancing of uneven chunk distribution?¶
The sharded cluster balancing process controls both migrating chunks from decommissioned shards (i.e. draining,) and normal cluster balancing activities. Consider the following behaviors for different versions of MongoDB in situations where you remove a shard in a cluster with an uneven chunk distribution:
- After MongoDB 2.2, the balancer first removes the chunks from the draining shard and then balances the remaining uneven chunk distribution.
- Before MongoDB 2.2, the balancer handles the uneven chunk distribution and then removes the chunks from the draining shard.