- Data Models >
- Data Modeling Concepts >
- Operational Factors and Data Models
Operational Factors and Data Models¶
On this page
Modeling application data for MongoDB depends on both the data itself, as well as the characteristics of MongoDB itself. For example, different data models may allow applications to use more efficient queries, increase the throughput of insert and update operations, or distribute activity to a sharded cluster more effectively.
These factors are operational or address requirements that arise outside of the application but impact the performance of MongoDB based applications. When developing a data model, analyze all of your application’s read operations and write operations in conjunction with the following considerations.
Document Growth¶
Some updates to documents can increase the size of documents.
These updates include pushing elements to an array (i.e.
$push
) and adding new fields to a document. If the document
size exceeds the allocated space for that document, MongoDB will
relocate the document on disk. Relocating documents takes longer than
in place updates and can lead to fragmented storage. Although
MongoDB automatically adds padding to document allocations to minimize the likelihood of relocation, data
models should avoid document growth when possible.
For instance, if your applications require updates that will cause document growth, you may want to refactor your data model to use references between data in distinct documents rather than a denormalized data model.
MongoDB adaptively adjusts the amount of automatic padding to reduce occurrences of relocation. You may also use a pre-allocation strategy to explicitly avoid document growth. Refer to the Pre-Aggregated Reports Use Case for an example of the pre-allocation approach to handling document growth.
See Storage for more information on MongoDB’s storage model and record allocation strategies.
Atomicity¶
In MongoDB, operations are atomic at the document level. No single write operation can change more than one document. Operations that modify more than a single document in a collection still operate on one document at a time. [1] Ensure that your application stores all fields with atomic dependency requirements in the same document. If the application can tolerate non-atomic updates for two pieces of data, you can store these data in separate documents.
A data model that embeds related data in a single document facilitates these kinds of atomic operations. For data models that store references between related pieces of data, the application must issue separate read and write operations to retrieve and modify these related pieces of data.
See Model Data for Atomic Operations for an example data model that provides atomic updates for a single document.
[1] | Document-level atomic operations include all operations within a single MongoDB document record: operations that affect multiple embedded documents within that single record are still atomic. |
Sharding¶
MongoDB uses sharding to provide horizontal scaling. These
clusters support deployments with large data sets and high-throughput
operations. Sharding allows users to partition a
collection within a database to distribute the collection’s
documents across a number of mongod
instances or
shards.
To distribute data and application traffic in a sharded collection, MongoDB uses the shard key. Selecting the proper shard key has significant implications for performance, and can enable or prevent query isolation and increased write capacity. It is important to consider carefully the field or fields to use as the shard key.
See Sharding Introduction and Shard Keys for more information.
Indexes¶
Use indexes to improve performance for common queries. Build indexes on
fields that appear often in queries and for all operations that return
sorted results. MongoDB automatically creates a unique index on the
_id
field.
As you create indexes, consider the following behaviors of indexes:
- Each index requires at least 8KB of data space.
- Adding an index has some negative performance impact for write operations. For collections with high write-to-read ratio, indexes are expensive since each insert must also update any indexes.
- Collections with high read-to-write ratio often benefit from additional indexes. Indexes do not affect un-indexed read operations.
- When active, each index consumes disk space and memory. This usage can be significant and should be tracked for capacity planning, especially for concerns over working set size.
See Indexing Strategies for more information on indexes as well as Analyze Query Performance. Additionally, the MongoDB database profiler may help identify inefficient queries.
Large Number of Collections¶
In certain situations, you might choose to store related information in several collections rather than in a single collection.
Consider a sample collection logs
that stores log documents for
various environment and applications. The logs
collection contains
documents of the following form:
If the total number of documents is low, you may group documents into
collection by type. For logs, consider maintaining distinct log
collections, such as logs_dev
and logs_debug
. The logs_dev
collection would contain only the documents related to the dev
environment.
Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.
When using models that have a large number of collections, consider the following behaviors:
Each collection has a certain minimum overhead of a few kilobytes.
Each index, including the index on
_id
, requires at least 8KB of data space.For each database, a single namespace file (i.e.
<database>.ns
) stores all meta-data for that database, and each index and collection has its own entry in the namespace file. MongoDB placeslimits on the size of namespace files
.MongoDB has
limits on the number of namespaces
. You may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support. To get the current number of namespaces, run the following in themongo
shell:The limit on the number of namespaces depend on the
<database>.ns
size. The namespace file defaults to 16 MB.To change the size of the new namespace file, start the server with the option
--nssize <new size MB>
. For existing databases, after starting up the server with--nssize
, run thedb.repairDatabase()
command from themongo
shell. For impacts and considerations on runningdb.repairDatabase()
, seerepairDatabase
.
Data Lifecycle Management¶
Data modeling decisions should take data lifecycle management into consideration.
The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and efficiently support operations that insert and read documents based on insertion order.