Data Modeling Considerations for MongoDB Applications

Overview
Data in MongoDB has a flexible schema. Collections do not enforce document structure. This means that:
- documents in the same collection do not need to have the same set of fields or structure, and
- common fields in a collection’s documents may hold different types of data.
Each document only needs to contain fields relevant to the entity or object that the document represents. In practice, most documents in a collection share a similar structure. Schema flexibility means that you can model your documents in MongoDB so that they closely resemble and reflect application-level objects.
As in all data modeling, when developing data models (i.e., schema designs) for MongoDB you must consider the inherent properties and requirements of the application objects and the relationships between application objects. MongoDB data models must also reflect:
- how data will grow and change over time, and
- the kinds of queries your application will perform.
These considerations and requirements force developers to make a number of multi-factored decisions when modeling data, including:
- normalization and de-normalization. These decisions determine the degree to which the data model should store related pieces of data in a single document or instead describe relationships using references between documents.
- representation of data in arrays in BSON.
A number of data models may be functionally equivalent for a given application; however, different data models may have significant impacts on MongoDB and application performance.
This document provides a high level overview of these data modeling decisions and factors. In addition, consider the Data Modeling Patterns and Examples section, which provides more concrete examples of all the discussed patterns.
Data Modeling Decisions
Data modeling decisions involve determining how to structure the documents to model the data effectively. The primary decision is whether to embed or to use references.
Embedding
To de-normalize data, store two related pieces of data in a single document.
Operations within a document are less expensive for the server than operations that involve multiple documents.
In general, use embedded data models when:
- you have “contains” relationships between entities. See Model Embedded One-to-One Relationships Between Documents.
- you have one-to-many relationships where the “many” objects always appear with or are viewed in the context of their parent documents. See Model Embedded One-to-Many Relationships Between Documents.
Embedding provides the following benefits:
- generally better performance for read operations.
- the ability to request and retrieve related data in a single database operation.
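As a sketch of the embedded approach, the hypothetical patron document below (names and fields are illustrative, not from the original) stores the related address as a sub-document, so a single read returns both entities:

```javascript
// Hypothetical patron document that embeds the related address
// sub-document; one query retrieves both entities.
var patron = {
  _id: "joe",
  name: "Joe Bookreader",
  address: {             // embedded one-to-one relationship
    street: "123 Fake Street",
    city: "Faketon",
    zip: "12345"
  }
};

// "Reaching into" the embedded sub-document with dot notation:
console.log(patron.address.city);  // → Faketon
```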
Embedding related data in documents can lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation. Furthermore, documents in MongoDB must be smaller than the maximum BSON document size. For larger documents, consider using GridFS.
See also
- dot notation for information on “reaching into” embedded sub-documents.
- Arrays for more examples on accessing arrays.
- Subdocuments for more examples on accessing subdocuments.
Referencing
To normalize data, store references between two documents to indicate a relationship between the data represented in each document.
In general, use normalized data models:
- when embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
- to represent more complex many-to-many relationships.
- to model large hierarchical data sets. See Model Tree Structures in MongoDB.
Referencing provides more flexibility than embedding; however, to resolve the references, client-side applications must issue follow-up queries. In other words, using references requires more roundtrips to the server.
See Model Referenced One-to-Many Relationships Between Documents for an example of referencing.
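A sketch of the normalized approach, using a hypothetical publisher/book pair (names are illustrative): the book stores only the publisher's _id, and the application resolves the reference with a follow-up query. Here a plain lookup table stands in for that second round trip:

```javascript
// Hypothetical normalized model: the book references its publisher
// by _id instead of embedding the publisher document.
var publisher = { _id: "oreilly", name: "O'Reilly Media", founded: 1980 };
var book = {
  _id: 123456789,
  title: "MongoDB: The Definitive Guide",
  publisher_id: "oreilly"   // reference, resolved by a follow-up query
};

// Resolving the reference costs a second round trip to the server;
// a plain object simulates that second query here.
var publishersById = { oreilly: publisher };
var resolved = publishersById[book.publisher_id];
console.log(resolved.name);  // → O'Reilly Media
```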
Atomicity
MongoDB only provides atomic operations on the level of a single document. [1] As a result, the need for atomic operations influences decisions about whether to use embedded or referenced relationships when modeling data for MongoDB.
Embed fields that need to be modified together atomically in the same document. See Model Data for Atomic Operations for an example of atomic updates within a single document.
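For instance, a hypothetical book-checkout update (collection and field names assumed for illustration) can decrement the available count and record the borrower in one operation, because both fields live in the same document:

```javascript
// Hypothetical checkout update: both modified fields belong to the
// same document, so MongoDB applies the whole update atomically.
// In the mongo shell this pair would be passed to db.books.update(query, update).
var query  = { _id: 123456789, available: { $gt: 0 } };
var update = {
  $inc:  { available: -1 },           // decrement available copies
  $push: { checkout: { by: "joe" } }  // record the borrower
};

// Both modifications target sub-fields of one document record.
console.log(Object.keys(update));  // → [ '$inc', '$push' ]
```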
[1] Document-level atomic operations include all operations within a single MongoDB document record: operations that affect multiple sub-documents within that single record are still atomic.
Operational Considerations

In addition to normalization and de-normalization concerns, a number of other operational factors help shape data modeling decisions in MongoDB. These factors include:
- data lifecycle management,
- number of collections,
- indexing requirements,
- sharding, and
- managing document growth.
These factors have implications for database and application performance, as well as for future maintenance and development costs.
Data Lifecycle Management
Data modeling decisions should also take data lifecycle management into consideration.
The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and are optimized to support operations that insert and read documents based on insertion order.
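As a sketch, a TTL index is declared by passing an expireAfterSeconds option when creating an index on a date field; the collection and field names below are hypothetical:

```javascript
// Hypothetical TTL declaration. In the mongo shell this would be:
//   db.log_events.ensureIndex({ createdAt: 1 }, { expireAfterSeconds: 3600 })
// Documents are removed roughly 3600 seconds after their createdAt value.
var keyPattern = { createdAt: 1 };
var options    = { expireAfterSeconds: 3600 };

console.log(options.expireAfterSeconds);  // → 3600
```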
Large Number of Collections
In certain situations, you might choose to store information in several collections rather than in a single collection.
Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:
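The original example document is not reproduced in this copy; a plausible shape for such a log entry, with hypothetical field names, is:

```javascript
// Hypothetical log entry: an environment tag plus arbitrary payload.
var logEntry = {
  time: new Date("2014-03-01T09:14:00Z"),
  env: "dev",                   // environment: dev, debug, prod, ...
  app: "inventory-service",
  message: "connection pool exhausted"
};
```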
If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs.dev and logs.debug. The logs.dev collection would contain only the documents related to the dev environment.
Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.
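A sketch of routing writes to per-environment collections, assuming the logs.dev naming convention described above:

```javascript
// Route each entry to a collection named after its environment,
// e.g. env "dev" → collection "logs.dev".
function logCollectionName(env) {
  return "logs." + env;
}

console.log(logCollectionName("dev"));    // → logs.dev
console.log(logCollectionName("debug"));  // → logs.debug
```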
When using models that have a large number of collections, consider the following behaviors:
- Each collection has a certain minimum overhead of a few kilobytes.
- Each index, including the index on _id, requires at least 8KB of data space.
A single <database>.ns file stores all meta-data for each database. Each index and collection has its own entry in the namespace file, and MongoDB places limits on the size of namespace files.
Because of limits on namespaces, you may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support, as in the following example:
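The original example is not reproduced in this copy. On the legacy MMAPv1 storage engine the current count comes from the system.namespaces collection in the mongo shell; a sketch of the headroom estimate (the ~24,000 figure is the approximate capacity of a default 16 MB namespace file, and the current count below is hypothetical):

```javascript
// In the mongo shell (legacy MMAPv1 storage engine), the current
// namespace count comes from:
//   db.system.namespaces.count()
// A default 16 MB .ns file supports roughly 24,000 namespaces, so the
// remaining headroom can be estimated as:
var NAMESPACES_PER_16MB = 24000;  // approximate capacity of a 16 MB file
var current = 3500;               // hypothetical result of count()
var remaining = NAMESPACES_PER_16MB - current;
console.log(remaining);  // → 20500
```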
The <database>.ns file defaults to 16 MB. To change the size of the <database>.ns file, pass a new size in megabytes to the --nssize option on server start.
The --nssize option sets the size for new <database>.ns files. For existing databases, after starting up the server with --nssize, run the db.repairDatabase() command from the mongo shell.
Indexes

Create indexes to support common queries. Generally, indexes and index use in MongoDB correspond to indexes and index use in relational databases: build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
As you create indexes, consider the following behaviors of indexes:
- Each index requires at least 8KB of data space.
- Adding an index has some negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive since each insert must also add keys to each index.
- Collections with a high proportion of read operations to write operations often benefit from additional indexes. Indexes do not affect un-indexed read operations.
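As a sketch, a compound index supporting a common query-plus-sort on the hypothetical logs collection from earlier (field names assumed):

```javascript
// Hypothetical compound index: equality match on env, newest first.
// In the mongo shell: db.logs.ensureIndex({ env: 1, time: -1 })
// It supports queries such as:
//   db.logs.find({ env: "dev" }).sort({ time: -1 })
var indexKeys = { env: 1, time: -1 };  // 1 ascending, -1 descending
```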
See Indexing Strategies for more information on determining indexes. Additionally, the MongoDB database profiler may help identify inefficient queries.
Sharding

Sharding allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards.
The shard key determines how MongoDB distributes data among shards in a sharded collection. Selecting the proper shard key has significant implications for performance.
See Sharded Cluster Overview for more information on sharding and the selection of the shard key.
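As a simplified sketch of how a range-based shard key drives data distribution, each shard owns a contiguous range of shard-key values; the chunk ranges and shard names below are hypothetical:

```javascript
// Simplified range-based sharding: each shard owns a contiguous
// range of shard-key values (ranges here are hypothetical).
var chunks = [
  { min: "a", max: "m", shard: "shard0" },
  { min: "m", max: "{", shard: "shard1" }  // "{" sorts after "z"
];

// Find which shard owns a given shard-key value.
function shardFor(keyValue) {
  for (var i = 0; i < chunks.length; i++) {
    if (keyValue >= chunks[i].min && keyValue < chunks[i].max) {
      return chunks[i].shard;
    }
  }
  return null;
}

console.log(shardFor("dev"));   // → shard0
console.log(shardFor("prod"));  // → shard1
```

A key that all writes share (e.g. a constant) would send every insert to one shard, which is why shard key selection matters so much for performance.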
Document Growth
Certain updates to documents can increase the document size, such as pushing elements to an array and adding new fields. If the document size exceeds the allocated space for that document, MongoDB relocates the document on disk. This internal relocation can be both time and resource consuming.
Although MongoDB automatically provides padding to minimize the occurrence of relocations, you may still need to manually handle document growth. Refer to Pre-Aggregated Reports Use Case Study for an example of the Pre-allocation approach to handle document growth.
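A sketch of the pre-allocation approach: create the document at its final size up front, zero-filled, so later updates overwrite fields in place instead of growing the document. The daily-counter shape below is hypothetical:

```javascript
// Pre-allocate a zero-filled hourly counter map so that incrementing
// a counter later never changes the document's size on disk.
function preallocatedDaily(date) {
  var hours = {};
  for (var h = 0; h < 24; h++) {
    hours[h] = 0;  // reserve space for every hour up front
  }
  return { _id: date, hourly: hours };
}

var doc = preallocatedDaily("2014-03-01");
console.log(Object.keys(doc.hourly).length);  // → 24
```

Later updates such as { $inc: { "hourly.13": 1 } } then modify the document without relocating it.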
Data Modeling Patterns and Examples
The following documents provide overviews of various data modeling patterns and common schema design considerations:
- Model Embedded One-to-One Relationships Between Documents
- Model Embedded One-to-Many Relationships Between Documents
- Model Referenced One-to-Many Relationships Between Documents
- Model Data for Atomic Operations
- Model Tree Structures with Parent References
- Model Tree Structures with Child References
- Model Tree Structures with Materialized Paths
- Model Tree Structures with Nested Sets
For more information and examples of real-world data modeling, consider the following external resources: