
Data Modeling Considerations for MongoDB Applications

Overview

Data in MongoDB has a flexible schema. Collections do not enforce document structure. This means that:

  • documents in the same collection do not need to have the same set of fields or structure, and
  • common fields in a collection’s documents may hold different types of data.

Each document needs to contain only the fields relevant to the entity or object that the document represents. In practice, most documents in a collection share a similar structure. Schema flexibility means that you can model your documents in MongoDB so that they closely resemble and reflect application-level objects.

As in all data modeling, when developing data models (i.e., schema designs) for MongoDB, you must consider the inherent properties and requirements of the application objects and the relationships between application objects. MongoDB data models must also reflect:

  • how data will grow and change over time, and
  • the kinds of queries your application will perform.

These considerations and requirements force developers to make a number of multi-factored decisions when modeling data, including:

  • normalization and de-normalization.

    These decisions reflect the degree to which the data model should store related pieces of data in a single document or describe relationships using references between documents.

  • indexing strategy.

  • representation of data in arrays in BSON.

A number of data models may be functionally equivalent for a given application; however, different data models may have significant impacts on MongoDB and application performance.

This document provides a high level overview of these data modeling decisions and factors. In addition, consider the Data Modeling Patterns and Examples section, which provides more concrete examples of all the discussed patterns.

Data Modeling Decisions

Data modeling decisions involve determining how to structure the documents to model the data effectively. The primary decision is whether to embed or to use references.

Embedding

To de-normalize data, store two related pieces of data in a single document.

Operations within a document are less expensive for the server than operations that involve multiple documents.

In general, use embedded data models when:

  • you have "contains" relationships between entities, and
  • you have one-to-many relationships where the "many" objects always appear with, or are viewed in the context of, their parent documents.

Embedding provides the following benefits:

  • generally better performance for read operations.
  • the ability to request and retrieve related data in a single database operation.
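
For example, a minimal sketch of an embedded data model, using a hypothetical patron document that stores its related address data in an embedded sub-document:

{
   _id: "joe",
   name: "Joe Bookreader",
   address: {
      street: "123 Fake Street",
      city: "Faketon",
      state: "MA"
   }
}

A single query against the patron collection returns both the patron and address data together.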

Embedding related data in documents can lead to situations where documents grow after creation. Document growth can impact write performance and lead to data fragmentation. Furthermore, documents in MongoDB must be smaller than the maximum BSON document size. For larger documents, consider using GridFS.

See also

  • dot notation for information on “reaching into” embedded sub-documents.
  • Arrays for more examples on accessing arrays.
  • Subdocuments for more examples on accessing subdocuments.

Referencing

To normalize data, store references between two documents to indicate a relationship between the data represented in each document.

In general, use normalized data models:

  • when embedding would result in duplication of data but would not provide sufficient read performance advantages to outweigh the implications of the duplication.
  • to represent more complex many-to-many relationships.
  • to model large hierarchical data sets. See Model Tree Structures in MongoDB.

Referencing provides more flexibility than embedding; however, to resolve the references, client-side applications must issue follow-up queries. In other words, using references requires more roundtrips to the server.
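
For example, a minimal sketch of a normalized model, using hypothetical books and publishers collections in which each book references its publisher by _id:

// a document in the publishers collection
{ _id: "oreilly", name: "O'Reilly Media" }

// a document in the books collection references its publisher by _id
{ _id: 123456789, title: "MongoDB: The Definitive Guide", publisher_id: "oreilly" }

// resolving the reference requires a second, follow-up query
book = db.books.findOne( { _id: 123456789 } )
db.publishers.findOne( { _id: book.publisher_id } )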

See Model Referenced One-to-Many Relationships Between Documents for an example of referencing.

Atomicity

MongoDB only provides atomic operations at the level of a single document. [1] As a result, the need for atomic operations influences the decision to use embedded or referenced relationships when modeling data for MongoDB.

Embed fields that need to be modified together atomically in the same document. See Model Data for Atomic Operations for an example of atomic updates within a single document.
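
For example, a minimal sketch of an atomic update against a hypothetical products collection, which decrements a quantity and records the matching reservation in one operation:

db.products.update(
   { _id: 123456, quantity: { $gt: 0 } },
   { $inc: { quantity: -1 },
     $push: { reservations: { customer: "abc123", date: new Date() } } }
)

Because both modifiers apply to a single document, no other operation can observe the decremented quantity without the corresponding reservation.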

[1] Document-level atomic operations include all operations within a single MongoDB document record: operations that affect multiple sub-documents within that single record are still atomic.

Operational Considerations

In addition to normalization and de-normalization concerns, a number of other operational factors help shape data modeling decisions in MongoDB. These factors include:

  • data lifecycle management,
  • number of collections,
  • indexing requirements,
  • sharding, and
  • managing document growth.

These factors have implications for database and application performance as well as future maintenance and development costs.

Data Lifecycle Management

Data modeling decisions should also take data lifecycle management into consideration.

The Time to Live or TTL feature of collections expires documents after a period of time. Consider using the TTL feature if your application requires some data to persist in the database for a limited period of time.
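
For example, a minimal sketch of a TTL index on a hypothetical log_events collection that expires documents one hour after their createdAt value:

db.log_events.ensureIndex( { createdAt: 1 }, { expireAfterSeconds: 3600 } )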

Additionally, if your application only uses recently inserted documents, consider Capped Collections. Capped collections provide first-in-first-out (FIFO) management of inserted documents and are optimized to support operations that insert and read documents based on insertion order.
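
For example, a sketch that creates a hypothetical capped collection for log messages, limited to 5 megabytes and 5000 documents:

db.createCollection( "log", { capped: true, size: 5242880, max: 5000 } )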

Large Number of Collections

In certain situations, you might choose to store information in several collections rather than in a single collection.

Consider a sample collection logs that stores log documents for various environments and applications. The logs collection contains documents of the following form:

{ log: "dev", ts: ..., info: ... }
{ log: "debug", ts: ..., info: ...}

If the total number of documents is low, you may group documents into collections by type. For logs, consider maintaining distinct log collections, such as logs.dev and logs.debug. The logs.dev collection would contain only the documents related to the dev environment.
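
In the mongo shell, such dotted collection names can be addressed directly; a brief sketch with hypothetical values:

// writes go straight to the collection for the matching environment
db.logs.dev.insert( { ts: new Date(), info: "connection accepted" } )
db.logs.dev.find().sort( { ts: -1 } )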

Generally, having a large number of collections has no significant performance penalty and results in very good performance. Distinct collections are very important for high-throughput batch processing.

When using models that have a large number of collections, consider the following behaviors:

  • Each collection has a certain minimum overhead of a few kilobytes.
  • Each index, including the index on _id, requires at least 8KB of data space.

A single <database>.ns file stores all meta-data for each database. Each index and collection has its own entry in the namespace file, and MongoDB places limits on the size of namespace files.

Because of limits on namespaces, you may wish to know the current number of namespaces in order to determine how many additional namespaces the database can support, as in the following example:

db.system.namespaces.count()

The <database>.ns file defaults to 16 MB. To change the size of the <database>.ns file, pass a new size in megabytes to the --nssize option on server start.

The --nssize option sets the size for new <database>.ns files. For existing databases, after starting up the server with --nssize, run the db.repairDatabase() command from the mongo shell to apply the new size.
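
For example, to start the server so that new databases use 32 megabyte namespace files:

mongod --nssize 32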

Indexes

Create indexes to support common queries. Generally, indexes and index use in MongoDB correspond to indexes and index use in relational databases: build indexes on fields that appear often in queries and for all operations that return sorted results. MongoDB automatically creates a unique index on the _id field.
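
For example, if an application frequently queries a hypothetical accounts collection by username and sorts the results by a created field, a single compound index supports both the query and the sort:

db.accounts.ensureIndex( { username: 1, created: -1 } )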

As you create indexes, consider the following behaviors of indexes:

  • Each index requires at least 8KB of data space.
  • Adding an index has some negative performance impact for write operations. For collections with a high write-to-read ratio, indexes are expensive since each insert must add keys to each index.
  • Collections with a high proportion of read operations to write operations often benefit from additional indexes. Indexes do not affect un-indexed read operations.

See Indexing Strategies for more information on determining indexes. Additionally, the MongoDB database profiler may help identify inefficient queries.

Sharding

Sharding allows users to partition a collection within a database to distribute the collection’s documents across a number of mongod instances or shards.

The shard key determines how MongoDB distributes data among shards in a sharded collection. Selecting the proper shard key has significant implications for performance.
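
For example, a minimal sketch that shards a hypothetical records.people collection on a zipcode shard key:

sh.enableSharding( "records" )
sh.shardCollection( "records.people", { zipcode: 1 } )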

See Sharded Cluster Overview for more information on sharding and the selection of the shard key.

Document Growth

Certain updates to documents can increase the document size, such as pushing elements to an array and adding new fields. If the document size exceeds the allocated space for that document, MongoDB relocates the document on disk. This internal relocation can be both time- and resource-consuming.

Although MongoDB automatically provides padding to minimize the occurrence of relocations, you may still need to manually handle document growth. Refer to the Pre-Aggregated Reports Use Case Study for an example of the pre-allocation approach to handling document growth.
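
A minimal sketch of pre-allocation, using a hypothetical collection of daily statistics documents: insert each document at its full size with placeholder values, so that later updates change values in place rather than growing the document:

// pre-allocate all hourly counters when the day's document is created
db.stats.insert( { _id: "site-2013-01-01",
                   hours: { "0": 0, "1": 0, "2": 0, /* ... */ "23": 0 } } )

// subsequent increments modify the document without changing its size
db.stats.update( { _id: "site-2013-01-01" }, { $inc: { "hours.5": 1 } } )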