Thinking in Documents: Part 2



In part 1 of this two-part blog series, we introduced the concept of documents and some of the advantages they provide. In this part, we start to put documents into action by discussing schema design. We cover how to manage related data with embedding and referencing, and touch on indexing and the MongoDB transaction model.

Defining Your Document Schema

You should start the schema design process by considering the application’s query requirements. The data should be modeled in a way that takes advantage of the document model’s flexibility. When migrating a data model from relational tables to documents, it can be tempting to mirror the relational database’s flat schema in the document model. However, this approach negates the advantages enabled by the document model’s rich, embedded data structures.

The application’s data access patterns should govern schema design, with specific understanding of:

  • The read/write ratio of database operations.
  • The types of queries and updates performed by the database.
  • The life-cycle of the data and growth rate of documents.

If you are coming from a relational background, a good first step is to identify the operations performed on the application’s data, comparing:

  1. How these would be implemented by a relational database;
  2. How MongoDB could implement them.

Figure 1 represents an example of this exercise.

| Application | RDBMS Action | MongoDB Action |
|---|---|---|
| Create Product Record | INSERT to (n) tables (product description, price, manufacturer, etc.) | insert() to one document with sub-documents, arrays |
| Display Product Record | SELECT and JOIN (n) product tables | find() document |
| Add Product Review | INSERT to "review" table, foreign key to product record | insert() to "review" collection, reference to product document |
| More actions... | ... | ... |
**Figure 1: Analyzing queries to design the optimum schema**

This analysis helps to identify the ideal document schema for the application data and workload, based on the queries and operations to be performed against it.

If migrating from a relational database, you can also identify the existing application's most common queries by analyzing the logs maintained by the RDBMS. This analysis identifies the data that is most frequently accessed together, and can therefore potentially be stored together within a single MongoDB document.

Modeling Relationships with Embedding and Referencing

Deciding when to embed a document or instead create a reference between separate documents in different collections is an application-specific consideration. There are, however, some general considerations to guide the decision during schema design.


Data with a 1:1 or 1:Many relationship (where the “many” objects always appear with, or are viewed in the context of their parent documents) is a natural candidate for embedding the referenced information within the parent document. The concept of data ownership and containment can also be modeled with embedding. Using the product data example above, product pricing – both current and historical – should be embedded within the product document since it is owned by and contained within that specific product. If the product is deleted, the pricing becomes irrelevant.
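To make the pricing example concrete, here is a minimal sketch of what such a product document might look like, written as a plain Python dictionary. The field names (`pricing`, `current`, `history`, `since`) and the sample values are assumptions made for illustration, not a schema prescribed by MongoDB.

```python
from datetime import date

# Hypothetical product document: pricing is owned by the product, so the
# current and historical prices are embedded as sub-documents and an array.
product = {
    "_id": "sku-1001",
    "name": "Trail Running Shoe",
    "manufacturer": {"name": "Acme Footwear", "country": "US"},
    "pricing": {
        "current": {"amount": 89.99, "currency": "USD", "since": date(2024, 3, 1)},
        "history": [
            {"amount": 99.99, "currency": "USD", "since": date(2023, 1, 15)},
            {"amount": 94.99, "currency": "USD", "since": date(2023, 9, 1)},
        ],
    },
}


def price_on(doc, day):
    """Return the price in effect on `day`, reading only this one document."""
    entries = doc["pricing"]["history"] + [doc["pricing"]["current"]]
    in_effect = [entry for entry in entries if entry["since"] <= day]
    if not in_effect:
        return None  # the product had no price yet on that day
    return max(in_effect, key=lambda entry: entry["since"])["amount"]
```

Because everything needed to answer the question lives in one document, a single read is enough, and deleting the product removes its pricing along with it.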

DBAs should also embed fields that need to be modified together atomically. To learn more, refer to the section below on the MongoDB transaction model.

Not all 1:1 or 1:Many relationships should be embedded in a single document. Instead, referencing between documents in different collections should be used when:

  • A document is frequently read, but contains an embedded document that is rarely accessed. An example might be a customer record that embeds copies of the annual general report. Embedding the report only increases the in-memory requirements (the working set) of the collection.
  • One part of a document is frequently updated and constantly growing in size, while the remainder of the document is relatively static.
  • The document size would exceed MongoDB’s current 16 MB document limit.


Referencing enables data normalization, and can give more flexibility than embedding. But the application will issue follow-up queries to resolve the reference, requiring additional round-trips to the server.

References are usually implemented by saving the _id field of one document in the related document as a reference. A second query is then executed by the application to return the referenced data.
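The two-step lookup can be sketched as follows. This is an in-memory illustration only: plain dictionaries stand in for the two collections, and the `product_id` field name and sample data are assumptions, not part of any MongoDB API.

```python
# Two in-memory "collections" keyed by _id. Plain dicts stand in for MongoDB.
products = {
    "sku-1001": {"_id": "sku-1001", "name": "Trail Running Shoe"},
}
reviews = {
    "rev-1": {"_id": "rev-1", "product_id": "sku-1001", "rating": 5, "text": "Great grip."},
    "rev-2": {"_id": "rev-2", "product_id": "sku-1001", "rating": 3, "text": "Runs small."},
    "rev-3": {"_id": "rev-3", "product_id": "sku-2002", "rating": 4, "text": "Solid."},
}


def product_with_reviews(product_id):
    """Resolve the reference in the application: one lookup per collection."""
    product = products.get(product_id)  # first query: the product itself
    if product is None:
        return None
    # Second query: every review whose stored product_id points at this product.
    matching = [r for r in reviews.values() if r["product_id"] == product_id]
    return {**product, "reviews": matching}
```

Against a real deployment, each of the two lookups would be a separate round-trip to the server, which is the cost referencing trades for normalization.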

Referencing should be used:

  • When embedding would not provide sufficient read performance advantages to outweigh the implications of data duplication.
  • Where the object is referenced from many different sources.
  • To represent complex many-to-many relationships.
  • To model large, hierarchical data sets.

Different Design Goals

Comparing these two design options – embedding sub-documents versus referencing between documents – highlights a fundamental difference between relational and document databases:

  • The RDBMS optimizes data for storage efficiency (as it was conceived at a time when storage was the most expensive component of the system).
  • MongoDB’s document model is optimized for how the application accesses data (as developer time and speed to market are now more expensive than storage).

Data modeling considerations, patterns and examples including embedded versus referenced relationships are discussed in more detail in the documentation.

MongoDB Transaction Model

Relational databases typically have well-developed features for data integrity, including ACID transactions and constraint enforcement. Rightly, users do not want to sacrifice data integrity as they move to new types of databases. With MongoDB, users can maintain many capabilities of relational databases, even though the technical implementation of those capabilities may be different; we have already seen this in part 1 of the series where we discussed JOINs.

Since the early versions of MongoDB, MongoDB write operations have been ACID-compliant at the document level – including the ability to update embedded arrays and sub-documents atomically. By embedding related fields within a single document, users get the same integrity guarantees as a traditional RDBMS, which has to synchronize costly ACID operations and maintain referential integrity across separate tables.

Document-level ACID compliance in MongoDB ensures complete isolation as a document is updated; any errors cause the operation to roll back and clients receive a consistent view of the document.

Beginning in version 4.0, MongoDB added support for multi-document ACID transactions. MongoDB transactions provide a globally consistent view of data across replica sets and enforce all-or-nothing execution to maintain data integrity.

For those using versions of MongoDB prior to 4.0, there may be cases that require multi-document transactions despite the power of single-document atomic operations. There are multiple approaches to this, including the findAndModify command, which allows a document to be updated atomically and returned in the same round trip. findAndModify is a powerful primitive on top of which users can build other, more complex transaction protocols. For example, users frequently build atomic soft-state locks, job queues, counters and state machines that can help coordinate more complex behaviors. Another alternative is to implement a two-phase commit to provide transaction-like semantics.
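As a rough illustration of how a job queue can be built on an atomic find-and-modify primitive, here is a small in-memory simulation. It does not use the MongoDB driver: the `JobQueue` class, its `find_and_modify` method, and the `state` and `owner` fields are all hypothetical, and a `threading.Lock` plays the role of the database's document-level atomicity.

```python
import threading


class JobQueue:
    """In-memory stand-in for a jobs collection (not the MongoDB driver API)."""

    def __init__(self, jobs):
        self._jobs = [dict(job) for job in jobs]
        self._lock = threading.Lock()  # simulates document-level atomicity

    def find_and_modify(self, query, update):
        """Atomically update the first job matching `query`; return the updated copy."""
        with self._lock:
            for job in self._jobs:
                if all(job.get(key) == value for key, value in query.items()):
                    job.update(update)
                    return dict(job)
        return None  # nothing matched the query


queue = JobQueue([
    {"_id": 1, "state": "pending", "owner": None},
    {"_id": 2, "state": "pending", "owner": None},
])


def claim_next(worker):
    """Claim one pending job for `worker`; matching and updating happen as one step."""
    return queue.find_and_modify(
        {"state": "pending"},
        {"state": "running", "owner": worker},
    )
```

Because the match and the update happen as a single step, two workers calling `claim_next` at the same time cannot both receive the same job, which is the property the soft-state lock and job queue patterns rely on.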

How Can I Get to my Data?

Unlike many other non-relational databases, MongoDB has a rich query model and powerful secondary indexes that provide flexibility in how data is accessed.

As with any database – relational or non-relational – indexes are the single biggest tunable performance factor and are therefore integral to schema design. Indexes in MongoDB largely correspond to indexes in a relational database. MongoDB uses B-Tree indexes, and natively supports secondary indexes. As such, they will be immediately familiar to those coming from a SQL background.

The type and frequency of the application’s queries should inform index selection. As with all databases, indexing does not come free: it imposes overhead on writes and resource (disk and memory) usage.
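The trade-off can be illustrated with a deliberately simplified model of an ordered index. In this sketch a sorted list of `(price, _id)` pairs stands in for a B-Tree index on a hypothetical `price` field; the helper names are illustrative, and none of this reflects MongoDB's actual index implementation.

```python
import bisect

products = {
    1: {"_id": 1, "name": "Shoe", "price": 89.99},
    2: {"_id": 2, "name": "Sock", "price": 9.50},
    3: {"_id": 3, "name": "Jacket", "price": 149.00},
}

# Sorted (price, _id) pairs: a simplified stand-in for an ordered index on "price".
price_index = sorted((doc["price"], _id) for _id, doc in products.items())


def insert_product(doc):
    """Every write must also maintain the index: this is the write overhead."""
    products[doc["_id"]] = doc
    bisect.insort(price_index, (doc["price"], doc["_id"]))


def find_by_price_range(low, high):
    """Return names of products with low <= price <= high, in price order."""
    start = bisect.bisect_left(price_index, (low,))
    end = bisect.bisect_right(price_index, (high, float("inf")))
    return [products[_id]["name"] for _, _id in price_index[start:end]]
```

The range query touches only the matching slice of the index rather than scanning every product, while each insert pays for an extra ordered insertion and the index itself takes up additional memory.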

By default, MongoDB creates an index on the document’s _id primary key field. All user-defined indexes are secondary indexes. Any field can be used for a secondary index, including fields within arrays. Index options for MongoDB include:

  • Compound Indexes
  • Geospatial Indexes
  • Text Search Indexes
  • Unique Indexes
  • Array Indexes
  • TTL Indexes
  • Sparse Indexes
  • Hashed Indexes

MongoDB also supports index intersection, allowing the use of multiple indexes to fulfill a query.

Next Steps

Having read this far, you hopefully now have a good idea of how to “think in documents” rather than tables. But you probably want to learn more. Take a look at the Thinking in Documents webinar.

To look at specific considerations in moving from relational databases, download the guide below.

Download the RDBMS Migration Guide

This guide goes into more detail on:

  • The different types of secondary indexes in MongoDB
  • The aggregation framework, which provides functionality similar to GROUP BY and related SQL statements, in addition to enabling in-database transformations
  • Implementing validation and constraints
  • Best practices for migrating data from relational tables to documents