Massive Arrays

Lauren Schaefer, Daniel Coupal • Published Feb 12, 2022 • Updated May 31, 2022
Design patterns are a fundamental part of software engineering. They provide developers with best practices and a common language as they architect applications.
Sometimes, developers jump right into designing their schemas and building their apps without thinking about best practices. As their apps begin to scale, they realize that things are bad.
We've identified several common mistakes developers make with MongoDB. We call these mistakes "schema design anti-patterns."
Throughout this blog series, I'll introduce you to six common anti-patterns. Let's start today with the Massive Arrays anti-pattern.

Massive Arrays

One of the rules of thumb when modeling data in MongoDB is that data that is accessed together should be stored together. If you'll be retrieving or updating data together frequently, you should probably store it together. Data is commonly stored together by embedding related information in subdocuments or arrays.
The problem is that sometimes developers take this too far and embed massive amounts of information in a single document.
Consider an example where we store information about employees who work in various government buildings. If we were to embed the employees in the building document, we might store our data in a buildings collection like the following:
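As a minimal sketch, such a document might look like the following, written here as a Python dict. The field names and values are illustrative assumptions, not a prescribed schema.

```python
# Hypothetical document in a `buildings` collection.
# All field names and values are illustrative assumptions.
building = {
    "_id": "city_hall",
    "name": "City Hall",
    "city": "Pawnee",
    "state": "IN",
    # Unbounded array: one entry per employee who works in the building.
    "employees": [
        {"_id": 1, "first": "Leslie", "last": "Yepp", "start_year": 2004},
        {"_id": 2, "first": "Ron", "last": "Swandaughter", "start_year": 2002},
    ],
}

# Every new hire grows the same document.
building["employees"].append(
    {"_id": 3, "first": "Andy", "last": "Fryer", "start_year": 2011}
)
```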
In this example, the employees array is unbounded. As we begin storing information about all of the employees who work in City Hall, the employees array will become massive, potentially sending us over the 16 MB document maximum. Additionally, reading and building indexes on arrays gradually becomes less performant as array size increases.
The example above illustrates the massive arrays anti-pattern.
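To get a feel for how an unbounded array approaches the document size limit, here is a rough back-of-the-envelope sketch. It assumes a hypothetical employee subdocument and uses the length of its JSON encoding as a stand-in for its BSON size, so the result is only an order-of-magnitude estimate.

```python
import json

# MongoDB's maximum BSON document size is 16 MiB.
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024

# Hypothetical embedded employee; field names are illustrative assumptions.
employee = {"_id": 123456789, "first": "Leslie", "last": "Yepp", "start_year": 2004}

# JSON length is only a rough proxy for BSON size (BSON adds type and length
# bytes and stores numbers in binary form), so treat this as an estimate.
approx_bytes_per_employee = len(json.dumps(employee).encode("utf-8"))

# Roughly how many such subdocuments fit before the array alone hits the limit.
approx_max_employees = MAX_DOCUMENT_BYTES // approx_bytes_per_employee

print(approx_bytes_per_employee, approx_max_employees)
```

Richer subdocuments shrink that number, and the read and indexing costs mentioned above grow with the array well before the hard limit is reached.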
So how can we fix this?
Instead of embedding the employees in the buildings documents, we could flip the model and instead embed the buildings in the employees documents:
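A minimal sketch of the flipped model, again as Python dicts with hypothetical field names and values:

```python
# Hypothetical documents in an `employees` collection.
# All field names and values are illustrative assumptions.
city_hall = {"_id": "city_hall", "name": "City Hall", "city": "Pawnee", "state": "IN"}

employees = [
    {
        "_id": 1,
        "first": "Leslie",
        "last": "Yepp",
        "start_year": 2004,
        # The building is embedded (and repeated) in every employee document.
        "building": dict(city_hall),
    },
    {
        "_id": 2,
        "first": "Ron",
        "last": "Swandaughter",
        "start_year": 2002,
        "building": dict(city_hall),
    },
]
```

Each employee document now has a fixed, bounded shape: adding an employee adds a document instead of growing an array.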
In the example above, we are repeating the information about City Hall in the document for each City Hall employee. If we are frequently displaying information about an employee and their building in our application together, this model probably makes sense.
The disadvantage with this approach is we have a lot of data duplication. Storage is cheap, so data duplication isn't necessarily a problem from a storage cost perspective. However, every time we need to update information about City Hall, we'll need to update the document for every employee who works there. If we take a look at the information we're currently storing about the buildings, updates will likely be very infrequent, so this approach may be a good one.
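The update cost can be sketched as follows. The filter and update documents use MongoDB's dot notation and `$set` operator; the field names are hypothetical, and the loop is only an in-memory stand-in for what a multi-document update (for example, PyMongo's `update_many`) would do on the server.

```python
# Hypothetical employee documents with the building embedded in each one.
employees = [
    {"_id": 1, "first": "Leslie", "building": {"_id": "city_hall", "name": "City Hall"}},
    {"_id": 2, "first": "Ron", "building": {"_id": "city_hall", "name": "City Hall"}},
    {"_id": 3, "first": "Ann", "building": {"_id": "health_dept", "name": "Health Department"}},
]

# Filter and update documents as they might be passed to a multi-document
# update, e.g. db.employees.update_many(update_filter, update) in PyMongo.
update_filter = {"building._id": "city_hall"}
update = {"$set": {"building.name": "Pawnee City Hall"}}

# In-memory stand-in: every matching employee document has to be rewritten.
modified = 0
for doc in employees:
    if doc["building"]["_id"] == update_filter["building._id"]:
        doc["building"]["name"] = update["$set"]["building.name"]
        modified += 1

print(modified)
```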
If our use case does not call for information about employees and their building to be displayed or updated together, we may want to instead separate the information into two collections and use references to link them:
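A minimal sketch of the two collections, assuming a hypothetical `building_id` field on each employee that holds the `_id` of a building document:

```python
# Hypothetical `buildings` collection: one document per building.
buildings = [
    {"_id": "city_hall", "name": "City Hall", "city": "Pawnee", "state": "IN"},
]

# Hypothetical `employees` collection: each employee stores only a reference
# (the building's _id) rather than a copy of the building's data.
employees = [
    {"_id": 1, "first": "Leslie", "last": "Yepp", "building_id": "city_hall"},
    {"_id": 2, "first": "Ron", "last": "Swandaughter", "building_id": "city_hall"},
]
```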
Here we have completely separated our data. We have eliminated massive arrays, and we have no data duplication.
The drawback is that if we need to retrieve information about an employee and their building together, we'll need to use $lookup to join the data together. $lookup operations can be expensive, so it's important to consider how often you'll need to perform $lookup if you choose this option.
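As a sketch of what that join might look like, the aggregation pipeline below is written as a Python list. The collection name `buildings` and the `building_id` field are illustrative assumptions.

```python
# Aggregation pipeline joining each employee to its building.
# Collection and field names are illustrative assumptions.
pipeline = [
    {
        "$lookup": {
            "from": "buildings",          # collection to join with
            "localField": "building_id",  # field in the employees documents
            "foreignField": "_id",        # field in the buildings documents
            "as": "building",             # output array of matching buildings
        }
    },
    # $lookup yields an array; unwind it to a single embedded building.
    {"$unwind": "$building"},
]

# With PyMongo this might be run as: db.employees.aggregate(pipeline)
```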
If we find ourselves frequently using $lookup, another option is to use the extended reference pattern. The extended reference pattern is a mixture of the previous two approaches where we duplicate some, but not all, of the data in the two collections. We only duplicate the data that is frequently accessed together.
For example, if our application has a user profile page that displays information about the user as well as the name of the building and the state where they work, we may want to embed the building name and state fields in the employee document:
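A minimal sketch of such an employee document, with hypothetical field names, might look like this:

```python
# Hypothetical `buildings` collection keeps the full building data.
buildings = [
    {"_id": "city_hall", "name": "City Hall", "city": "Pawnee", "state": "IN"},
]

# Hypothetical employee document using an extended reference.
employee = {
    "_id": 1,
    "first": "Leslie",
    "last": "Yepp",
    "building": {
        "_id": "city_hall",   # reference to the full buildings document
        "name": "City Hall",  # duplicated: shown on the profile page
        "state": "IN",        # duplicated: shown on the profile page
    },
}

# The profile page can be rendered from the employee document alone,
# with no join against the buildings collection.
full_name = f"{employee['first']} {employee['last']}"
workplace = f"{employee['building']['name']} ({employee['building']['state']})"
profile_line = f"{full_name}, {workplace}"
print(profile_line)
```

Less frequently needed fields, such as the city in this sketch, stay only in the buildings collection and can still be reached through the stored `_id`.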
As we saw when we duplicated data previously, we should be mindful of duplicating data that will frequently be updated. In this particular case, the name of the building and the state the building is in are very unlikely to change, so this solution works.


Storing related information that you'll be frequently querying together is generally good. However, storing information in massive arrays that will continue to grow over time is generally bad.
As is true with all MongoDB schema design patterns and anti-patterns, carefully consider your use case—the data you will store and how you will query it—in order to determine what schema design is best for you.
Be on the lookout for more posts in this anti-patterns series in the coming weeks.
When you're ready to build a schema in MongoDB, check out MongoDB Atlas, MongoDB's fully managed database-as-a-service. Atlas is the easiest way to get started with MongoDB. With a forever-free tier, you're on your way to realizing the full value of MongoDB.
