Design patterns are a fundamental part of software engineering. They provide developers with best practices and a common language as they architect applications.
Sometimes, developers jump right into designing their schemas and building their apps without thinking about best practices. As their apps begin to scale, they discover that their schema choices are hurting performance and are hard to change.
We've identified several common mistakes developers make with MongoDB. We call these mistakes "schema design anti-patterns."
Throughout this blog series, I'll introduce you to six common anti-patterns. Let's start today with the Massive Arrays anti-pattern.
One of the rules of thumb when modeling data in MongoDB is data that is accessed together should be stored together. If you'll be retrieving or updating data together frequently, you should probably store it together. Data is commonly stored together by embedding related information in subdocuments or arrays.
The problem is that sometimes developers take this too far and embed massive amounts of information in a single document.
Consider an example where we store information about employees who work in various government buildings. If we were to embed the employees in the building document, we might store our data in a buildings collection like the following:
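The original example document is not reproduced here, so what follows is only a minimal sketch of what such a building document might look like, written as a Python dictionary. All field names and values (`_id`, `name`, `city`, `state`, `employees`, and the sample people) are hypothetical.

```python
# Hypothetical document in the `buildings` collection.
# Every field name and value is illustrative, not taken from a real schema.
building = {
    "_id": "city_hall",
    "name": "City Hall",
    "city": "Springfield",
    "state": "IL",
    # Unbounded array: one embedded subdocument per employee.
    "employees": [
        {"_id": 1, "first": "Ada", "last": "Lopez", "start_year": 2004},
        {"_id": 2, "first": "Ben", "last": "Ng", "start_year": 2011},
        # ...one more entry for every person who works here
    ],
}

# The array grows with every new hire; nothing in the schema caps its length.
building["employees"].append(
    {"_id": 3, "first": "Cara", "last": "Osei", "start_year": 2020}
)
print(len(building["employees"]))
```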
In this example, the employees array is unbounded. As we begin storing information about all of the employees who work in City Hall, the employees array will become massive, potentially sending us over the 16 MB document size limit. Additionally, reading and building indexes on arrays gradually becomes less performant as array size increases.
An unbounded, ever-growing embedded array like this is the massive arrays anti-pattern.
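To get a rough feel for how quickly such a document can grow, the sketch below estimates how many embedded employees would fit under MongoDB's 16 MB maximum document size. It uses the length of the JSON encoding as a stand-in for the real BSON size, which is only an approximation, and the employee fields are hypothetical.

```python
import json

# MongoDB caps a single BSON document at 16 MB (16 * 1024 * 1024 bytes).
MAX_DOCUMENT_BYTES = 16 * 1024 * 1024

# Hypothetical embedded employee; field names are illustrative.
employee = {
    "_id": 123456,
    "first": "Ada",
    "last": "Lopez",
    "cell": "5550100",
    "start_year": 2004,
}

# JSON length is only a rough proxy for BSON size, so treat the result
# as an order-of-magnitude estimate rather than an exact capacity.
approx_employee_bytes = len(json.dumps(employee).encode("utf-8"))
approx_capacity = MAX_DOCUMENT_BYTES // approx_employee_bytes

print(f"~{approx_employee_bytes} bytes per employee")
print(f"~{approx_capacity:,} employees before the document nears 16 MB")
```

Real employee records usually carry more fields than this, so the practical ceiling would be lower than the estimate.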
So how can we fix this?
Instead of embedding the employees in the buildings documents, we could flip the model and instead embed the buildings in the employees documents:
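As a minimal sketch of the flipped model, with hypothetical field names and sample values, each employee document could carry its own embedded copy of the building:

```python
# Hypothetical documents in the `employees` collection.
# Each employee embeds a full copy of the building they work in.
employees = [
    {
        "_id": 1,
        "first": "Ada",
        "last": "Lopez",
        "building": {
            "_id": "city_hall",
            "name": "City Hall",
            "city": "Springfield",
            "state": "IL",
        },
    },
    {
        "_id": 2,
        "first": "Ben",
        "last": "Ng",
        "building": {
            "_id": "city_hall",
            "name": "City Hall",
            "city": "Springfield",
            "state": "IL",
        },
    },
]

# Collect the building copies held by City Hall employees.
city_hall_copies = [
    e["building"] for e in employees if e["building"]["_id"] == "city_hall"
]
print(len(city_hall_copies))
```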
In the example above, we are repeating the information about City Hall in the document for each City Hall employee. If we are frequently displaying information about an employee and their building in our application together, this model probably makes sense.
The disadvantage of this approach is that we duplicate a lot of data. Storage is cheap, so data duplication isn't necessarily a problem from a storage cost perspective. However, every time we need to update information about City Hall, we'll need to update the document for every employee who works there. If we take a look at the information we're currently storing about the buildings, updates will likely be very infrequent, so this approach may be a good one.
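The update cost described above can be sketched as follows. The filter and update documents use hypothetical field names; with PyMongo they would typically be passed to `update_many` on the employees collection. The loop is only an in-memory stand-in that counts how many documents one building change touches.

```python
# Hypothetical filter and update documents for a multi-document update.
update_filter = {"building._id": "city_hall"}
update_spec = {"$set": {"building.name": "Springfield City Hall"}}

# In-memory stand-in for the `employees` collection (illustrative data).
employees = [
    {"_id": 1, "first": "Ada", "building": {"_id": "city_hall", "name": "City Hall"}},
    {"_id": 2, "first": "Ben", "building": {"_id": "city_hall", "name": "City Hall"}},
    {"_id": 3, "first": "Cara", "building": {"_id": "library", "name": "Library"}},
]

# Emulate the update: every employee of the building must be rewritten.
modified = 0
for doc in employees:
    if doc["building"]["_id"] == update_filter["building._id"]:
        doc["building"]["name"] = update_spec["$set"]["building.name"]
        modified += 1

print(modified)
```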
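If we rarely need employee and building information together, we can instead separate the data completely into two collections, with each employee document referencing its building. This eliminates massive arrays and avoids data duplication.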
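A minimal sketch of fully separated data might look like the following. The collection names, the `building_id` reference field, and the sample values are all hypothetical.

```python
# Hypothetical `buildings` collection: one document per building,
# with no embedded employees.
buildings = [
    {"_id": "city_hall", "name": "City Hall", "city": "Springfield", "state": "IL"},
]

# Hypothetical `employees` collection: each employee stores only a
# reference (the building's _id) instead of a copy of the building.
employees = [
    {"_id": 1, "first": "Ada", "last": "Lopez", "building_id": "city_hall"},
    {"_id": 2, "first": "Ben", "last": "Ng", "building_id": "city_hall"},
]

# Every reference should point at an existing building document.
known_building_ids = {b["_id"] for b in buildings}
dangling = [e for e in employees if e["building_id"] not in known_building_ids]
print(len(dangling))
```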
The drawback is that if we need to retrieve information about an employee and their building together, we'll need to use $lookup to join the data together. $lookup operations can be expensive, so it's important to consider how often you'll need to perform $lookup if you choose this option.
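The join described above can be sketched as an aggregation pipeline. The collection and field names are hypothetical; with PyMongo, a list like this would typically be passed to the employees collection's `aggregate` method. The second half is an in-memory emulation of the stage's output, included only to show the shape of the result.

```python
# $lookup stage joining each employee to its building.
pipeline = [
    {
        "$lookup": {
            "from": "buildings",          # collection to join with
            "localField": "building_id",  # field on the employee document
            "foreignField": "_id",        # field on the building document
            "as": "building",             # output array of matched buildings
        }
    }
]

# Illustrative in-memory data standing in for the two collections.
buildings = [{"_id": "city_hall", "name": "City Hall", "state": "IL"}]
employees = [
    {"_id": 1, "first": "Ada", "building_id": "city_hall"},
    {"_id": 2, "first": "Ben", "building_id": "city_hall"},
]

# Emulate the stage: each employee gains an array of matching buildings.
stage = pipeline[0]["$lookup"]
joined = [
    {
        **emp,
        stage["as"]: [
            b for b in buildings
            if b[stage["foreignField"]] == emp[stage["localField"]]
        ],
    }
    for emp in employees
]
print(joined[0]["building"])
```

Note that the joined field is an array, since a lookup can match more than one document.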
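If those lookups turn out to be frequent, a middle ground is to duplicate only the fields that are read together, an approach MongoDB calls the extended reference pattern. For example, if our application has a user profile page that displays information about the user as well as the name of the building and the state where they work, we may want to embed the building name and state fields in the employee document: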
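As a minimal sketch with hypothetical field names, the employee document could keep a reference to the building plus copies of just the two fields the profile page needs:

```python
# Hypothetical employee document: the building's _id acts as the
# reference, while name and state are duplicated for fast reads.
employee = {
    "_id": 1,
    "first": "Ada",
    "last": "Lopez",
    "building": {
        "_id": "city_hall",
        "name": "City Hall",
        "state": "IL",
    },
}

# The profile page can now be rendered from this single document,
# without joining against the buildings collection.
profile = (
    f'{employee["first"]} {employee["last"]}, '
    f'{employee["building"]["name"]} ({employee["building"]["state"]})'
)
print(profile)
```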
As we saw when we duplicated data previously, we should be mindful of duplicating data that will frequently be updated. In this particular case, the name of the building and the state the building is in are very unlikely to change, so this solution works.
Storing related information that you'll be frequently querying together is generally good. However, storing information in massive arrays that will continue to grow over time is generally bad.
As is true with all MongoDB schema design patterns and anti-patterns, carefully consider your use case—the data you will store and how you will query it—in order to determine what schema design is best for you.
Be on the lookout for more posts in this anti-patterns series in the coming weeks.
May 16, 2022