If your brain feels bloated from too much reading, sit back, relax, and watch this video.
Chances are pretty good that you want your queries to be blazing fast. MongoDB wants your queries to be blazing fast too.
To keep your queries running as quickly as possible, WiredTiger (the default storage engine for MongoDB) keeps all of the indexes plus the documents that are accessed the most frequently in memory. We refer to these frequently accessed documents and index pages as the working set. When the working set fits in the RAM allotment, MongoDB can query from memory instead of from disk. Queries from memory are faster, so the goal is to keep your most popular documents small enough to fit in the RAM allotment.
The working set's RAM allotment is the larger of:
- 50% of (RAM - 1 GB)
- 256 MB.
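As a quick illustration of the rule above, here is a minimal sketch that computes the allotment for a few RAM sizes. The function name is hypothetical, and the sketch assumes 1 GB = 1024 MB, so 256 MB is 0.25 GB.

```python
def working_set_allotment_gb(ram_gb: float) -> float:
    """Return the larger of 50% of (RAM - 1 GB) and 256 MB, expressed in GB."""
    half_of_remaining = 0.5 * (ram_gb - 1.0)
    floor_gb = 0.25  # 256 MB, assuming 1 GB = 1024 MB
    return max(half_of_remaining, floor_gb)


# Print the allotment for a few example RAM sizes.
for ram_gb in (1, 2, 4):
    allotment = working_set_allotment_gb(ram_gb)
    print(f"{ram_gb} GB RAM -> {allotment:.2f} GB allotment")
```

With 2 GB of RAM the allotment works out to 0.5 GB, and with 4 GB it works out to 1.5 GB.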
One of the rules of thumb you'll hear frequently when discussing MongoDB schema design is data that is accessed together should be stored together. Note that it doesn't say data that is related to each other should be stored together.
Sometimes data that is related to each other isn't actually accessed together. You might have large, bloated documents that contain information that is related but not actually accessed together frequently. In that case, separate the information into smaller documents in separate collections and use references to connect those documents together.
Let's revisit Leslie's website for inspirational women that we discussed in an earlier post in this series. Leslie updates the home page to display a list of the names of 100 randomly selected inspirational women. When a user clicks on the name of an inspirational woman, they will be taken to a new page with all of the detailed biographical information about the woman they selected. Leslie fills the website with 4,704 inspirational women, including herself.
Initially, Leslie decides to create one collection named InspirationalWomen, and creates a document for each inspirational woman. The document contains all of the information for that woman. Below is a document she creates for Sally Ride.
Leslie notices that her home page is lagging. The home page is the most visited page on her site, and, if the page doesn't load quickly enough, visitors will abandon her site completely.
Leslie has two choices: she can restructure her data to remove the bloated documents, or she can move up to an M20 dedicated cluster, which has 4 GB of RAM. Leslie considers her options and decides that having the home page and the most popular inspirational women's documents load quickly is most important. She decides that having the less frequently viewed women's pages take slightly longer to load is fine.
She begins determining how to restructure her data to optimize for performance. The query on Leslie's homepage only needs to retrieve each woman's first name and last name. Having this information in the working set is crucial. The other information about each woman (including a lengthy bio) doesn't necessarily need to be in the working set.
To ensure her home page loads at a blazing fast pace, she decides to break up the information in her InspirationalWomen collection into two collections: InspirationalWomen_Summary and InspirationalWomen_Details. She creates a reference between the matching documents in the collections. Below are her new documents for Sally Ride.
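To make the split concrete, here is a minimal sketch in plain Python. It does not reproduce Leslie's actual documents: the field names (`first_name`, `last_name`, `bio`, `details_id`), the `_id` value, and the placeholder bio are all assumptions made for illustration. The details document keeps every field, while the summary document keeps only the names plus a reference to its details document.

```python
from typing import Any

# Hypothetical names for the fields the home page needs.
SUMMARY_FIELDS = ("first_name", "last_name")


def split_document(full_doc: dict[str, Any]) -> tuple[dict[str, Any], dict[str, Any]]:
    """Split one bloated document into a (summary, details) pair."""
    # The details document keeps everything from the original.
    details = dict(full_doc)
    # The summary document keeps only the fields the home page reads.
    summary = {field: full_doc[field] for field in SUMMARY_FIELDS}
    # Hypothetical reference field that points at the details document.
    summary["details_id"] = full_doc["_id"]
    return summary, details


# Placeholder values; only the name comes from the article.
original = {
    "_id": 1,
    "first_name": "Sally",
    "last_name": "Ride",
    "bio": "(lengthy biography text)",
}
summary, details = split_document(original)
print(summary)
print(details)
```

The summary document has no `_id` of its own in this sketch; when a document is inserted without one, MongoDB generates it.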
Leslie updates her query on the home page that retrieves each woman's first name and last name to use the InspirationalWomen_Summary collection. When a user selects a woman to learn more about, Leslie's website code will query for a document in the InspirationalWomen_Details collection using the id stored in the corresponding InspirationalWomen_Summary document.
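The two-step access pattern might look roughly like the sketch below. It uses in-memory stand-ins rather than a real database so that it runs on its own. The function names, the field names, and the `details_id` reference field are assumptions, not Leslie's actual code.

```python
# In-memory stand-ins for the two collections (placeholder values).
summary_collection = [
    {"first_name": "Sally", "last_name": "Ride", "details_id": 1},
]
details_collection = {
    1: {
        "_id": 1,
        "first_name": "Sally",
        "last_name": "Ride",
        "bio": "(lengthy biography text)",
    },
}


def home_page_names(summaries):
    """Home page query: reads only the small summary documents."""
    return [f"{doc['first_name']} {doc['last_name']}" for doc in summaries]


def load_details(summary_doc, details_by_id):
    """Detail page query: follows the reference stored in the summary."""
    return details_by_id[summary_doc["details_id"]]


print(home_page_names(summary_collection))
selected = summary_collection[0]
print(load_details(selected, details_collection)["bio"])
```

The home page never touches the large documents, so only the small summaries need to stay in memory for it to load quickly.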
Leslie returns to Atlas and inspects the size of her databases and collections. She can see that the total index size for both collections is 276 KB (180 KB + 96 KB). She can also see that the size of her InspirationalWomen_Summary collection is about 455 KB. The sum of the indexes and this collection is about 731 KB, which is significantly less than her working set's RAM allocation of 0.5 GB. Because of this, many of the most popular documents from the InspirationalWomen_Details collection will also fit in the working set.
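The arithmetic above can be checked with a few lines of Python. The sizes come from the figures quoted above; the variable names are hypothetical, and the conversion assumes 1 GB = 1024 × 1024 KB.

```python
# Sizes reported in Atlas, in KB.
index_sizes_kb = [180, 96]
summary_collection_kb = 455

# Working set allocation of 0.5 GB, assuming 1 GB = 1024 * 1024 KB.
allotment_kb = 0.5 * 1024 * 1024

total_index_kb = sum(index_sizes_kb)
must_fit_kb = total_index_kb + summary_collection_kb
headroom_kb = allotment_kb - must_fit_kb

print(f"Indexes: {total_index_kb} KB")
print(f"Indexes + summary collection: {must_fit_kb} KB")
print(f"Left over for popular detail documents: {headroom_kb:.0f} KB")
```

Under these assumptions, the indexes and the summary collection together use well under 1% of the allocation, which leaves nearly all of it for frequently viewed detail documents.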
In the example above, Leslie is duplicating all of the data from the InspirationalWomen_Summary collection in the InspirationalWomen_Details collection. You might be cringing at the idea of data duplication. Historically, data duplication has been frowned upon due to space constraints as well as the challenges of keeping the data updated in both collections. Storage is relatively cheap, so we don't necessarily need to worry about that here. Additionally, the data that is duplicated is unlikely to change very often.
In most cases, you won't need to duplicate all of the information in more than one collection; you'll be able to store some of the information in one collection and the rest of the information in the other. It all depends on your use case and how you are using the data.
Be sure that the indexes and the most frequently used documents fit in the RAM allocation for your database in order to get blazing fast queries. If your working set is exceeding the RAM allocation, check if your documents are bloated with extra information that you don't actually need in the working set. Separate frequently used data from infrequently used data in different collections to optimize your performance.
Check back soon for the next post in this schema design anti-patterns series!