Performance Best Practices: MongoDB Data Modeling and Memory Sizing

Mat Keep and Henrik Ingo
January 28, 2020 | Updated: July 21, 2020

Welcome to the first in a series of blog posts covering performance best practices for MongoDB. In this series, we will cover key considerations for achieving performance at scale across a number of important dimensions, including:

Data modeling and sizing memory (the working set)
Query patterns and profiling
Indexing
Sharding
Transactions and read/write concerns
Hardware and OS configuration
Benchmarking

Who is this series for?

While the best practices we are going to cover are not exhaustive, there will be recommendations in this series that will be useful to you whether you are:

Starting your first project or are a seasoned MongoDB developer.
Running MongoDB on Atlas, our fully-managed and global cloud database service, or managing MongoDB yourself.

What are we covering in this post?

We’re going to start with two key considerations that will set up all of the performance best practices we discuss through the rest of this blog series. First we’ll discuss schema design and the key resources to get you started, before moving on to RAM sizing for your application’s most frequently accessed data and indexes, which we call the “working set”.

Data Modeling Matters

The first step in performance optimization is to understand your application’s query patterns so that you design your data model and select the appropriate indexes accordingly. Tailoring your data model to the application’s query patterns produces more efficient queries, increases the throughput of insert and update operations, and more effectively distributes your workload across a sharded cluster.

Just because MongoDB has a flexible schema does not mean you can ignore schema design! While you can modify your schema at any time, applying schema design best practices at the outset of your project will save potential refactoring later on.

A major advantage of JSON documents is that you have the flexibility to model your data any way your application needs. The nesting of arrays and subdocuments makes documents very powerful at modeling complex relationships between data. But you can also model flat, tabular and columnar structures, simple key-value pairs, text, geospatial and time-series data, or the nodes and edges of connected graph data structures. The optimal schema design will be determined by your application’s query patterns.

Key Considerations and Resources for Data Modeling

As you design your data model, one of the first decisions you need to make is how to model relationships between data. Deciding when to embed a document or instead create a reference between separate documents in different collections is an application-specific consideration. There are, however, some general considerations to guide the decision during schema design.

Embedding

Data with a 1:1 relationship is an obvious and natural candidate for embedding within a single document. Data with a 1:many relationship, where the "many" objects always appear with or are viewed in the context of their parent document, is also best served by embedding. Because this data is always accessed together, storing it together in the same document is optimal.

As a result of this data locality, embedding generally provides better performance for read operations due to the ability to request and retrieve related data in a single internal database operation, rather than looking up documents stored in different collections. Embedded data models also make it possible to update related data in a single atomic write operation because single document writes are transactional.
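
As a rough illustration of embedding, the sketch below models a customer with a 1:1 address and 1:many phone numbers as a single document, written as a plain Python dictionary. The field names and values are hypothetical and are not taken from any MongoDB sample dataset.

```python
# Hypothetical customer document: the address (1:1) and phone numbers
# (1:many) are embedded because they are always read with the customer.
customer = {
    "_id": 1001,
    "name": "Ada Example",
    "address": {"street": "1 Main St", "city": "Springfield"},
    "phones": [
        {"type": "mobile", "number": "555-0100"},
        {"type": "work", "number": "555-0101"},
    ],
}


def mobile_numbers(doc):
    """Return mobile numbers from the embedded array; no second lookup needed."""
    return [p["number"] for p in doc["phones"] if p["type"] == "mobile"]


print(mobile_numbers(customer))
```

Everything the application needs about the customer arrives in one read, which is the data locality described above.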

However, not all 1:1 and 1:many relationships are suitable for embedding in a single document. Referencing between documents in different collections should be used when:

A document is frequently read but contains data that is rarely accessed. Embedding this data only increases the in-memory requirements (the working set) of the collection.
One part of a document is frequently updated and constantly growing in size, while the remainder of the document is relatively static.
The combined document size would exceed MongoDB’s 16MB document limit, for example when modeling many:1 relationships, such as product reviews to product.
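
To make the third case more concrete, here is a minimal sketch that keeps product reviews in their own collection and references the product by id, alongside a rough size check. All names are hypothetical, and the check uses the JSON-encoded length only as a stand-in for the real BSON size, which will differ.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # MongoDB's 16MB document limit

# Hypothetical documents: each review references its product by product_id
# instead of being embedded in an ever-growing array on the product.
product = {"_id": "p1", "name": "Espresso Machine", "price": 249.0}
reviews = [
    {"_id": n, "product_id": "p1", "stars": 5, "text": "Great crema. " * 20}
    for n in range(1000)
]


def approx_size(doc):
    """Approximate document size via its JSON encoding (not exact BSON size)."""
    return len(json.dumps(doc).encode("utf-8"))


# What the product would look like if every review were embedded instead.
embedded = dict(product, reviews=reviews)
print(approx_size(product), approx_size(embedded))
print(approx_size(embedded) < MAX_DOC_BYTES)
```

The referenced product document stays small and stable no matter how many reviews accumulate, while the embedded variant grows with every review.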

Referencing

Referencing can help address the challenges cited above and is also typically used when modeling many:many relationships. However, the application will need to issue follow-up queries to resolve the reference, requiring additional round-trips to the server, or require a “joining” operation using MongoDB’s $lookup aggregation pipeline stage.
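
The two ways of resolving a reference can be sketched as follows. The pipeline shows the shape of a $lookup stage for hypothetical "products" and "reviews" collections; with a driver such as PyMongo it would be passed to the collection's aggregate() method. The resolve_in_app function is only an in-memory simulation of the follow-up-query approach, so the example runs without a server.

```python
# $lookup stage joining hypothetical "products" to "reviews" on product_id.
# No server is contacted here; this only shows the shape of the pipeline.
pipeline = [
    {"$match": {"_id": "p1"}},
    {
        "$lookup": {
            "from": "reviews",
            "localField": "_id",
            "foreignField": "product_id",
            "as": "reviews",
        }
    },
]


def resolve_in_app(product, all_reviews):
    """Application-side alternative: a follow-up query, simulated in memory."""
    matched = [r for r in all_reviews if r["product_id"] == product["_id"]]
    return dict(product, reviews=matched)


product = {"_id": "p1", "name": "Espresso Machine"}
all_reviews = [
    {"_id": 1, "product_id": "p1", "stars": 5},
    {"_id": 2, "product_id": "p2", "stars": 3},
]
print(resolve_in_app(product, all_reviews))
```

In a real application the follow-up query costs an extra round-trip to the server, whereas $lookup performs the join on the server side.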

Digging Deeper

Data modeling is an expansive topic and has filled previous blog series on its own. To help you make the right decisions, here is a summary of the key resources you should review:

The MongoDB documentation provides an extensive section on data modeling, starting from high-level concepts of the document data model before progressing to practical examples and design patterns, including more detail on referencing and embedding.
You should also review our Building with Patterns blog series to learn more about specific schema design best practices for different use cases, including catalog and content management, IoT, mobile apps, analytics, and single view (i.e. customer 360). It overlays these use cases with specific schema design patterns such as versioning, bucketing, referencing, and graphs.
MongoDB University offers a no-cost, web-based training course on data modeling. This is a great way to kick-start your learning on schema design with the document data model.

Review your Data Model

Once you have developed an initial data model and begun populating it with sample application data, it is helpful to be able to review it.

MongoDB Compass is the free GUI for MongoDB. You can do a lot with Compass, so it’s a tool we’ll come back to regularly during this blog series. One of its most useful features is schema visualization, enabling you to explore your schema with histograms that show your documents’ fields, data types, and values. As you’ll see later in the series, you can also visualize query explain plans and index coverage straight from the Compass UI.

Figure 1: Visualizing your schema in MongoDB Compass

In Figure 1, we are reviewing the schema for documents stored in the restaurants collection. For the documents sampled, Compass displays how frequently fields appear in each document, the range of values they contain along with the data types, and the number of elements in the categories array. The Compass documentation has more detail on how to analyze your schema.
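
To give a feel for the kind of summary described above, the following sketch computes how often each field appears in a sample of documents and which data types it holds. It is a simplified illustration under our own assumptions, not how Compass is implemented, and the sample restaurant documents are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_schema(docs):
    """Return {field: (share of docs containing it, Counter of type names)}."""
    if not docs:
        return {}
    present = Counter()
    types = defaultdict(Counter)
    for doc in docs:
        for field, value in doc.items():
            present[field] += 1
            types[field][type(value).__name__] += 1
    return {f: (present[f] / len(docs), types[f]) for f in present}


# Hypothetical sample of restaurant documents.
sample = [
    {"name": "Noodle Bar", "categories": ["asian", "noodles"], "stars": 4},
    {"name": "Cafe Uno", "categories": ["coffee"]},
    {"name": "Grill 9", "categories": [], "stars": 3.5},
]

for field, (share, kinds) in summarize_schema(sample).items():
    print(f"{field}: in {share:.0%} of docs, types={dict(kinds)}")
```

A field that appears in only some documents, or with mixed types, is often worth a second look before the data model is finalized.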

Compass can connect to MongoDB instances you are managing yourself or your cloud databases on MongoDB Atlas. You can also view document structures directly from the Atlas UI using the Data Explorer, or “Collections” view.

Getting Started with Documents

The best way to explore and experiment with data modeling is to spin up MongoDB on the fully-managed Atlas cloud service.

Our documentation steps you through how to create a free MongoDB database cluster in the region and on the cloud provider of your choice. You can also load our sample datasets, providing you with a simple way of getting started with documents.

Memory Sizing: Ensure your working set fits in RAM

Beyond data modeling, the second major consideration in performance optimization is sizing your working set.

As with most databases, MongoDB performs best when the application’s working set (indexes and most frequently accessed data) fits in memory. RAM size is the most important factor for instance sizing; other optimizations may not significantly improve the performance of the database if there is insufficient RAM. If price/performance is a higher priority than performance alone, then using fast SSDs to complement smaller amounts of RAM is a viable design choice. You should test the optimum balance for your workload and SLAs.

When the application’s working set fits in RAM, read activity from disk will be low. You can analyze this with the tools we will cover in the Query Profiling blog post, which is next up in this series.

If your working set exceeds the RAM of your chosen instance size or server, consider moving to a larger instance with more memory or partitioning (sharding) your database across multiple servers.
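
A back-of-the-envelope version of this check might look like the sketch below. It treats the working set as total index size plus the frequently accessed share of the data, then compares that with the portion of RAM assumed to be available to the database. The deployment figures, the hot_fraction estimate, and the usable_fraction default of 0.5 are all hypothetical assumptions; measure your own workload rather than relying on them.

```python
GIB = 1024 ** 3


def working_set_bytes(index_bytes, data_bytes, hot_fraction):
    """Indexes plus the share of data that is accessed frequently."""
    return index_bytes + int(data_bytes * hot_fraction)


def fits_in_ram(working_set, ram_bytes, usable_fraction=0.5):
    """True if the working set fits in the RAM share assumed to be available.

    usable_fraction is an assumption for illustration, not a MongoDB setting.
    """
    return working_set <= ram_bytes * usable_fraction


# Hypothetical deployment: 6 GiB of indexes, 200 GiB of data, 10% of it hot.
ws = working_set_bytes(6 * GIB, 200 * GIB, 0.10)
for ram_gib in (32, 64):
    print(ram_gib, "GiB RAM:", fits_in_ram(ws, ram_gib * GIB))
```

If the check fails for every instance size you can reasonably run, that is a signal to consider sharding rather than scaling up further.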

Properly sizing the working set holds true whether you run MongoDB on Atlas or manage MongoDB yourself.

Review the Atlas sizing and tier selection documentation for guidance on calculating your working set size.
We will dig deeper into hardware sizing for self-managed MongoDB later in this series.

In MongoDB Atlas, scaling compute and storage is straightforward. You can opt into cluster tier auto-scaling, which adjusts compute capacity for you in response to changes in application demand.

Cluster tier auto-scaling in Atlas monitors both CPU and memory utilization over defined periods and scales your instance sizes up or down within the limits you configure. All scaling events are performed in a rolling fashion, so there is no impact on your applications. At the time of writing, auto-scaling is a beta feature. If you want to control scaling events yourself, then you can do that in just a few clicks in the Atlas UI, or via an API call.

What’s Next

That wraps up our first post in this performance best practices series. Next up: query patterns and profiling.
