Performance Best Practices: Indexing

Mat Keep and Henrik Ingo
February 11, 2020 | Updated: October 2, 2023
#indexing


Welcome to the third in our series of blog posts covering performance best practices for MongoDB.

In this series, we are covering key considerations for achieving performance at scale across a number of important dimensions, including:

Data modeling and sizing memory (the working set)
Query patterns and profiling
Indexing, which we’ll cover today
Sharding
Transactions and read/write concerns
Hardware and OS configuration
Benchmarking

Having both worked for a couple of different database vendors over the past 15 years, we can safely say that failing to define the appropriate indexes is the number one performance issue technical support teams have to address with users.

So we need to get it right. Here are the best practices to help you.

Indexes in MongoDB

In any database, indexes support the efficient execution of queries. Without them, the database must scan every document in a collection or table to select those that match the query statement. If an appropriate index exists for a query, the database can use the index to limit the number of documents it must inspect.

MongoDB offers a broad range of index types and features, with language-specific sort orders, to support complex access patterns to your data. Indexes can be created and dropped on demand to accommodate evolving application requirements and query patterns, and can be declared on any field within your documents, including fields nested within arrays.

So let's cover how you make the best use of indexes in MongoDB.

Use Compound Indexes

Compound indexes are indexes composed of several different fields. For example, instead of having one index on "Last name" and another on "First name", it is typically more efficient to create a single index that includes both "Last name" and "First name" if you query against both of the names. Because "Last name" is the leading field, this compound index can still be used for queries that specify the last name only.
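To make the prefix behavior concrete, here is a minimal Python sketch. The collection name users and the field names last_name and first_name are assumptions for illustration. The first helper builds the document you would send as a createIndexes database command; the second is a simplified check, not MongoDB's actual query planner, of whether a query's fields form a leading prefix of the index.

```python
# Hypothetical compound index: last name first, then first name (1 = ascending).
NAME_INDEX = [("last_name", 1), ("first_name", 1)]


def create_index_command(collection, keys):
    """Build a createIndexes command document for an ordered list of keys."""
    name = "_".join(f"{field}_{direction}" for field, direction in keys)
    return {
        "createIndexes": collection,
        "indexes": [{"key": dict(keys), "name": name}],
    }


def uses_index_prefix(query_fields, keys):
    """Simplified check: do the queried fields form a leading prefix of the index?"""
    prefix = [field for field, _ in keys][: len(query_fields)]
    return set(prefix) == set(query_fields)


if __name__ == "__main__":
    print(create_index_command("users", NAME_INDEX))
    print(uses_index_prefix({"last_name"}, NAME_INDEX))
    print(uses_index_prefix({"first_name"}, NAME_INDEX))
```

A query on the last name alone matches the prefix, while a query on the first name alone does not, which is why the order of fields in a compound index matters.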

Follow the ESR rule

For compound indexes, this rule of thumb is helpful in deciding the order of fields in the index:

First, add those fields against which Equality queries are run.
The next fields to be indexed should reflect the Sort order of the query.
The last fields represent the Range of data to be accessed.
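The ordering above can be sketched as a small helper. This is only an illustration of the rule of thumb: the function name and the example fields (status, created_at, qty) are hypothetical, and a real workload may call for a different order.

```python
def esr_index_keys(equality, sort, range_fields):
    """Order index keys as Equality, then Sort, then Range.

    equality and range_fields are lists of field names; sort is a list of
    (field, direction) pairs taken from the query's sort specification.
    """
    candidates = (
        [(field, 1) for field in equality]
        + list(sort)
        + [(field, 1) for field in range_fields]
    )
    keys, seen = [], set()
    for field, direction in candidates:
        if field not in seen:  # keep the first (highest priority) position
            seen.add(field)
            keys.append((field, direction))
    return keys


if __name__ == "__main__":
    # Hypothetical query: status == "A", qty > 20, sorted by created_at descending.
    print(esr_index_keys(["status"], [("created_at", -1)], ["qty"]))
```

For that hypothetical query the helper places status first, created_at second, and qty last.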

Use Covered Queries When Possible

Covered queries return results from an index directly without having to access the source documents, and are therefore very efficient.

For a query to be covered, all the fields needed for filtering, sorting, and/or being returned to the client must be present in an index. To determine whether a query is a covered query, use the explain() method. If the explain() output displays totalDocsExamined as 0, this shows the query is covered by an index. Read more in the documentation for explain results.
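As a rough sketch, the check can be automated once you have the explain() output as a dictionary. The sample documents below are hand-written for illustration, not real server output, and the extra totalKeysExamined test is our own assumption, added so that a query which examined nothing at all is not reported as covered.

```python
def is_covered(explain_output):
    """Return True if the executionStats suggest the query was served from an index."""
    stats = explain_output["executionStats"]
    return stats["totalDocsExamined"] == 0 and stats["totalKeysExamined"] > 0


# Illustrative, hand-written excerpts shaped like explain("executionStats") output.
covered = {"executionStats": {"nReturned": 3, "totalKeysExamined": 3, "totalDocsExamined": 0}}
not_covered = {"executionStats": {"nReturned": 3, "totalKeysExamined": 3, "totalDocsExamined": 3}}

if __name__ == "__main__":
    print(is_covered(covered), is_covered(not_covered))
```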

A common gotcha when trying to achieve covered queries is that the _id field is always returned by default. You need to explicitly exclude it from query results, or add it to the index.
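The _id gotcha is easy to see in a simplified model. The sketch below assumes inclusion-style projections only and hypothetical field names; it ignores other conditions that can prevent a query from being covered.

```python
def returned_fields(projection):
    """Fields an inclusion projection returns; _id is included unless set to 0."""
    fields = {f for f, include in projection.items() if include and f != "_id"}
    if projection.get("_id", 1):
        fields.add("_id")
    return fields


def can_be_covered(filter_fields, projection, index_fields):
    """Simplified test: every filtered or returned field must be in the index."""
    needed = set(filter_fields) | returned_fields(projection)
    return needed <= set(index_fields)


INDEX_FIELDS = ["last_name", "first_name"]  # hypothetical compound index

if __name__ == "__main__":
    print(can_be_covered({"last_name"}, {"last_name": 1, "first_name": 1}, INDEX_FIELDS))
    print(can_be_covered({"last_name"}, {"last_name": 1, "first_name": 1, "_id": 0}, INDEX_FIELDS))
```

The first projection implicitly returns _id, which is not in the index, so the query cannot be covered. Excluding _id in the second projection removes that obstacle.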

In sharded clusters, MongoDB internally needs to access the fields of the shard key. This means covered queries are only possible when the shard key is part of the index. It is usually a good idea to do this anyway.

Use Caution When Considering Indexes on Low-Cardinality Fields

Queries on fields with a small number of unique values (low cardinality) can return large result sets. Compound indexes may include fields with low cardinality, but the value of the combined fields should exhibit high cardinality.
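One way to sanity-check a candidate index is to count distinct values on a sample of documents. The sketch below uses made-up data with hypothetical status and customer_id fields; on a real collection you would run an equivalent count against a representative sample.

```python
def cardinality(docs, fields):
    """Number of distinct value combinations for the given fields."""
    return len({tuple(doc.get(field) for field in fields) for doc in docs})


# Made-up sample: 50 customers, each with one open and one closed order.
SAMPLE = [
    {"status": status, "customer_id": customer}
    for customer in range(50)
    for status in ("open", "closed")
]

if __name__ == "__main__":
    print(cardinality(SAMPLE, ["status"]))
    print(cardinality(SAMPLE, ["status", "customer_id"]))
```

In this sample, status alone has only two distinct values, while the combination of status and customer_id is unique per document, which is the property you want from a compound index that includes a low-cardinality field.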

Eliminate Unnecessary Indexes

Indexes are resource-intensive: even with compression in the MongoDB WiredTiger storage engine, they consume RAM and disk. As fields are updated, associated indexes must be maintained, incurring additional CPU and disk I/O overhead.

MongoDB provides tooling to help you understand index usage, which we will cover later in this post.

Wildcard Indexes Are Not a Replacement for Workload-Based Index Planning

For workloads with many ad-hoc query patterns or that handle highly polymorphic document structures, wildcard indexes give you a lot of extra flexibility. You can define a filter that automatically indexes all matching fields, subdocuments, and arrays in a collection.

As with any index, they also need to be stored and maintained, so they will add overhead to the database. If your application’s query patterns are known in advance, then you should use more selective indexes on the specific fields accessed by the queries.
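For reference, a wildcard index is declared with the special $** key. The sketch below only builds createIndexes command documents; the collection name products and the userMetadata path are hypothetical.

```python
def wildcard_index_command(collection, path=None):
    """Build a createIndexes command for a wildcard index.

    With no path, every field in the document is indexed; with a path,
    only fields under that path are indexed.
    """
    key_field = f"{path}.$**" if path else "$**"
    return {
        "createIndexes": collection,
        "indexes": [{"key": {key_field: 1}, "name": f"{key_field}_1"}],
    }


if __name__ == "__main__":
    print(wildcard_index_command("products"))
    print(wildcard_index_command("products", "userMetadata"))
```

Scoping the wildcard to a single subdocument, as in the second call, keeps the index smaller than indexing every field in the collection.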

Use text search to match words inside a field

Regular indexes are useful for matching the entire value of a field. If you only want to match on a specific word in a field with a lot of text, then use a text index.
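A minimal sketch of the two pieces involved: the index declaration, which uses the string "text" in place of a sort direction, and a query filter using the $text operator. The collection name reviews and the field name comments are assumptions for illustration.

```python
def text_index_command(collection, field):
    """Build a createIndexes command for a text index on one field."""
    return {
        "createIndexes": collection,
        "indexes": [{"key": {field: "text"}, "name": f"{field}_text"}],
    }


def text_search_filter(words):
    """Build a query filter that matches documents containing any of the words."""
    return {"$text": {"$search": " ".join(words)}}


if __name__ == "__main__":
    print(text_index_command("reviews", "comments"))
    print(text_search_filter(["espresso"]))
```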

If you are running MongoDB in the Atlas service, consider using Atlas Full Text Search, which provides a fully-managed Lucene index integrated with the MongoDB database. FTS provides higher performance and greater flexibility to filter, rank, and sort through your database to quickly surface the most relevant results to your users.

Use Partial Indexes

Reduce the size and performance overhead of indexes by only including documents that will be accessed through the index. For example, create a partial index on the orderID field that only includes order documents with an orderStatus of "In progress", or only indexes the emailAddress field for documents where it exists.
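The two examples above might be declared as follows. This sketch builds createIndexes command documents using the partialFilterExpression option; the collection names are hypothetical, and the small matching helper is a simplified stand-in that only understands equality and $exists, included to show which documents would end up in each index.

```python
def partial_index_command(collection, field, partial_filter):
    """Build a createIndexes command for a partial index on one field."""
    return {
        "createIndexes": collection,
        "indexes": [
            {
                "key": {field: 1},
                "name": f"{field}_1_partial",
                "partialFilterExpression": partial_filter,
            }
        ],
    }


def matches_partial_filter(doc, partial_filter):
    """Simplified matcher: supports equality and {"$exists": bool} only."""
    for field, condition in partial_filter.items():
        if isinstance(condition, dict) and "$exists" in condition:
            if (field in doc) != condition["$exists"]:
                return False
        elif doc.get(field) != condition:
            return False
    return True


IN_PROGRESS = {"orderStatus": "In progress"}
HAS_EMAIL = {"emailAddress": {"$exists": True}}

if __name__ == "__main__":
    print(partial_index_command("orders", "orderID", IN_PROGRESS))
    print(partial_index_command("customers", "emailAddress", HAS_EMAIL))
```

Only documents that match the filter expression receive an entry in the index, which is where the size and maintenance savings come from.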

Take Advantage of Multi-Key Indexes for Querying Arrays

If your query patterns require accessing individual array elements, use a multi-key index. MongoDB creates an index key for each element in the array, and multi-key indexes can be constructed over arrays that hold both scalar values and nested documents.
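A simplified model of what "an index key for each element" means is sketched below. It is a conceptual illustration only, not how the storage engine represents entries, and the tags field is a hypothetical example.

```python
def multikey_entries(doc, field):
    """Return one (key, _id) pair per array element, or a single pair for a scalar."""
    value = doc.get(field)
    elements = value if isinstance(value, list) else [value]
    return [(element, doc["_id"]) for element in elements]


if __name__ == "__main__":
    print(multikey_entries({"_id": 1, "tags": ["red", "blue"]}, "tags"))
    print(multikey_entries({"_id": 2, "tags": "green"}, "tags"))
```

Because each array element produces its own entry, a document with a large array contributes many entries to the index, which is worth keeping in mind when sizing it.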

Avoid Regular Expressions That Are Not Left Anchored or Rooted

Indexes are ordered by value. Leading wildcards are inefficient and may result in full index scans. Trailing wildcards can be efficient if there are sufficient case-sensitive leading characters in the expression.
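The reason anchoring matters is that a literal prefix lets the database jump to one contiguous slice of the ordered index. The sketch below extracts that prefix from a pattern in a deliberately conservative way; it is our own simplified helper, not MongoDB's regex analysis, and it treats escapes and alternation as "no usable prefix".

```python
SPECIAL = set(".^$*+?{}[]\\|()")


def literal_prefix(pattern):
    """Return the literal prefix of a ^-anchored pattern, or "" if there is none."""
    if not pattern.startswith("^") or "|" in pattern:
        return ""
    prefix = []
    for ch in pattern[1:]:
        if ch in SPECIAL:
            # A quantifier that allows zero repeats makes the previous
            # character optional, so it cannot be part of the prefix.
            if ch in "*?{" and prefix:
                prefix.pop()
            break
        prefix.append(ch)
    return "".join(prefix)


if __name__ == "__main__":
    for pattern in ("^Smi", "^Smi.*", "Smi", ".*son$"):
        print(pattern, repr(literal_prefix(pattern)))
```

Patterns with a non-empty prefix can be narrowed to a range of the index, while patterns that return an empty string would have to consider every key.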

Avoid Case Insensitive Regular Expressions

If the sole reason for using a regex is case insensitivity, use a case insensitive index instead, as those are faster.
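A case insensitive index is declared with a collation whose strength ignores case, and queries need to specify the same collation to use it. The sketch below builds the two command documents; the collection and field names are hypothetical, and the locale "en" with strength 2 is one common choice rather than the only one.

```python
# Strength 2 compares base characters and diacritics but ignores case.
CASE_INSENSITIVE = {"locale": "en", "strength": 2}


def case_insensitive_index_command(collection, field):
    """Build a createIndexes command for a case insensitive index."""
    return {
        "createIndexes": collection,
        "indexes": [
            {"key": {field: 1}, "name": f"{field}_ci", "collation": CASE_INSENSITIVE}
        ],
    }


def case_insensitive_find_command(collection, field, value):
    """Build a find command that uses the same collation as the index."""
    return {"find": collection, "filter": {field: value}, "collation": CASE_INSENSITIVE}


if __name__ == "__main__":
    print(case_insensitive_index_command("users", "last_name"))
    print(case_insensitive_find_command("users", "last_name", "smith"))
```

With this pairing, an equality match on "smith" can also find "Smith" and "SMITH" without a regular expression.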

Use Index Optimizations Available in the WiredTiger Storage Engine

If you are self-managing MongoDB, you can optionally place indexes on their own separate volume, allowing for faster disk paging and lower contention. See wiredTiger options for more information.

Use the Explain Plan

We covered the use of MongoDB’s explain plan in the previous query patterns and profiling post, and this is the best tool to check on index coverage for individual queries.

Working from the explain plan, MongoDB provides visualization tools that help you better understand your indexes and that offer intelligent, automatic recommendations on which indexes to add.

Visualize Index Coverage With MongoDB Compass and Atlas Data Explorer

As the free GUI for MongoDB, Compass provides many features to help you optimize query performance, including exploring your schema and visualizing query explain plans – two areas covered previously in this series.

The indexes tab in Compass adds another tool to your arsenal. It lists the existing indexes for a collection, reporting the name and keys of the index, along with its type, size, and any special properties. Through the index tab you can also add and drop indexes as needed.

Figure 1: Managing indexes with MongoDB Compass

A really useful feature is index usage, which shows you how often an index has been used. Having too many indexes can be almost as damaging to your performance as having too few, making this feature especially valuable in helping you identify and remove indexes that are not being used. This helps you free working set space and eliminates the database overhead that comes from maintaining the index.

If you are running MongoDB in our fully-managed Atlas service, the indexes view in the Data Explorer will give you the same functionality as Compass, without you having to connect to your database with a separate tool.

You can also retrieve index statistics using the $indexStats aggregation pipeline stage.
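As a sketch of how that might be used, the pipeline below contains the single $indexStats stage, and the helper picks out indexes whose access counter is zero. The sample results are hand-written in the shape of $indexStats output, not taken from a real cluster. Keep in mind that the counters cover only the period since the reported "since" time and are tracked per server, so treat a zero as a prompt to investigate rather than proof that an index is unused.

```python
# Aggregation pipeline to run against the collection you want to inspect.
INDEX_STATS_PIPELINE = [{"$indexStats": {}}]


def unused_indexes(index_stats):
    """Names of indexes with zero recorded accesses, excluding the _id index."""
    return [
        stat["name"]
        for stat in index_stats
        if stat["accesses"]["ops"] == 0 and stat["name"] != "_id_"
    ]


# Hand-written sample shaped like $indexStats results (illustrative only).
SAMPLE_STATS = [
    {"name": "_id_", "key": {"_id": 1}, "accesses": {"ops": 0}},
    {"name": "status_1", "key": {"status": 1}, "accesses": {"ops": 1520}},
    {"name": "legacy_flag_1", "key": {"legacy_flag": 1}, "accesses": {"ops": 0}},
]

if __name__ == "__main__":
    print(unused_indexes(SAMPLE_STATS))
```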

Automated Index Recommendations

Even with all of the telemetry provided by MongoDB’s tools, you are still responsible for pulling and analyzing the required data to make decisions on which indexes to add.

In Atlas, the Performance Advisor recommends indexes based on slow queries. The threshold for slow queries varies based on the average time of operations on your cluster, so that the recommendations are pertinent to your workload.

Recommended indexes are accompanied by sample queries, grouped by query shape (i.e., queries with a similar predicate structure, sort, and projection), that were run against a collection that would benefit from the addition of a suggested index. The Performance Advisor does not negatively affect the performance of your Atlas clusters.

If you are happy with the recommendation, you can then roll out the new indexes automatically, without incurring any application downtime.

What’s Next

That wraps up this latest installment of the performance best practices series. MongoDB University offers a no-cost, web-based training course on MongoDB performance. This is a great way to learn more about the power of indexing.

Next up in this series: sharding.
