MongoDB Engineering Blog

Posts on Engineering efforts, achievements, and culture at MongoDB.

We Replaced an SSD with Storage Class Memory; Here is What We Learned

On April 2, 2019, Intel Optane Persistent Memory became the first commercially available storage class memory (SCM) product. Like an SSD, this memory is persistent, and like DRAM it sits on the memory bus. Long before its commercial release, system architects pondered exactly how SCM fits into the storage hierarchy, and now came an opportunity to perform concrete measurements. One question we wanted to answer is whether a storage device sitting on the memory bus can deliver better throughput than an equivalent storage device sitting on the PCIe bus.

There are two ways to access SCM: a byte interface and a block interface. The byte interface allows using load and store instructions, just as with DRAM. The block interface exposes SCM as a block device, optionally with a file system on top; this way it can be accessed just like a conventional SSD. The load/store API is streamlined, because nothing stands between the application and the hardware, but it is also tricky to use, because it does not provide features like crash consistency, the way a file system API usually does. Accessing SCM via the block or file system API comes with OS overhead, but there is no need to rewrite applications.

WiredTiger, the MongoDB storage engine we evaluated in this article, reads and writes data to and from storage in sizeable blocks (typically 4KB or larger). Besides being a necessity on conventional storage hardware today, using the block API has other practical advantages. For example, compression and encryption, features that customers covet, are optimized to work with blocks. Similarly, the checksums that safeguard against data corruption are computed on blocks of data. Furthermore, WiredTiger caches blocks of data in its DRAM-resident cache, which, together with the OS buffer cache, is a boon to performance. Block orientation and reliance on caching positioned WiredTiger, like many other storage engines, to effectively hide the latency of slow storage devices.
As a result, our experiments revealed that the modest latency advantage SCM provides over a technology-equivalent SSD does not translate into performance advantages for realistic workloads: the storage engine effectively masks these latency differences. SCM will shine when it is used for latency-sensitive operations that cannot be hidden with batching and caching, such as logging. In the rest of the article we detail the experiments that led us to this conclusion.

Experimental Platform

We experimented with two storage devices: Intel Optane DC Persistent Memory (PM) and the Intel Optane P4800X SSD. Both are built with Intel Optane 3D XPoint non-volatile memory, but the former is an SCM that sits on the memory bus while the latter is a PCIe-attached SSD.

Microbenchmarks

To begin with, we gauged raw device bandwidth with microbenchmarks that read or write a 32GB file using 8KB blocks. We vary the number of threads simultaneously accessing the file, each in its own chunk. A file can be accessed either via system calls (read/write) or mmap; the latter method usually has less overhead.

The SSD drive's raw performance meets the spec. According to the spec, our P4800X drive is capable of up to 2.5GB/s sequential read bandwidth and up to 2.2GB/s sequential write bandwidth. Here are the numbers we observed via the Ubuntu Linux (5.3 kernel) raw device API, meaning that the data bypasses the OS buffer cache.

Raw SSD performance, sequential reads and writes

The read bandwidth behaves according to the specification as long as we use at least two threads. The write bandwidth, unexpectedly, exceeds its specified upper bound when using multiple threads. We suspect this could be due to buffering of writes either in the OS or on the device. The Optane P4800X SSD is faster than a typical SSD at the time of this writing, but not to the point of being incomparable.
While the Optane SSD offers up to 2.5GB/s of sequential read bandwidth, a typical NAND SSD (e.g., the Intel SSD Pro 6000p series) offers up to 1.8GB/s. The difference is more noticeable with writes: the Optane drive can deliver up to 2.2GB/s, while the NAND drive can do no more than 0.56GB/s. Random reads and writes on the Optane SSD are not much worse than sequential ones. We are able to achieve (using mmap) close to peak sequential throughput with reads, and only 10% short of peak sequential throughput with writes. Near-identical performance of sequential and random access is a known feature of these drives.

SCM offers high-bandwidth reads, but low-bandwidth writes

Now let us look at the raw performance of SCM. Intel systems supporting Optane PM can fit up to six DIMMs; our experimental system had only two. We measured the throughput on a single DIMM and on two DIMMs used together, to extrapolate scaling as the number of DIMMs increases. We also relied on data from other researchers to confirm our extrapolation.

There are two ways to obtain direct access to PM: (1) devdax, where a PM module is exposed as a character device, and (2) fsdax, where a file system sits on top of a PM module masquerading as a block device, but file accesses bypass the buffer cache via the Direct Access (DAX) mode. In our experiments we used the ext4 file system. The following charts show the throughput of sequential reads and writes obtained via these access methods. In all cases we use mmap, because that is the only method supported by devdax. Sequential read bandwidth of a single PM module reaches about 6.4GB/s; that matches the observations of other researchers. Random access behaves almost identically to sequential access, so we omit those charts.

Storage class memory, sequential reads, single PMEM device.

Storage class memory, sequential writes, single PMEM device.

Write experiments tell a different story: single-module write throughput reaches a mere 0.6GB/s.
This measurement does not agree with the data from the UCSD researchers, who observed around 2.3GB/s write bandwidth on a single device. Further investigation led us to believe that the difference is due to hardware versions. That said, our observations reveal that a single PM module achieves write throughput comparable only to a NAND SSD.

Next, let's look at scaling across two devices. The following figures show the measurements for sequential reads and writes, using mmap over fsdax. We used the Linux striped device mapper to spread the load across two DIMMs. For reads, two DIMMs almost double the peak read bandwidth, from 6.4GB/s with one DIMM to 12.4GB/s with two. Similarly, researchers at UCSD observed nearly linear scaling across six DIMMs.

Storage class memory, sequential reads, comparison between one and two PMEM devices.

Storage class memory, sequential writes, comparison between one and two PMEM devices.

For writes, we achieve nearly 1GB/s of write throughput with two DIMMs relative to 0.6GB/s with one, so the scaling is less than linear, if we can extrapolate from a single data point. The UCSD researchers observed that bandwidth with six DIMMs improved by 5.6x relative to a single DIMM, which is in line with our observation. Extrapolating from these data points, if our system had six DIMMs, we would observe around 3.4GB/s of peak write bandwidth, about 50% better than the Optane SSD.

In summary, with bare device access we see about 2.5GB/s of peak read bandwidth on the Optane SSD and about 6.4GB/s on a single Optane PM module; with six modules, the read throughput would be roughly 38GB/s. Write throughput was only 0.6GB/s on a single PM module, projected to reach 3.4GB/s with six, while the Optane SSD reached 2.2GB/s.
Optane SCM has a significant edge over the SSD for reads, and a small advantage for writes, provided you can afford six PM modules; otherwise, an SSD will deliver higher write throughput.

Software caching attenuates SCM performance advantage

While SCM is closer to the speed of DRAM than traditional storage media, DRAM is still faster, so the advantages of DRAM caching are difficult to overlook. The following charts show that with the buffer cache on (here we are using ext4 without the DAX option), all devices perform roughly the same, regardless of whether we are doing reads or writes, random or sequential access. These experiments were done with a warm buffer cache, i.e., the file was already in the buffer cache before the experiment began, so here we are measuring pure DRAM performance. With access to data taking less time, software overheads become more evident, which is why mmap is much faster than system calls when we use eight or more threads.

Sequential reads on SSD and SCM with a warm buffer cache.

Random reads on SSD and SCM with a warm buffer cache.

Sequential writes on SSD and SCM with a warm buffer cache.

Random writes on SSD and SCM with a warm buffer cache.

If we begin each experiment with a cold buffer cache, the difference between the devices is still visible, but less apparent than if we bypass the buffer cache altogether. With a cold buffer cache, on the read path the OS has to copy the data from the storage device into the buffer cache before making it available to the application, adding software overhead. Furthermore, with the buffer cache on, the OS does not use huge pages. These factors dampen the raw read advantage of SCM. For writes, whereas SCM delivered lower bandwidth than the SSD with raw access, SCM now outpaces the SSD, likely because the buffer cache absorbs and batches some writes instead of flushing each one to the device immediately.

Sequential reads on SSD and SCM with a cold buffer cache.
Random reads on SSD and SCM with a cold buffer cache.

Sequential writes on SSD and SCM with a cold buffer cache.

Random writes on SSD and SCM with a cold buffer cache.

Experiments with the storage engine

Like most storage engines, WiredTiger was designed and tuned to leverage a DRAM cache. Both WiredTiger's internal cache and the OS buffer cache are crucial for performance in all workloads we measured. Running WiredTiger without the OS buffer cache (fsdax mode) reduced its performance by up to 30x in our experiments; hence, we did not use the direct-access mode.

We ran WiredTiger's wtperf benchmark suite, which was designed to stress various parts of the system and emulate typical workloads observed in practice. The WiredTiger internal cache size varies between a few and a hundred gigabytes across the benchmarks, and most benchmarks use at least a dozen threads. (Detailed configuration parameters for all benchmarks can be viewed here.) There is no locking in WiredTiger on the common path, so thread-level concurrency usually translates into high CPU utilization and concurrent I/O. As described in our previous blog post, we added a feature to WiredTiger to use mmap for I/O instead of system calls.

The following charts show the performance of the wtperf suite on Intel Optane SCM (one and two modules) and on the Optane SSD. Apart from one write-intensive benchmark, evict-btree-1, which is faster on SSD, there are no statistically significant differences between the two. Using a dual-module SCM over a single-module SCM gives no performance advantage either. While SCM has higher bandwidth than a technology-equivalent SSD, the advantage is within an order of magnitude, and, as it turns out, effective DRAM caching hides that difference. Latency, and not bandwidth, is where SCM can shine.
In contrast to bandwidth, the latency of reading a block of data from Optane PM is two orders of magnitude shorter than reading it from the Optane SSD: about 1 microsecond versus 100-200 microseconds. The most obvious place in a storage engine where latency could be the bottleneck is logging, and the academic literature is rife with successful examples of using Optane PM for logging. In the meantime, stay tuned for our next report on exploring persistent memory.

WiredTiger benchmarks on SSD and SCM. Group 1.

WiredTiger benchmarks on SSD and SCM. Group 2.

WiredTiger benchmarks on SSD and SCM. Group 3.

If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

August 27, 2020
Engineering Blog

Getting Storage Engines Ready for Fast Storage Devices

Over the past two decades, the performance of storage hardware has increased by two orders of magnitude: first with the introduction of solid state drives (SSDs), then with the transition from SATA to PCIe, and finally with innovation in non-volatile memory technology and the manufacturing process [1, 7]. More recently, in April 2019, Intel released the first commercial Storage Class Memory (SCM). Its Optane DC Persistent Memory, built with 3D XPoint technology, sits on the memory bus and further reduces I/O latency [2].

While device access used to dominate I/O latency, the cost of navigating the software stack of a storage system becomes more prominent as device access times shrink. This has resulted in a flurry of academic research and in changes to commercially used operating systems (OS) and file systems. Despite these efforts, mainstream system software is failing to keep up with rapidly evolving hardware. Studies [4, 5, 6] have shown that file system and other OS overheads still dominate the cost of I/O on very fast storage devices, such as SCMs. In response to these challenges, academics proposed a new user-level file system, SplitFS [6], that substantially reduces these overheads. Unfortunately, adopting a user-level file system is not a viable option for many commercial products. Apart from concerns about correctness, stability, and maintenance, adopting SplitFS would restrict portability, as it runs only on Linux and only on top of the ext4-DAX file system.

Fortunately, there is something that can be done in software storage engines that care about I/O performance. Within MongoDB's storage engine, WiredTiger, we were able to essentially remove the brakes that the file system applied to our performance without sacrificing the convenience it provides or losing portability. Our changes rely on using memory-mapped files for I/O and batching expensive file system operations.
These changes resulted in up to 63% performance improvements for 19 out of 65 benchmarks on mainstream SSDs.

Streamlining I/O in WiredTiger

Our changes to WiredTiger were inspired by a study from UCSD [4], in which the authors demonstrated that by using memory-mapped files for I/O and by pre-allocating some extra space in a file whenever it needed to grow, they could achieve almost the same performance as if the file system were completely absent.

Memory-mapped files

Memory-mapped files work as follows. The application makes an mmap system call, whereby it requests the operating system to "map" a chunk of its virtual address space to a same-sized chunk in the file of its choice (Step 1 in Fig. 1). When it accesses memory in that portion of the virtual address space for the first time (e.g., virtual page 0xABCD in Fig. 1), the following events take place:

1. Since this virtual address has not been accessed before, the hardware generates a trap and transfers control to the operating system.
2. The operating system determines that this is a valid virtual address and asks the file system to read the corresponding page-sized part of the file into its buffer cache.
3. The operating system creates a page table entry mapping the user virtual page to the physical page in the buffer cache (e.g., physical page 0xFEDC in Fig. 1), where that part of the file resides (Step 2 in Fig. 1).
4. Finally, the virtual-to-physical translation is inserted into the Translation Lookaside Buffer (TLB, a hardware cache for these translations), and the application proceeds with the data access.
Fig. 1. Memory-mapped files work as follows: (1) establish a virtual memory area for the mapped file, (2) place the virtual-to-physical address translation into the page table, (3) cache the translation in the Translation Lookaside Buffer (TLB).

Subsequent accesses to the same virtual page may or may not require operating system involvement, depending on the following:

1. If the physical page containing the file data is still in the buffer cache and the page table entry is in the TLB, operating system involvement is not necessary, and the data is accessed using regular load or store instructions.
2. If the page containing the file data is still in the buffer cache but the TLB entry was evicted, the hardware transitions into kernel mode, walks the page table to find the entry (assuming the x86 architecture), installs it into the TLB, and then lets the software access the data using regular load or store instructions.
3. If the page containing the file data is not in the buffer cache, the hardware traps into the OS, which asks the file system to fetch the page, sets up the page table entry, and proceeds as in scenario 2.

In contrast, system calls cross the user/kernel boundary every time we need to access a file. Even though memory-mapped I/O also crosses the user/kernel boundary in the second and third scenarios described above, the path it takes through the system stack is more efficient than the one taken by system calls. Dispatching and returning from a system call adds CPU overhead that memory-mapped I/O does not have [8]. Furthermore, if the data is copied from the memory-mapped file area to another application buffer, it typically uses a highly optimized AVX-based implementation of memcpy. When the data is copied from kernel space into user space via a system call, the kernel has to use a less efficient implementation, because the kernel does not use AVX registers [8].
Pre-allocating file space

Memory-mapped files allow us to substantially reduce the involvement of the OS and the file system when accessing a fixed-size file. If the file grows, however, we do need to involve the file system. The file system will update the file metadata to indicate its new size and ensure that these updates survive crashes. Ensuring crash consistency is especially expensive, because each journal record must be persisted to storage to make sure it is not lost in the event of a crash. If we grow a file piecemeal, we incur that overhead quite often. That is why the authors of SplitFS [6] and the authors of the UCSD study [4] both pre-allocate a large chunk of the file when an application extends it. In essence, this strategy batches file system operations to reduce their overhead.

Our Implementation

The team applied these ideas to WiredTiger in two phases. First, we implemented a design in which the size of the mapped file area never changes. Then, after making sure that this simple design works and yields performance improvements, we added the ability to remap files as they grow or shrink. That feature required efficient inter-thread synchronization and was the trickiest part of the whole design; we highlight it later in this section. Our changes have been in testing in the develop branch of WiredTiger since January 2020. As of this writing, the changes are only for POSIX systems; a Windows port is planned for the future.

Assuming a fixed-size mapped file area

Implementing this part required few code changes. WiredTiger provides wrappers for all file-related operations, so we only needed to modify those wrappers. Upon opening a file, we issue the mmap system call to also map it into the virtual address space. Subsequent calls to wrappers that read or write the file copy the desired part of the file from the mapped area into the supplied buffer. WiredTiger allows three ways to grow or shrink the size of a file.
The file can grow explicitly via the fallocate system call (or its equivalent), it can grow implicitly if the engine writes to the file beyond its end, or it can shrink via the truncate system call. In our preliminary design we disallowed explicitly growing or shrinking the file, which did not affect the correctness of the engine. If the engine writes to the file beyond the mapped area, our wrapper functions simply fall back to system calls; if the engine then reads a part of the file that has not been mapped, we also resort to a system call. While this implementation was decent as an early prototype, it was too limiting for a production system.

Resizing the mapped file area

The trickiest part of this feature is synchronization. Imagine a scenario involving two threads, one reading the file and the other truncating it. Prior to reading, the first thread checks that the offset from which it reads is within the mapped buffer's boundaries. Assuming it is, it proceeds to copy the data from the mapped buffer. However, if the second thread intervenes just before the copy and truncates the file so that its new size is smaller than the offset from which the first thread reads, the first thread's attempt to copy the data would result in a crash: the mapped buffer is now larger than the file, and copying data from the part of the buffer that extends beyond the end of the file generates a segmentation fault.

An obvious way to prevent this problem is to acquire a lock every time we need to access the file or change its size. Unfortunately, this would serialize I/O and could severely limit performance. Instead, we use a lock-free synchronization protocol inspired by read-copy-update (RCU) [9]. We will refer to all threads that might change the size of the file as writers.
A writer, therefore, is any thread that writes beyond the end of the file, extends it via the fallocate system call, or truncates it. A reader is any thread that reads the file.

Our solution works as follows. A writer first performs the operation that changes the size of the file and then remaps the file into the virtual address space. During this time we want nobody else accessing the mapped buffer, neither readers nor writers. However, it is not necessary to prevent all I/O from occurring at this time; we can simply route I/O through system calls while the writer is manipulating the mapped buffer, since system calls are properly synchronized in the kernel with other file operations. To achieve these goals without locking, we rely on two variables:

mmap_resizing: when a writer wants to indicate to others that it is about to exclusively manipulate the mapped buffer, it atomically sets this flag.

mmap_use_count: a reader increments this counter prior to using the mapped buffer and decrements it when done, so this counter tells us whether anyone is currently using the buffer. The writer waits until this counter drops to zero before proceeding.

Before resizing the file and the mapped buffer, writers execute the function prepare_remap_resize_file; its pseudocode is shown below. Essentially, the writer efficiently waits until no one else is resizing the buffer, then sets the resizing flag to claim exclusive rights to the operation. Then it waits until all readers are done using the buffer.

prepare_remap_resize_file:
  wait:
    /* wait until no one else is resizing the file */
    while (mmap_resizing != 0)
        spin_backoff(...);

    /* Atomically set the resizing flag; if this fails, retry. */
    result = cas(mmap_resizing, 1, ...);
    if (result) goto wait;

    /* Now that we have set the resizing flag, wait for
     * all readers to finish using the buffer. */
    while (mmap_use_count > 0)
        spin_backoff(...);

After executing prepare_remap_resize_file, the writer performs the file-resizing operation, unmaps the buffer, remaps it with the new size, and resets the resizing flag.

The synchronization performed by the readers is shown in the pseudocode of the function read_mmap:

read_mmap:
  /* Atomically increment the reference counter,
   * so no one unmaps the buffer while we use it. */
  atomic_add(mmap_use_count, 1);

  /* If the buffer is being resized, use the system
   * call instead of the mapped buffer. */
  if (mmap_resizing) {
      atomic_decr(mmap_use_count, 1);
      read_syscall(...);
  } else {
      memcpy(dst_buffer, mapped_buffer, ...);
      atomic_decr(mmap_use_count, 1);
  }

As a side note, threads writing the file must perform both the reader synchronization, as in read_mmap, to see if they can use the memory-mapped buffer for I/O, and the writer synchronization in case they are writing past the end of the file (hence extending its size). Please refer to the WiredTiger develop branch for the complete source code.

Batching file system operations

As we mentioned earlier, a crucial finding of the UCSD study that inspired our design [4] was the need to batch expensive file system operations by pre-allocating file space in large chunks. Our experiments with WiredTiger showed that it already uses this strategy to some extent. We ran experiments comparing two configurations: (1) in the default configuration, WiredTiger uses the fallocate system call to grow files; (2) in the restricted configuration, WiredTiger is not allowed to use fallocate and thus resorts to implicitly growing files by writing past their end. We measured the number of file system invocations in both cases and found that it was at least an order of magnitude smaller in the default configuration than in the restricted one.
This tells us that WiredTiger already batches file system operations. Investigating whether batching can be optimized for further performance gains is planned for the future.

Performance

To measure the impact of our changes, we compared the performance of the mmap branch and the develop branch on the WiredTiger benchmark suite WTPERF. WTPERF is a configurable benchmarking tool that can emulate various data layouts, schemas, and access patterns while supporting all kinds of database configurations. Out of 65 workloads, the mmap branch improved performance for 19. Performance of the remaining workloads either remained unchanged or showed insignificant changes (within two standard deviations of the average). Variance in the performance of two workloads (those that update a log-structured merge tree) increased by a few percent, but apart from these, we did not observe any downsides to using mmap.

The figures below show the performance improvement, in percent, of the mmap branch relative to develop for the 19 benchmarks where mmap made a difference. The experiments were run on a system with an Intel Xeon E5-2620 v4 processor (eight cores), 64GB of RAM, and an Intel Pro 6000p series 512GB SSD drive. We used default settings for all the benchmarks and ran each at least three times to ensure the results were statistically significant.

All but 2 of the benchmarks where mmap made a difference show significant improvements

Overall, there are substantial performance improvements for these workloads, but there are a couple of interesting exceptions. For 500m-btree-50r50u and update-btree, some operations (e.g., updates or inserts) are a bit slower with mmap, but others (typically reads) are substantially faster. It appears that some operations benefit from mmap at the expense of others; we are still investigating why this happens. One of the variables that explains improved performance with mmap is an increased rate of I/O.
For example, for the 500m-btree-50r50u workload (which simulates a typical MongoDB load), the read I/O rate is about 30% higher with mmap than with system calls. This statistic does not explain everything: after all, read throughput for this workload is 63% better with mmap than with system calls. Most likely, the rest of the difference is due to the more efficient code paths of memory-mapped I/O (as opposed to going through system calls), as observed in earlier work [8]. Indeed, we typically observe higher CPU utilization when using mmap.

Conclusion

Throughput and latency of storage devices improve at a higher rate than CPU speed thanks to radical innovations in storage technology and in the placement of devices in the system. Faster storage devices reveal inefficiencies in the software stack. In our work we focused on the overhead related to system calls and file system access and showed how it can be navigated by employing memory-mapped I/O. Our changes in the WiredTiger storage engine yielded up to a 63% improvement in read throughput. For more information on our implementation, we encourage you to take a look at the files os_fs.c and os_fallocate.c in the os_posix directory of the WiredTiger develop branch.

References

[1] List of Intel SSDs. https://en.wikipedia.org/wiki/List_of_Intel_SSDs
[2] Optane DC Persistent Memory. https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-dc-persistent-memory.html
[3] Linux Storage System Analysis for e.MMC with Command Queuing. https://www.micron.com/-/media/client/global/documents/products/white-paper/linux_storage_system_analysis_emmc_command_queuing.pdf?la=en
[4] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). http://cseweb.ucsd.edu/~juk146/papers/ASPLOS2019-APP.pdf
[5] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. 14th USENIX Conference on File and Storage Technologies (FAST '16). https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf
[6] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf
[7] SSD vs HDD. https://www.enterprisestorageforum.com/storage-hardware/ssd-vs-hdd.html
[8] Why mmap is faster than system calls. https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37
[9] Paul McKenney. What is RCU, Fundamentally? https://lwn.net/Articles/262464/

March 16, 2020
Engineering Blog

Transpiling Between Any Programming Languages

Input is converted via an ANTLR parse tree and code generation to output

MongoDB Compass, the UI for MongoDB, recently introduced a pair of features to make it easier for developers to work in their chosen language. Users can now export the queries and aggregations they build in the UI to their preferred language, and soon they will also be able to input them in their preferred language. Allowing developers the flexibility to choose among multiple input and output languages in Compass required us to build a custom solution in the form of a many-to-many transpiler. Most compilers are one-to-one or, less commonly, one-to-many or many-to-one; there are hardly any many-to-many transpilers. To avoid starting from scratch, we leveraged the open source parsing tool ANTLR, which provided us with a set of compiler tools along with preexisting grammars for the languages we needed. We minimized the additional complexity by devising a creative set of class hierarchies that reduced the amount of work needed from n² to n.

Motivation

MongoDB Compass is an application that provides a UI for the database and helps developers iteratively develop aggregations and queries. When building queries, the application currently requires input in a JavaScript-based query language called the MongoDB Shell.

Compass aggregation pipeline builder

To enable developers to use their preferred programming language when developing aggregation pipelines and queries, we wanted to add functionality in two parts. First, we wanted to allow developers familiar with the MongoDB Shell to export queries they create into the language they need (Python, Java, etc.). Second, we wanted to allow developers to use their language of choice while building a query. To achieve both and give users maximum flexibility, our system needed to accept multiple input languages as well as generate multiple output languages in an efficient way.
Compass Export to Language allows you to export a pipeline in the language of your choice

At the basis of these features is sophisticated compiler technology in the form of a transpiler. A transpiler is a source-to-source compiler: it takes the source code of a program written in one programming language as its input and produces the equivalent source code in another programming language. Since our transpiler currently supports MongoDB Extended JSON, the textual representation of BSON types, we call it a BSON transpiler. While we currently support only a subset of each programming language, the transpiler is designed in a way that would allow us to extend support to the entire language syntax.

Design Approach

The Compass application is designed with an extensible plugin architecture, allowing us to build the transpiler as a standalone plugin. To work with the Electron framework our application is based on, our plugin needed to be executable in JavaScript. We considered many existing transpilers written in JavaScript. However, our use case required any-language-to-any-language transformation with support for BSON, which meant we needed a custom solution. Compass queries always take the form of either an array of BSON documents (stages for the aggregation pipeline) or a single BSON document (for other queries) containing the MongoDB query language. While this constraint reduces the scope of the problem for the BSON transpiler, the language subset we need to support is large and complex enough that we decided to treat the problem as if we were adding full-language support. The naive approach to building a source-to-source compiler supporting multiple languages would require a polynomial amount of effort, since the number of language combinations is the product of the number of input and output languages. We needed a sustainable design in which adding a new input or output language only requires building O(1) components per language.
This reduces the entire problem to O(n) for n languages. We achieved this by abstracting the problem into independent input and output stages that are loosely coupled by their interface to a shared, intermediate, in-memory data structure: a parse tree. The input language stage just needs to build the tree, and the output language stage just needs to read it. Most compilers have two primary stages: parsing and code generation. The parsing stage is responsible for turning the literal text of a program into a tree representing an abstraction of its meaning, and the code generation stage walks that tree and produces an output that can be executed -- generally a binary of machine or virtual machine instructions. A key observation is that a source-to-source compiler can be seen as a specialized compiler in which the code generation stage generates program text, in another user-friendly language, instead of machine code. The design of our transpiler stems from that concept.

Parsing

In order to process some source code, such as the string new NumberDecimal(5), a lexical analyzer, or lexer, takes the raw code and splits it into tokens (this process is known as lexical analysis). A token is an object that represents a block of text corresponding to one of the primitive components of the language syntax: a number, a label, punctuation, an operator, and so on. In the parsing stage these tokens are then transformed into a tree structure that describes not only isolated pieces of the input code but also their relationship to each other. At this point the compiler is able to recognise language constructs such as variable declarations, statements, and expressions. The leaves of this tree are the tokens found by the lexical analysis. When the leaves are read from left to right, the sequence is the same as in the input text.
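As an intuition for what the lexer does with new NumberDecimal(5), here is a toy hand-rolled lexer. It is a sketch only: the token names and rules are made up and far simpler than what ANTLR's generated lexers do:

```javascript
// A toy lexer: scans the input left to right, emitting a token for each
// primitive syntax element and skipping whitespace.
function lex(input) {
  const tokenSpecs = [
    ['NUMBER', /^\d+(\.\d+)?/],
    ['IDENT', /^[A-Za-z_$][A-Za-z0-9_$]*/],
    ['LPAREN', /^\(/],
    ['RPAREN', /^\)/],
    ['WS', /^\s+/]
  ];
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const match = tokenSpecs
      .map(([type, re]) => [type, rest.match(re)])
      .find(([, m]) => m !== null);
    if (!match) throw new Error(`Unexpected character: ${rest[0]}`);
    const [type, m] = match;
    if (type !== 'WS') tokens.push({ type, text: m[0] });
    rest = rest.slice(m[0].length);
  }
  return tokens;
}

lex('new NumberDecimal(5)');
// [ IDENT 'new', IDENT 'NumberDecimal', LPAREN '(', NUMBER '5', RPAREN ')' ]
```

The parser's job is then to arrange this flat token sequence into the tree described above.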
Stages of compiler processing: input is transformed via lexical analysis into tokens, tokens are transformed via syntax analysis into an AST, which is used to generate the output code

We did not want to write our own parser and lexer, since doing so is incredibly time consuming even for a single language, and we have to support several. Luckily, there are many "parser-generator" tools that efficiently generate syntax trees from a set of rules, called a grammar. These tools take an input grammar, which is hierarchical and highly structured, parse an input string based on that grammar, and convert it to a tree structure. The tricky part of using parser-generators is the tedious and error-prone process of writing the grammars. Writing a grammar from scratch requires detailed knowledge of the input language, with all of its edge cases. If the transpiler needs to support many programming languages, writing grammars for each of the input languages would be a huge task.

Source-to-source transformation with ANTLR

This is why we decided to use ANTLR, a powerful parser-generator that, most importantly, already has grammars for almost all programming languages of interest. ANTLR also has a JavaScript runtime, allowing us to use it in our Node.js project. We considered using LLVM IR, a different set of compiler technologies that compile to an intermediate high-level representation. This approach would then need a separate step to compile the intermediate representation into the target language. This is a common pattern for multi-platform compilers, like the Clang/LLVM project. Unfortunately, there are not many existing compilers that go from the intermediate representation back to user programming languages. We would have had to write those compilers ourselves, so ultimately using LLVM would not have saved us much effort.
The code snippet below illustrates the basic structure of a program that builds a parse tree for ECMAScript (JavaScript) input source code. This code imports auxiliary lexer and parser files and lets ANTLR pull characters from the input string, create a character stream, convert it to a token stream, and finally build a parse tree.

```javascript
// It takes only a few lines to go from a string input to a fully parsed traversable tree!
const antlr4 = require('antlr4');
const ECMAScriptLexer = require('./lib/antlr/ECMAScriptLexer.js');
const ECMAScriptParser = require('./lib/antlr/ECMAScriptParser.js');

const input = 'new NumberDecimal(5)';
const chars = new antlr4.InputStream(input);
const lexer = new ECMAScriptLexer.ECMAScriptLexer(chars);
const tokens = new antlr4.CommonTokenStream(lexer);
const parser = new ECMAScriptParser.ECMAScriptParser(tokens);
const tree = parser.program();
```

The resulting parse tree inherits from the ANTLR-defined ParseTree class, giving it a uniform way to be traversed. Note that the parsing stage and the resulting parse tree are determined by the input language; they are completely independent of the stage where we generate the output language into which we seek to translate the source code. This independence in our design allows us to reduce the number of parts we need to write to cover our input and output languages from O(n²) to O(n).

Code Generation

Tree types

Using ANTLR for its library of pre-built grammars requires a slight compromise in our design. To understand why, it is necessary to understand the difference between two terms that are related and sometimes used interchangeably: a parse tree and an abstract syntax tree (AST). Conceptually these trees are similar, because they both represent the syntax of a snippet of source code; the difference is the level of abstraction. An AST has been fully abstracted to the point that no information about the input tokens themselves remains.
Because of this, ASTs representing the same instructions are indistinguishable, regardless of what language produced them. By contrast, a parse tree contains the information about the low-level input tokens, so different languages will produce different parse trees, even if they do the same thing.

Abstract syntax tree and parse tree comparison given an input of "new NumberDecimal(5)"

Ideally, our code generating stage would operate on an AST, not a parse tree, because having to account for language-specific parse trees introduces complexity we’d rather avoid. ANTLR4, however, only produces read-only parse trees. But the advantages of using ANTLR and its ready-made grammars are well worth that trade-off.

Visitors

Parse tree traversal

Like most compilers, the BSON transpiler uses a visitor pattern to traverse parse trees. ANTLR not only builds a parse tree but also programmatically generates a skeleton visitor class. This visitor class contains methods for traversing parse trees (one visit method for each type of node in the tree). All of these methods begin with visit and end with the name of the node that they visit, e.g. visitFuncCall() or visitAdditiveExpression(). The node names are taken directly from the input language grammar file, so each visitor class and its methods are tailored to the input language grammar. In the ANTLR-generated visitor class, these methods do nothing except recurse on the child nodes. In order for our visitor to be able to transpile code, we need to subclass the generated visitor class and override each visit method to define what to do with each type of node. Since the BSON transpiler is built to support multiple input languages, and each language produces a different kind of parse tree, we need to create one custom visitor for each input language supported by Compass.
However, as long as we avoid building a custom visitor for each combination of input and output language, we are still only building O(n) components. With this design, each visitor is responsible for traversing a single language's parse tree. The visitor calls functions as it visits each node and can return the original text of the node or transform it as needed. Starting from the root, the visitor calls its visit method recursively, descending to the leaves in depth-first order. On the way down, the visitor decorates nodes with metadata, such as type information. On the way up, it returns the transpiled code.

Generators

With a brute-force solution, the visit* methods of the visitor would contain the code for generating the output language text. To generate multiple output languages, we would have to specialize each method depending on the current output language. Overall, this approach would subclass every language-specific visitor class once for every output language, or worse yet, put a giant switch statement in each of the visit* methods with a case for each output language. Both of those options are brittle and require O(n²) development effort. Therefore we chose to decouple the code for traversing the language-specific trees from the code for generating output. We accomplished this by encapsulating code generation for each language into a set of classes called Generators, which implement a family of emit* methods, like emitDate and emitNumber, used to produce the output code.

Class composition

Class dependency diagram

Our design was informed by the need for the visitor to be able to call generator methods without knowing which generator it is using. Since code generation actually has a lot in common regardless of the output language, we wanted to implement a system where we could abstract the default behavior as much as possible and leave the generator to handle only edge cases.
We chose to make use of JavaScript’s dynamic mechanisms for inheritance and method dispatch by having the generator class inherit from the visitor class. Because JavaScript does not require that methods be defined before they are called, the visitor can call emit methods on itself that are actually defined in the generator, and the generator can call visitor methods to continue the tree traversal. Using a generator class determined by the output language and a visitor class determined by the input language, we are able to compose a transpiler on the fly as it is exported. Generators are similar to an abstract interface, except that there are no classic interfaces in JavaScript. As illustrated in the code snippet below, for each language combination our application creates a specialized transpiler instance composed of the corresponding visitor and generator classes. When our application receives a piece of code from the user, it creates a parse tree. The transpiler then visits the parse tree, using the ParseTreeVisitor’s visit method inherited from our custom Visitor subclass and the language-specific, ANTLR-generated visitor class (such as ECMAScriptVisitor).

```javascript
// Each composite transpiler instance has the ability to traverse the parse tree
// for a specific language with its 'visit*' methods, and generate output code for
// another language with its 'emit*' methods.
const getJavascriptVisitor = require('./codegeneration/javascript/Visitor');
const getJavaGenerator = require('./codegeneration/java/Generator');
const getPythonGenerator = require('./codegeneration/python/Generator');
...
const loadJSTree = (input) => { /* Lexing and parsing the user input */ ... };

/**
 * Compose a transpiler and return a compile method that will use that transpiler
 * to visit the tree and return generated code.
 *
 * @param {function} loadTree - the method that takes in user input and returns a tree.
 * @param {Visitor} visitor - the input-language visitor.
 * @param {function} generator - returns a generator that inherits from its arg.
 *
 * @returns {function} the compile function to be exported
 */
const composeTranspiler = (loadTree, visitor, generator) => {
  const Transpiler = generator(visitor);
  const transpiler = new Transpiler();
  return {
    compile: (input) => {
      const tree = loadTree(input);
      return transpiler.start(tree);
    }
  };
};

module.exports = {
  javascript: {
    java: composeTranspiler(
      loadJSTree,
      getJavascriptVisitor(JavascriptANTLRVisitor), // Visitor + ANTLR visitor
      getJavaGenerator // Method that takes in a superclass, i.e. the visitor
    ),
    python: composeTranspiler(
      loadJSTree,
      getJavascriptVisitor(JavascriptANTLRVisitor),
      getPythonGenerator
    ),
    ...
  },
  ...
}
```

Tree Traversal Example

Simple Nodes

For the most straightforward case, consider the JavaScript snippet "hello world". The first thing the custom visitor class needs to do is specify the entry point for the tree traversal. Since the entry nodes in different languages have different names (i.e. file_input in Python, but program in JavaScript), we define a method in each visitor called start that calls the visit method for the root node of that input language. That way our compiler can simply call start on each visitor without having to worry what the root node is called.

```javascript
// Entry point for the tree traversal
class Visitor extends ECMAScriptVisitor {
  start(ctx) {
    return this.visitProgram(ctx);
  }
}
```

The default behavior of the ANTLR visit methods is to recurse on each child node and return the results in an array. If the node doesn’t have any children, then the visit method returns the node itself. So if we did not overwrite any of the ANTLR methods, the return value of our start method would be an array of nodes. To go from returning nodes to returning strings in our simple "hello world" example, we first overwrite the visitTerminal method so that the leaf nodes return the raw text stored in the node.
We then modify the visitChildren method so that instead of putting the results of visiting each child node into an array, the results get concatenated into a string. Those two changes are enough for our "hello world" example to be fully translated into a language that uses the same string representation, like Python.

```javascript
// Overwriting of 'visitTerminal()' method
class Visitor extends ECMAScriptVisitor {
  start(ctx) {
    return this.visitProgram(ctx);
  }
  // Visits a leaf node and returns a string
  visitTerminal(ctx) {
    return ctx.getText();
  }
  // Concatenate the results of recurring on child nodes
  visitChildren(ctx) {
    return ctx.children.reduce(
      (code, child) => `${code} ${this.visit(child)}`,
      ''
    );
  }
}
```

Transformations

However, we cannot always just concatenate the text values of the terminal nodes to form the result. We need to transform floating point numbers, as well as numbers in different numeral systems, without losing any precision. For string literals we need to think about single and double quotes, escape sequences, comments, spaces, and empty lines. This type of transformation logic can be applied to any type of node. Let’s look at a concrete example: in Python, an object property name must be enclosed in quotes ( {'hello': 'world'} ); in JavaScript this is optional ( {hello: 'world'} ). In this particular case, this is the only modification we need in order to transform a fragment of JavaScript code into Python code.

```javascript
// Transformation of JavaScript code into Python code
class Visitor extends ECMAScriptVisitor {
  ...
  visitPropertyExpressionAssignment(ctx) {
    const key = this.visit(ctx.propertyName());
    const value = this.visit(ctx.singleExpression());
    if ('emitPropertyExpressionAssignment' in this) {
      return this.emitPropertyExpressionAssignment(ctx);
    }
    return `${key}: ${value}`;
  }
}
```

The propertyExpressionAssignment node has two child nodes (propertyName and singleExpression).
To get the values of these two child nodes, we need to traverse them separately as left-hand-side and right-hand-side subtrees. Traversing the subtrees returns the original or transformed values of the child nodes. We can then build a new string from the retrieved values to make up the transformed code fragment. Instead of doing this in the visitor directly, we check whether a corresponding emit method exists. If the visitor finds a proper emit method, it delegates the transformation process to the generator class. By doing this we free our visitors from knowing anything about the output language: we just assume that there is some generator class that knows how to handle the output language. If the emit method doesn’t exist, the visitor returns the original string without any transformation. In our case we assume an emitPropertyExpressionAssignment was supplied, and it will return the transformed string.

Processing

In more complex cases, we must do some preprocessing in the visitor before we can call any emit methods. For example, date representations are a complex case, because dates have a wide range of acceptable argument formats across different programming languages. We need to do some preprocessing in the visitor so that all the emit methods are sent the same information, regardless of input language. In the case of a date node, the easiest way to represent date information is to construct a JavaScript Date object and pass it to the generator. Node types that need preprocessing have a process* method defined in the visitor to handle it; for this example it is called processDate.
```javascript
// 'processDate()' creates a date object and passes it to the emit method
processDate(node) {
  let text = node.getText(); // Original input text for this node
  let date;
  try {
    date = this.executeJavascript(text); // Construct a date object in a sandbox
  } catch (error) {
    throw new BsonTranspilersError(error.message);
  }
  if ('emitDate' in this) {
    return this.emitDate(node, date);
  }
  ...
}
```

For this processDate method, since we are compiling JavaScript and the transpiler is written in JavaScript, we took a shortcut: executing the user's input to construct the Date. Because the input has already been tokenized, we know exactly what the code contains, so it is safe to execute in a sandbox. For processing dates in other languages we would instead parse the arguments and construct the date object from them. Upon completion, the process method calls the respective emit* method, emitDate, and passes it the constructed Date as an argument. Now we can call the required process and emit methods from the visitor’s appropriate visit method.

```javascript
// This is a generator that generates code for Python.
// The 'emitDate()' method is defined in the Generator and called from the Visitor.
module.exports = (superClass) => class Generator extends superClass {
  emitDate(node, date) {
    const dateStr = [
      date.getUTCFullYear(),
      date.getUTCMonth() + 1,
      date.getUTCDate(),
      date.getUTCHours(),
      date.getUTCMinutes(),
      date.getUTCSeconds()
    ].join(', ');
    return `datetime.datetime(${dateStr}, tzinfo=datetime.timezone.utc)`;
  }
};
```

Given the input string Date('2019-02-12T11:31:14.828Z'), the root of the parse tree will be a FuncCallExpression node. The visit method for this node is called visitFuncCallExpression().

```javascript
// Example of a Visitor class that extends the ECMAScript grammar
class Visitor extends ECMAScriptVisitor {
  /**
   * Visits a node that represents a function call.
   *
   * @param {FuncCallExpression} node - The tree node
   * @return {String} - The generated code
   */
  visitFuncCallExpression(node) {
    const lhs = this.visit(node.functionName());
    const rhs = this.visit(node.arguments());
    if (`process${lhs}` in this) {
      return this[`process${lhs}`](node);
    }
    if (`emit${lhs}` in this) {
      return this[`emit${lhs}`](node);
    }
    return `${lhs}${rhs}`;
  }
  ...
}
```

The first thing the visit method does is recurse on its two child nodes. The left-hand child represents the function name node, i.e. Date. The right-hand child represents the arguments node, i.e. ('2019-02-12T11:31:14.828Z'). Once the method retrieves the name of the function, it can check whether that function requires any preprocessing: it checks if the processDate method is defined and, failing that check, whether an emitDate method is defined. Even though the emitDate method is defined in the generator, since the visitor and generator are composed into one class, the visitor treats emit methods as if they were its own class methods. If neither method exists, the visit* method returns a concatenation of the results of the recursion on the child nodes. Every input language has its own visitor that can contain processing logic, and every output language has its own generator that contains the required transformation logic for that specific language. As a rule, transformations required by all output languages happen as processing logic, while all other transformations happen in the generator. With this design, different transpilers based on different visitors can use the same generator methods. For every input language we add, we only need to define a single visitor; similarly, for every output language we add, we only need to define a single generator. For n languages we want to support, we now have O(n) amount of work instead of having to write one visitor-generator for every language combination.
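The whole pattern in this section can be condensed into a runnable toy. The classes below are simplified stand-ins for the real ANTLR-generated classes, not the actual bson-transpilers implementation: the "node" is a plain object, and the emit method receives the already-visited key and value rather than a parse-tree context. It shows the composition trick, where the visitor checks for an emit* method that only exists once a generator mixin has been applied:

```javascript
// Toy visitor over a single-property object literal such as {hello: 'world'}.
class Visitor {
  start(node) {
    return `{${this.visitPropertyExpressionAssignment(node)}}`;
  }
  visitPropertyExpressionAssignment(node) {
    const key = node.propertyName;
    const value = node.singleExpression;
    // Delegate to the generator if it defines a matching emit* method;
    // 'in' finds it through the prototype chain of the composed class.
    if ('emitPropertyExpressionAssignment' in this) {
      return this.emitPropertyExpressionAssignment(key, value);
    }
    // Fall through: echo the input (JavaScript) syntax unchanged.
    return `${key}: ${value}`;
  }
}

// Generator mixin for Python output: property names must be quoted.
const getPythonGenerator = (superClass) => class extends superClass {
  emitPropertyExpressionAssignment(key, value) {
    return `'${key}': ${value}`;
  }
};

// Compose a transpiler on the fly, as composeTranspiler does above.
const PythonTranspiler = getPythonGenerator(Visitor);
const node = { propertyName: 'hello', singleExpression: "'world'" };

new Visitor().start(node);           // "{hello: 'world'}"
new PythonTranspiler().start(node);  // "{'hello': 'world'}"
```

A generator for another output language would be just one more small mixin; the visitor never changes.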
Conclusion

The Compass BSON transpiler plugin has the potential to parse and generate MongoDB queries and aggregations in any programming language. The current version supports several input (MongoDB Shell, JavaScript, and Python) and output (Java, C#, Python, MongoDB Shell, and JavaScript) languages. The BSON transpiler plugin is built as a standalone Node.js module and can be used in any browser-based or Node.js application with npm install bson-transpilers. Like many other MongoDB projects, the BSON transpiler plugin is open source: you can go to the repo, and we welcome contributions. If you want to contribute to the Compass BSON transpiler, please check our contributing section on GitHub. When writing the BSON transpiler, we were guided by general compiler design principles (lexical analysis, syntax analysis, tree traversal). We used ANTLR to reduce the amount of manual work required to parse the input languages of interest, which allowed us to focus mostly on modularizing the code generation process. A major benefit of modularizing the language definitions is that a user can contribute a new output language without needing to know anything about the input languages that are currently supported. The same rule applies for adding a new input language: you should be able to define your visitor without needing to care about the existing generators. The latest version of the BSON transpiler plugin is more complex and powerful than what has been covered in this blog post. It supports a wider range of syntax through the use of a symbol table. It also includes the entire BSON library, function calls with arguments and type validation, and informative error messages. On top of that, we have added a high level of optimization by using string templates to abstract a lot of the code generation. All of these developments will be described in a future blog post.

Written by Anna Herlihy, Alena Khineika, & Irina Shestak.
Illustrations by Irina Shestak

Further Reading

- Compiler in JavaScript using ANTLR by Alena Khineika;
- Compiler Construction by Niklaus Wirth;
- Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman;
- The Elements of Computing Systems by Noam Nisan and Shimon Schocken.

If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

July 1, 2019
Engineering Blog

Repeatable Performance Tests: CPU Options Are Best Disabled

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup and found no evidence of particularly good or bad EC2 instances. In the next step, we investigated IO and found that EBS instances are the stable option for us. Having found very stable behavior as far as disks were concerned, this third and final experiment turns to tuning CPU-related knobs to minimize noise from this part of the system.

Investigate CPU Options

We had already built up knowledge around fine-tuning CPU options when setting up another class of performance benchmarks (single node benchmarks). That work had shown us that CPU options can have a large impact on performance, and it left us familiar with a number of knobs and options we could adjust:

| Knob | Where to set | Setting | What it does |
| --- | --- | --- | --- |
| Idle Strategy | Kernel Boot | idle=poll | Puts Linux into a loop when idle, checking for work. |
| Max sleep state (c4 only) | Kernel Boot | intel_idle.max_cstate=1 intel_pstate=disable | Disables the use of advanced processor sleep states. |
| CPU Frequency | Command Line | sudo cpupower frequency-set -d 2.901GHz | Sets a fixed frequency. Doesn't allow the CPU to vary the frequency for power saving. |
| Hyperthreading | Command Line | echo 0 > /sys/devices/system/cpu/cpu$i/online | Disables hyperthreading. Hyperthreading allows two software threads of execution to share one physical CPU. They compete against each other for resources. |

We added some CPU-specific tests to measure CPU variability.
These tests allow us to see if the CPU performance is noisy, independently of whether that noise makes MongoDB performance noisy. For our previous work on CPU options, we wrote some simple tests in our C++ harness that would, for example:

- multiply numbers in a loop (cpu bound)
- sleep 1 or 10 ms in a loop
- do nothing (no-op) in the basic test loop

We added these tests to our System Performance project. We were able to run the tests on the client only, and going across the network. We ran our tests 5x5 times, changing one configuration at a time, and compared the results. The first two graphs below contain results for the CPU-focused benchmarks; the third contains the MongoDB-focused benchmarks. In all the graphs below, we are graphing the "noise" metric as a percentage computed from (max-min)/median, and lower is better. We start with our focused CPU tests, first on the client only, and then connecting to the server. We’ve omitted the sleep tests from the client graphs for readability, as they were essentially 0.

Results for CPU-focused benchmarks with different CPU options enabled

The nop test is the noisiest test all around, which is reasonable because it’s doing nothing in the inner loop. The cpu-bound loop is more interesting. It is low on noise in many cases, but has occasional spikes for each case, except for the case of the c3.8xlarge with all the controls on (pinned to one socket, hyperthreading off, no frequency scaling, idle=poll).

Results for tests run on server with different CPU options enabled

When we connect to an actual server, the tests become more realistic, but also introduce the network as a possible source of noise. In the cases in which we multiply numbers in a loop (cpuloop) or sleep in a loop (sleep), the final c3.8xlarge with all controls enabled is consistently among the lowest noise and doesn’t do badly on the ping case (no-op on the server). Do those results hold when we run our actual tests?
Results for tests run on server with different CPU options enabled

Yes, they do. The right-most blue bar is consistently around 5%, which is a great result! Perhaps unsurprisingly, this is the configuration where we used all of the tuning options: idle=poll, hyperthreading disabled, and only a single socket in use. We continued to compare c4 and c3 instances against each other for these tests. We expected that the c4, being a newer architecture and having more tuning options, would achieve better results. But this was not the case; rather, the c3.8xlarge continued to have the smallest range of noise. Another assumption that was wrong! We expected that write-heavy tests, such as batched inserts, would mostly benefit from the more stable IOPS on our new EBS disks, and that the CPU tuning would mostly affect cpu-bound benchmarks such as map-reduce or index builds. It turns out this was wrong too: for our write-heavy tests, noise did not in fact predominantly come from disk. The tuning available for CPUs has a huge effect on threads that are waiting or sleeping. The performance of threads that are actually running at full speed is less affected - in those cases the CPU runs at full speed as well. Therefore, IO-heavy tests are affected a lot by CPU tuning!

Disabling CPU options in production

Deploying these configurations into production made insert tests even more stable from day to day:

Improvements in daily performance measurements through changing to EBS and disabling CPU options

Note that the absolute performance of some tests actually dropped, because the number of available physical CPUs dropped by half due to only using a single socket, and disabling hyperthreading caused a further drop, though not quite a full half, of course.

Conclusion

Drawing upon prior knowledge, we decided to fine-tune CPU options. We had previously assumed that IO-heavy tests would have a lot of noise coming from disk and that CPU tuning would mostly affect CPU-bound tests.
As it turns out, the tuning available for CPUs actually has a huge effect on threads that are waiting or sleeping, and therefore has a huge effect on IO-heavy tests. Through CPU tuning, we achieved very repeatable results. The overall measured performance in the tests decreases, but this is less important to us: we care about stable, repeatable results more than maximum performance. This is the third and last of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EC2 instances are neither good nor bad and that EBS instances are the stable option. If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

April 30, 2019
Engineering Blog

Reducing Variability in Performance Tests on EC2: Setup and Key Results

On the MongoDB Performance team, we use EC2 to run daily system performance tests. After building a continuous integration system for performance testing, we realized that there were sources of random variation in our platform and system configuration which made a lot of our results non-reproducible. The run-to-run variation from the platform was bigger than the changes in MongoDB performance that we wanted to capture. To reduce such variation - environmental noise - from our test configuration, we set out on a project to measure and control for the EC2 environments on which we run our tests.

At the outset of the project there was a lot of doubt and uncertainty. Maybe using a public cloud for performance tests is a bad idea and we should just give up and buy more hardware to run them ourselves? We were open to that possibility; however, we wanted to do our due diligence before taking on the cost and complexity of owning and managing our own test cluster.

Performance benchmarks in continuous integration

MongoDB uses a CI platform called Evergreen to run tests on incoming commits. We also use Evergreen for running multiple classes of daily performance tests. In this project we are focused on our highest-level tests, meant to represent actual end-user performance. We call these tests System Performance tests.

For System Performance tests, we use EC2 to deploy real and relatively beefy clusters of c3.8xlarge nodes for various MongoDB clusters: standalone servers, 3-node replica sets, and sharded clusters. These are intended to be representative of how customers run MongoDB. Using EC2 allows us to flexibly and efficiently deploy such large clusters as needed. Each MongoDB node in the cluster is run on its own EC2 node, and the workload is driven from another EC2 node.

Repeatability

There's an aspect of performance testing that is not obvious and often ignored.
Most benchmarking blogs and reports focus on the maximum performance of a system, or on whether it is faster than some competitor system. For our CI testing purposes, we primarily care about repeatability of the benchmarks: the same set of tests for the same version of MongoDB on the same hardware should produce the same results whether run today or in a few months. We want to be able to detect small changes in performance due to our ongoing development of MongoDB. A customer might not get very upset about a 5% change in performance, but they will get upset about multiple 5% regressions adding up to a 20% regression. The easiest way to avoid large regressions is to identify and address the small regressions promptly as they happen, and stop them from reaching releases or release candidates. We do want to stress MongoDB with a heavy load, but achieving some kind of maximum performance is completely secondary to this test suite's goal of detecting changes. For some of our tests, repeatability wasn't looking so good. In the below graph, each dot represents a daily build (spoiler -- you'll see this graph again):

Variability in daily performance tests

Eyeballing the range from highest to lowest result, the difference is over 100,000 documents/second from day to day - or, as a percentage, a 20-30% range.

Investigation

To reduce such variation from our test configuration, we set out on a project to reduce any environmental noise. Instead of focusing on the difference between daily MongoDB builds, we ran tests to focus on EC2 itself.

Process: Test and Analyze

Benchmarking is really an exercise in the basic scientific process:

- Try to understand a real-world phenomenon, such as an application that uses MongoDB
- Create a model (aka benchmark) of that phenomenon (this may include setting a goal, like "more updates/sec")
- Measure
- Analyze and learn from the results
- Repeat: do you get the same result when running the benchmark / measuring again?
- Change one variable (based on analysis) and repeat from above

We applied this benchmarking process to evaluate the noise in our system. Our tests produce metrics measuring the average operations per second (ops/sec). Occasionally we also record other values, but generally we use ops/sec as our result. To limit other variables, we locked the mongod binary to a stable release (3.4.1) and repeated each test 5 times on 5 different EC2 clusters, thus producing 25 data points.

We used this system to run repeated experiments. We started with the existing system and considered our assumptions to create a list of potential tests that could help us determine what to do to decrease the variability in the system. Whenever we weren't happy with the results, we returned to this list and picked the most promising feature to test. We created focused tests to isolate the specific feature, ran the tests, and analyzed our findings. Any workable solutions we found were then put into production. For each test, we analyzed the 25 data points, with the goal of finding a configuration that minimizes this single metric:

range = (max - min) / median

Being able to state your goal as a single variable such as the above is very powerful. Our project now becomes a straightforward optimization process of trying different configurations in order to arrive at the minimum value for this variable. It's also useful that the metric is a percentage, rather than an absolute value. In practice, we wanted to be able to run all our tests so that the range would always stay below 10%. Note that the metric we chose to focus on is more ambitious than, for example, focusing on reducing variance. Variance would help minimize the spread of most test results while being fairly forgiving about one or two outliers. For our use case, an outlier represents a false regression alert, so we wanted to find a solution without any outliers at all, if possible.
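As a sketch, the range metric can be computed in a few lines of Python; the ops/sec numbers below are invented for illustration, not real measurements:

```python
def noise_range(results):
    """(max - min) / median, expressed as a fraction of the median."""
    s = sorted(results)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return (s[-1] - s[0]) / median

# Five hypothetical ops/sec results for one test on one cluster:
ops_per_sec = [98_000, 101_500, 99_800, 102_300, 100_100]
print(f"range = {noise_range(ops_per_sec):.1%}")  # comfortably under the 10% target
```

In production the list would hold all 25 data points (5 trials on each of 5 clusters), and a configuration "wins" if this number stays small across every test.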
Any experiment of this form has a tension between the accuracy of the statistics and the expense (time and money) of running the trials. We would have loved to collect many more trials per cluster, and more distinct clusters per experiment, giving us higher confidence in our results and enabling more advanced statistics. However, we also work for a company that needed the business impact of this project (lower noise) as soon as possible. We felt that 5 trials per cluster times 5 clusters per experiment gave us sufficient data fidelity at a reasonable cost.

Assume nothing. Measure everything.

The experimental framework described above can be summarized in the credo: assume nothing, measure everything.

In the spirit of intellectual honesty, we admit that we have not always followed this credo, usually to our detriment. We definitely did not follow it when we initially built the System Performance test suite. We needed the test suite up as soon as possible (preferably yesterday). Instead of testing everything, we made a best effort to stitch together a useful system based on intuition and previous experience, and put it into production. It's not unreasonable to throw things together quickly in a time of need (or as a prototype). However, when you (or we) do so, you should check whether the end results meet your needs, and take the results with a large grain of salt until thoroughly verified. Our system gave us results. Sometimes those results pointed us at useful things, and other times they sent us off on wild goose chases.

Existing Assumptions

We made a lot of assumptions when getting the first version of the System Performance test suite up and running.
We will look into each of these in more detail later, but here is the list of assumptions that were built into the first version of our System Performance environment:

- A dedicated instance means more stable performance
- Placement groups minimize network latency & variance
- Different availability zones have different hardware
- For write-heavy tests, noise predominantly comes from disk
- Ephemeral (SSD) disks have the least variance
- Remote EBS disks have unreliable performance
- There are good and bad EC2 instances

In addition, the following suggestions were proposed as solutions to reducing noise in the system:

- Just use i2 instances (better SSD) and be done with it
- Migrate everything to Google Cloud
- Run on-prem -- you'll never get acceptable results in the cloud

Results

After weeks of diligently executing the scientific process of hypothesize - measure - analyze - repeat, we found a configuration where the range of variation when repeating the same test was less than 5%. Most of the configuration changes were normal Linux and hardware configurations that would be needed on on-premise hardware just the same as on EC2. We thus proved one of the biggest hypotheses wrong: "You can't use the cloud for performance testing."

With our first experiment, we found that there was no correlation between test runs and the EC2 instances they were run on. Please note that these results could be specific to our usage of the instance type; you should measure your own systems to figure out the best configuration for your own system. You can read more about the specific experiment and its analysis in our blog post EC2 instances are neither good nor bad. (This disproved the assumption that there are good and bad EC2 instances.)

After running the first baseline tests, we decided to investigate IO performance. Using EC2, we found that by using Provisioned IOPS we get a very stable rate of disk I/O per second. To us, it was surprising that ephemeral (SSD) disks were essentially the worst choice.
After switching our production configuration from ephemeral SSD to EBS disks, the variation of our test results decreased dramatically. You can read more about our specific findings and how different instance types performed in our dedicated blog post EBS instances are the stable option. This disproved or qualified several more of the original assumptions:

- Ephemeral (SSD) disks have the least variance: disproved
- Remote EBS disks have unreliable performance: not with PIOPS
- Just use i2 instances (better SSD) and be done with it: true in theory

Next, we turned our attention to CPU tuning. We learned that disabling CPU options does not only stabilize CPU-bound performance results; in fact, noise in IO-heavy tests also seems to go down significantly with CPU tuning. (This disproved the assumption that, for write-heavy tests, noise predominantly comes from disk.) After we disabled CPU options, the variance in performance decreased again. In the below graph you can see how changing from SSD to EBS and disabling CPU options reduced the performance variability of our test suite. You can read more about the CPU options we tuned in our blog post Disable CPU options.

Improvements in daily performance measurements through changing to EBS and disabling CPU options

At the end of the project we hadn't tested all of our original assumptions, but we had tested many of them. We still plan to test the remaining ones when time and priority allow:

- A dedicated instance means more stable performance
- Placement groups minimize network latency & variance
- Different availability zones have different hardware

Through this process we also found that the previously suggested solutions would not have solved our pains either:

- Just use i2 instances (better SSD) and be done with it: true in theory
- Migrate everything to Google Cloud: not tested!

Conclusion of the tests

In the end, there was still noise in the system, but we had reduced it sufficiently that our System Performance tests were now delivering real business value to the company.
Every bit of noise bothers us, but at the end of the day we got to a level of repeatability at which test noise was no longer our most important performance-related problem. As such, we stopped the all-out effort on reducing system noise at this point.

Adding in safeguards

Before we fully moved on to other projects, we wanted to make sure to put up some safeguards for the future. We invested a lot of effort into reducing the noise, and we didn't want to discover some day in the future that things had changed and our system was noisy again. Just like we want to detect changes in the performance of MongoDB software, we also want to detect changes in the reliability of our test platform. As part of our experiments, we built several canary benchmarks which give us insights into EC2 performance itself, based on non-MongoDB performance tests. We decided to keep these tests and run them as part of every Evergreen task, together with the actual MongoDB benchmark that the task is running. If a MongoDB benchmark shows a regression, we can check whether a similar regression can be seen in any of the canary benchmarks. If yes, then we can just rerun the task and check again. If not, it's probably an actual MongoDB regression.

If the canary benchmarks do show a performance drop, it is possible that the vendor has deployed upgrades or configuration changes. Of course, in the public cloud this can happen at arbitrary times, and possibly without the customers ever knowing. In our experience such changes are infrequently the cause of performance changes, but running a suite of "canary tests" gives us visibility into the day-to-day performance of the EC2 system components themselves, and thus increases confidence in our benchmark results. The canary tests give us an indication of whether we can trust a given set of test results, and enable us to clean up our data.
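The triage logic described above can be sketched in a few lines. The function name, the -5% threshold, and the deltas are all illustrative assumptions, not our production values:

```python
def classify_regression(mongo_delta_pct, canary_deltas_pct, threshold=-5.0):
    """Deltas are percent change vs. baseline; negative means slower.

    If the MongoDB benchmark regressed but a canary regressed too,
    suspect the platform and rerun; otherwise treat it as a real regression.
    """
    if mongo_delta_pct > threshold:
        return "no regression"
    if any(d <= threshold for d in canary_deltas_pct):
        return "platform noise: rerun the task"
    return "likely MongoDB regression"

print(classify_regression(-12.0, [-1.3, 0.4, -2.1]))    # canaries healthy
print(classify_regression(-12.0, [-30.0, -28.5, -1.0])) # canaries dropped too
```

The first call flags a likely MongoDB regression; the second suggests rerunning, since the canaries show the platform itself slowed down.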
Most importantly, we no longer need to debate whether it is possible to run performance benchmarks in a public cloud, because we measure EC2 itself!

Looking forward

This work was completed over 1.5 years ago. Since that time it has provided the foundation that all our subsequent and future work has been built upon. It has led to three major trends:

1. We use the results. Because we lowered the noise enough, we are able to regularly detect performance changes, diagnose them, and address them promptly. Additionally, developers are also "patch testing" their changes against System Performance now. That is, they are using System Performance to test the performance of their changes before they commit them, and address any performance changes before committing their code. Not only have we avoided regressions entering into our stable releases; in these cases we've avoided performance regressions ever making it into the code base (master branch).
2. We've added more tests. Since we find our performance tests more useful, we naturally want more such tests, and we have been adding more to our system. In addition to our core performance team, the core database developers also have been steadily adding more tests. As our system became more reliable and therefore more useful, the motivation to create tests across the entire organization has increased. We now have the entire organization contributing to the performance coverage.
3. We've been able to extend the system. Given the value the company gets from the system, we've invested in extending it. This includes adding more automation, new workload tools, and more logic for detecting performance changes. None of that would have been feasible or worthwhile without lowering the noise of the System Performance tests to a reasonable level. We look forward to sharing more about these extensions in the future.
Coda: Spectre/Meltdown

As we came back from the 2018 New Year's holidays, just like everyone else we got to read the news about the Meltdown and Spectre security vulnerabilities. Then, on January 4, all of our tests went red! Did someone make a bad commit into MongoDB, or is it possible that Amazon had deployed a security update with a performance impact? It turned out that one of our canary tests - the one sensitive to CPU and networking overhead - had caught the 30% drop too! Later, on January 13, performance recovered. Did Amazon undo the fixes? We believe so, but have not heard it confirmed.

Performance drops on January 4th and bounces back on January 13th

The single spike just before January 13 is a rerun of an old commit. This confirms the conclusion that the change in performance comes from the system: running a January 11 build of MongoDB after January 13 results in higher performance. The results therefore depend on the date the test was run, rather than on which commit was tested. As the world was scrambling to assess the performance implications of the necessary fixes, we could just sit back and watch them in our graphs. Getting on top of EC2 performance variations has truly paid off.

Update: @msw pointed us to this security bulletin, confirming that one of the Intel microcode updates was indeed reverted on January 13.

April 30, 2019
Engineering Blog

Repeatable Performance Tests: EC2 Instances are Neither Good Nor Bad

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup.

Investigate the status quo

Our first experiment created a lot of data that we sifted through with many graphs. We started with graphs of simple statistics from repeated experiments: the minimum, median, and maximum result for each of our existing tests. The graphs allowed us to concisely see the range of variation per test, and which tests were noisy. Here is a representative graph for batched insert tests:

High variance in the performance of batched insert tests

In this graph there are two tests, each run three times at different thread levels (the integer at the end of the test name). The whiskers around the median denote the minimum and maximum results (from the 25-point sample). Looking at this graph, we observe that thread levels for these two tests aren't optimally configured: when running these two tests with 16 and 32 parallel threads, the threads have already saturated MongoDB, and the additional concurrency merely adds noise to the results. We noticed other configuration problems in other tests. We didn't touch the test configurations during this project, but later, after we had found a good EC2 configuration, we did revisit this issue and reviewed all test configurations to further minimize noise.

Lesson learned: when you don't follow the disciplined approach of "Measure everything. Assume nothing." from the beginning, probably more than one thing has gone wrong.
EC2 instances are neither good nor bad

We looked at the first results in a number of different ways. One way showed us the results from all 25 trials in one view:

Performance results do not correlate with the clusters they are run on

As we examined the results, one very surprising conclusion immediately stood out from the above graph: there are neither good nor bad EC2 instances. When we originally built the system, someone had read somewhere on the internet that on EC2 you can get good and bad instances, noisy neighbours, and so on. There are even tools and scripts you can use to deploy a couple of instances, run some smoke tests, and, if performance results are poor, shut them down and try again. Our system was in fact doing exactly that, and on a bad day would shut down and restart instances for 30 minutes before eventually giving up. (For a cluster with 15 expensive nodes, that can be an expensive 30 minutes!) Until now, this assumption had never been tested. If the assumption had been true, then in the above graph you would expect to see points 1-5 have roughly the same value, followed by a jump to point 6, after which points 6-10 again would have roughly the same value, and so on. However, this is not what we observe. There's a lot of variation in the test results, but ANOVA tests confirm that this variation does not correlate with the clusters the tests are run on. From this data, it appears that there is no evidence to support the claim that there are good and bad EC2 instances.

Note that it is completely possible that this result is specific to our system. For example, we use (and pay extra for) dedicated instances to reduce sources of noise in our benchmarks. It's quite possible that the issue with noisy neighbours is real, and simply doesn't happen to us because we don't have neighbours. The point is: measure your own systems; don't blindly believe stuff you read on the internet.
Conclusion

By measuring our system and analyzing the data in different ways, we were able to disprove one of the assumptions we had made when building our system, namely that there are good and bad EC2 instances. As it turns out, the variance in performance between different test runs does not correlate with the clusters the tests are run on. This is only one of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EBS instances are the stable option and CPU options are best disabled.

April 30, 2019
Engineering Blog

Repeatable Performance Tests: EBS Instances are the Stable Option

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup and found no evidence of particularly good or bad EC2 instances. However, we found that the results of repeated tests had a high variance. Given our test data and our knowledge of the production system, we had observed that many of the noisiest tests did the most IO (being sensitive to either IO latency or bandwidth). After performing the first baseline tests, we therefore decided to focus on IO performance by testing both AWS instance types and the IO configuration on those instances.

Investigate IO

As we are explicitly focusing on IO in this step, we added an IO-specific test (fio) to our system. This allows us to isolate the impact of IO noise on our existing benchmarks. The IO-specific tests focus on:

- Operation latency
- Streaming bandwidth
- Random IOPS (IO operations per second)

We look first at the IO-specific results, and then at our general MongoDB benchmarks. In the below graph, we are plotting the "noise" metric as a percentage computed from (max-min)/median; lower is better. c3.8xlarge with ephemeral storage is our baseline configuration, which we were using in our production environment.

i2.8xlarge shows best results with low noise on throughput and latency

The IO tests show some very interesting results. The c3.8xlarge with EBS PIOPS shows less noise than the c3.8xlarge with its ephemeral disks. This was quite unexpected.
In fact, the c3.8xlarge with ephemeral storage (our existing configuration) is just about the worst choice. The i2.8xlarge looks best all around, with low noise on throughput and latency. The c4.8xlarge shows higher latency noise than the c3.8xlarge. We would have expected any difference to favor the c4.8xlarge instances, as they are EBS-optimized. After these promising results, we examined the results of our MongoDB benchmarks next. At the time that we did this work, MongoDB had two storage engines (WiredTiger and MMAPv1), with MMAPv1 being the default, but now deprecated, option. There were differences in the results between the two storage engines, but they shared a common trend.

c3.8xlarge with PIOPS performs best, with all results below 10% noise for the WiredTiger storage engine

c3.8xlarge with PIOPS performs best, with most results below 10% noise for the MMAPv1 storage engine

No configuration was best across the board. That said, there was a configuration with below 10% noise for all but one test: c3.8xlarge with EBS PIOPS. Interestingly, while i2 was the best for our focused IO tests, it was not for our actual tests. Valuable lessons learned:

- As far as repeatable results are concerned, the "local" SSDs we had been using performed worse than any other alternative we could have possibly chosen!
- Contrary to popular belief, when using Provisioned IOPS with EBS, the performance is both good in absolute terms and very stable. This is true for our IO tests and our general tests. The latency of disk requests does have more variability than the SSD alternatives, but the IOPS performance was super stable. For most of our tests, the latter is the important characteristic.
- The i2 instance family has a much higher-performance SSD, and in fio tests showed almost zero variability. It also happens to be a very expensive instance type.
However, while this instance type was indeed a great choice in theory, it turns out that our MongoDB test results on it were quite noisy. Upon further investigation, we learned that the noisy results were due to unstable performance of MongoDB itself: as the i2.8xlarge has more RAM than the c3.8xlarge, MongoDB on the i2.8xlarge is able to hold much more dirty data in RAM, and flushing that much dirty data to disk was causing issues.

Switching from ephemeral to EBS disks in production

Based on the above results, we changed our production configuration to run on EBS disks instead of ephemeral SSD. (We were already running on c3.8xlarge instance types, which turned out to have the lowest noise in the above comparison, so we decided to keep using those.)

Performance becomes more stable when using EBS

After running with the changes for a couple of weeks, you could clearly see how the day-to-day variation of test results decreased dramatically. This instantly made the entire System Performance project more useful to the development team and MongoDB as a whole.

Conclusion

Focusing on IO performance proved useful. As it turns out, using ephemeral (SSD) disks was just about the worst choice for our performance tests; instead, using Provisioned IOPS gave the most stable results. While i2 instances were the best in our non-MongoDB benchmark tests, they proved less than ideal in practice. This highlights quite clearly that you need to measure your actual system and assume nothing to get the best results. This is the second of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EC2 instances are neither good nor bad and that CPU options are best disabled.

April 30, 2019
Engineering Blog

Causal Guarantees Are Anything but Casual

Traditional databases, because they service reads and writes from a single node, naturally provide sequential ordering guarantees for read and write operations known as "causal consistency". A distributed system can provide these guarantees, but in order to do so, it must coordinate and order related events across all of its nodes, and limit how fast certain operations can complete. While causal consistency is easiest to understand when all data ordering guarantees are preserved – mimicking a vertically scaled database, even when the system encounters failures like node crashes or network partitions – there exist many legitimate consistency and durability tradeoffs that all systems need to make.

MongoDB has been continuously running — and passing — Jepsen tests for years. Recently, we have been working with the Jepsen team to test for causal consistency. With their help, we learned how complex the failure modes become if you trade consistency guarantees for data throughput and recency.

Causal consistency defined

To maintain causal consistency, the following guarantees must be satisfied: read your writes, monotonic reads, monotonic writes, and writes follow reads.

To show how causal guarantees provide value to applications, let's review an example where no causal ordering is enforced. The distributed system depicted in Diagram 1 is a replica set. This replica set has a primary (or leader) that accepts all incoming client writes, and two secondaries (or followers) that replicate those writes. Either the primary or secondaries may service client reads.
Diagram 1: Flow of Operations in a Replica Set without Enforced Causal Consistency

1. The client application writes order 234 to the primary
2. The primary responds that it has successfully applied the write
3. Order 234 is replicated from the primary to one of the secondaries
4. The client application reads the orders collection on a secondary
5. The targeted secondary hasn't seen order 234, so it responds with no results
6. Order 234 is replicated from the primary to the other secondary

The client makes an order through the application. The application writes the order to the primary and reads from a secondary. If the read operation targets a secondary that has yet to receive the replicated write, the application fails to read its own write. To ensure the application can read its own writes, we must extend the sequential ordering of operations on a single node to a global partial ordering for all nodes in the system.

Implementation

So far, this post has only discussed replica sets. However, to establish a global partial ordering of events across distributed systems, MongoDB has to account for not only replica sets but also sharded clusters, where each shard is a replica set that contains a partition of data. To establish a global partial ordering of events for replica sets and sharded clusters, MongoDB implemented a hybrid logical clock based on the Lamport logical clock. Every write or event that changes state in the system is assigned a time when it is applied to the primary. This time can be compared across all members of the deployment. Every participant in a sharded cluster, from drivers to query routers to data-bearing nodes, must track and send its value of the latest time in every message, allowing each node across shards to converge in its notion of the latest time. The primaries use the latest logical time to assign new times to subsequent writes. This creates a causal ordering for any series of related operations.
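To make the clock mechanics concrete, here is a minimal Lamport logical clock in Python. This is a toy sketch of the underlying idea only: MongoDB's hybrid logical clock also incorporates a physical-time component, which is omitted here.

```python
class LamportClock:
    """Toy Lamport clock: time only moves forward, across all nodes."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event, e.g. a write applied on the primary, advances the clock.
        self.time += 1
        return self.time

    def observe(self, received):
        # Merge the time carried by an incoming message (e.g. replication).
        self.time = max(self.time, received)
        return self.time

primary, secondary = LamportClock(), LamportClock()
t1 = primary.tick()      # write W1 is assigned time T1
secondary.observe(t1)    # the replication message carries T1
# A causal read "after T1" on the secondary is safe once its clock has
# caught up to T1; until then, the secondary would have to wait.
assert secondary.time >= t1
```

The `observe` step is what lets every participant converge on the latest time, so a node knows when it is safe to answer a read that must happen after a given write.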
A node can use the causal ordering to wait before performing a needed read or write, ensuring it happens after another operation. For a deeper dive into implementing cluster-wide causal consistency, review Misha Tyulenev's talk. Let's revisit our example from Diagram 1, now enforcing causal consistency:

Diagram 2: Flow of Operations in a Replica Set with Enforced Causal Consistency

1. The client application writes order 234 to the primary
2. The primary responds that it has successfully recorded the write at time T1
3. Order 234 is replicated from the primary to one of the secondaries
4. The client application reads after time T1 on a secondary
5. The targeted secondary hasn't seen time T1, so it must wait to respond
6. Order 234 is replicated from the primary to the other secondary
7. The secondary is able to respond with the contents of order 234

Write and read concerns

Write concern and read concern are settings that can be applied to each operation, even those within a causally consistent set of operations. Write concern offers a choice between latency and durability. Read concern is a bit more subtle; it trades stricter isolation levels for recency. These settings affect the guarantees preserved during system failures.

Write concerns

Write concern, or write acknowledgement, specifies the durability requirements of writes that must be met before returning a success message to the client. The write concern options discussed here are w: 1 (acknowledged by the primary alone) and w: "majority" (acknowledged by a majority of nodes).

Only a successful write with write concern majority is guaranteed to be durable for any system failure and never roll back. During a network partition, two nodes can temporarily believe they are the primary for the replica set, but only the true primary can see and commit to a majority of nodes. A write with write concern 1 can be successfully applied to either primary, whereas a write with write concern majority can succeed only on the true primary. However, this durability has a performance cost.
Every write that uses write concern majority must wait for a majority of nodes to commit before the client receives a response from the primary. Only then is that thread freed up to do other application work. In MongoDB, you can choose to pay this cost as needed at an operation level.

Read concern

Read concern specifies the isolation level of reads. Read concern local returns locally committed data, whereas read concern majority returns data that has been reflected in the majority committed snapshot that each node maintains. The majority committed snapshot contains data that has been committed to a majority of nodes and will never roll back in the face of a primary election. However, these reads can return stale data more often than read concern local. The majority snapshot may lack the most recent writes that have not yet been majority committed. This tradeoff could leave an application acting off old data. Just as with write concern, the appropriate read concern can be chosen at an operation level.

Effect of write and read concerns

With the rollout of causal consistency, we engaged the Jepsen team to help us explore how causal consistency interacts with read and write concerns. While we were all satisfied with the feature’s behavior under read/write concern majority, the Jepsen team did find some anomalies under other permutations. While less strict permutations may be more appropriate for some applications, it is important to understand the exact tradeoffs that apply to any database, distributed or not.

Failure scenario examples

Consider the behavior of different combinations of read and write concerns during a network partition where P1 has been partitioned from a majority of nodes and P2 has been elected as the new primary. Because P1 does not yet know it is no longer the primary, it can continue to accept writes. Once P1 is reconnected to a majority of nodes, all of its writes since the timeline diverged are rolled back.
Diagram 3: Network Partition Timeline

During this time, a client issues a causal sequence of operations as follows:

1. At time T1, perform a write W1
2. At time T2, perform a read R1

The following four scenarios discuss the different read and write concern permutations and their tradeoffs.

Read Concern majority with Write Concern majority

Diagram 4: Read Concern majority with Write Concern majority

The write W1 with write concern majority can only succeed when applied to a majority of nodes. This means that W1 must have executed on the true primary’s timeline and cannot be rolled back. The causal read R1 with read concern majority waits to see T1 majority committed before returning success. Because P1, partitioned from a majority of nodes, cannot progress its majority commit point, R1 can only succeed on the true primary’s timeline. R1 sees the definitive result of W1. All the causal guarantees are maintained when any failure occurs.

All writes with write concern majority prevent unexpected behavior in failure scenarios at the cost of slower writes. For their most critical data, like orders and trades in a financial application, developers can trade performance for durability and consistency.

Read Concern majority with Write Concern 1

Diagram 5: Read Concern majority with Write Concern 1

The write W1 using write concern 1 may succeed on either the P1 or P2 timeline even though a successful W1 on P1 will ultimately roll back. The causal read R1 with read concern majority waits to see T1 majority committed before returning success. Because P1, partitioned from a majority of nodes, cannot progress its majority commit point, R1 can only succeed on the true primary’s timeline. R1 sees the definitive result of W1. In the case where W1 executed on P1, the definitive result of W1 may be that the write did not commit. If R1 sees that W1 did not commit, then W1 will never commit. If R1 sees the successful W1, then W1 successfully committed on P2 and will never roll back.
This combination of read and write concerns gives causal ordering without guaranteeing durability if failures occur. Consider a large-scale platform that needs to quickly service its user base. Applications at scale need to manage high-throughput traffic and benefit from low-latency requests. When trying to keep up with load, longer response times on every request are not an option.

The Twitter posting UI is a good analogy for this combination of read and write concern: the pending tweet, shown in grey, can be thought of as a write with write concern 1. When we do a hard refresh, this workflow could leverage read concern majority to tell the user definitively whether the post persisted or not. Read concern majority helps the user safely recover. When we hard refresh and the post disappears, we can try again without the risk of double posting. If we see the post after a hard refresh at read concern majority, we know there is no risk of that post ever disappearing.

Read Concern local with Write Concern majority

Diagram 6: Read Concern local with Write Concern majority

The write W1 with write concern majority can only succeed when applied to a majority of nodes. This means that W1 must have executed on the true primary’s timeline and cannot be rolled back. With read concern local, the causal read R1 may occur on either the P1 or P2 timeline. The anomalies occur when R1 executes on P1, where the majority committed write is not seen, breaking the "read your own writes" guarantee. The monotonic reads guarantee is also not satisfied if multiple reads are sequentially executed across the P1 and P2 timelines. Causal guarantees are not maintained if failures occur.

Consider a site with reviews for various products or services where all writes are performed with write concern majority and all reads are performed with read concern local. Reviews require a lot of user investment, and the application will likely want to confirm they are durable before continuing.
Imagine writing a thoughtful two-paragraph review, only to have it disappear. With write concern majority, writes are never lost if they are successfully acknowledged. For a site with a read-heavy workload, the greater latency of rarer majority writes may not affect performance. With read concern local, the client reads the most up-to-date reviews for the targeted node. However, the targeted node may be P1 and is not guaranteed to include the client's own writes that have been successfully made durable on the true timeline. In addition, the node’s most up-to-date reviews may include other reviewers' writes that have not yet been acknowledged and may be rolled back.

Read Concern local with Write Concern 1

Diagram 7: Read Concern local with Write Concern 1

The combination of read concern local and write concern 1 has the same issues as the previous scenario, but now the writes lack durability. The write W1 using write concern 1 may succeed on either the P1 or P2 timeline even though a successful W1 on P1 will ultimately roll back. With read concern local, the causal read R1 may occur on either the P1 or P2 timeline. The anomalies occur when W1 executes on P2 and R1 executes on P1, where the result of the write is not seen, breaking the "read your own writes" guarantee. The monotonic reads guarantee is also not satisfied if multiple reads are sequentially executed across the P1 and P2 timelines. Causal guarantees are not maintained if failures occur.

Consider a sensor network of smart devices that does not handle failures encountered when reporting event data. These applications can have granular sensor data that drives high write throughput. The ordering of the sensor event data matters to track and analyze data trends over time. The micro view over a small period of time is not critical to the overall trend analysis, as packets can drop. Writing with write concern 1 may be appropriate to keep up with system throughput without strict durability requirements.
For high-throughput workloads and readers who prefer recency, the combination of read concern local and write concern 1 delivers the same behavior as primary-only operations across all nodes in the system, with the aforementioned tradeoffs.

Conclusion

Each operation in any system, distributed or not, makes a series of tradeoffs that affect application behavior. Working with the Jepsen team pushed us to consider the tradeoffs of read and write concerns when combined with causal consistency. MongoDB now recommends using both read concern majority and write concern majority to preserve causal guarantees and durability across all failure scenarios. However, other combinations, particularly read concern majority and write concern 1, may be appropriate for some applications. Offering developers a range of read and write concerns enables them to precisely tune consistency, durability, and performance for their workloads. Our work with Jepsen has helped better characterize system behavior under different failure scenarios, enabling developers to make more informed choices on the guarantees and tradeoffs available to them.

October 23, 2018
Engineering Blog

Pruning Dynamic Rebuilds With libabigail

Complex C++ projects frequently struggle with lengthy build times. Splitting a project into multiple dynamically-linked components can give developers faster incremental rebuilds and shorter edit-compile-test cycles than relying on static linking, especially when there are a large number of test binaries. However, build systems usually do not realize all of the possible gains in dynamic incremental rebuilds due to how they handle transitive library dependencies. Red Hat's ABI introspection library libabigail offers one possible path to eliminating unnecessary transitive re-linking for some classes of source modifications.

The problem

Consider the following toy project containing two libraries: libserver and libclient. The server library libserver depends on the client library libclient for wire protocol code, and both the client and server support library implementations depend on a library of common utilities, libcommon. The client and server executables each use the associated support libraries.

We can see a more complete picture of our dependency graph by considering the header files and source files that are used to build these libraries, as well as the intermediate targets such as object files. We will assume that each library has one header and one source file. The dependency graph now looks like the following:

Finally, we assume a build system that can use content signatures to skip rebuilds when a dependency is regenerated with identical results. A build system that only uses timestamps cannot capitalize on the technique outlined below, because regenerated dependencies always have newer timestamps.

In this environment, what will be rebuilt if we make a meaningful change to libcommon.hpp or libcommon.cpp, and ask for the client and server binaries to be built? Well, changing libcommon.hpp is a disaster! We need to recompile libcommon.cpp, generating a new libcommon.o, and therefore a new libcommon.[a|so].
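The content-signature behavior we assume can be modeled in a few lines (a toy sketch, not SCons itself): a target is considered out of date only when a dependency's content hash differs from what the target last saw, so regenerating a dependency with byte-identical results triggers no rebuild.

```python
import hashlib

def content_signature(data: bytes) -> str:
    """Hash of a dependency's contents; identical bytes give an identical signature."""
    return hashlib.sha256(data).hexdigest()

class Builder:
    """Toy content-signature build check: remembers the dependency
    signatures each target saw at its last build."""

    def __init__(self):
        self.seen = {}  # target -> {dependency name: signature}

    def needs_rebuild(self, target: str, deps: dict) -> bool:
        sigs = {name: content_signature(data) for name, data in deps.items()}
        stale = self.seen.get(target) != sigs
        self.seen[target] = sigs  # record what this build saw
        return stale

b = Builder()
assert b.needs_rebuild("libcommon.so", {"libcommon.o": b"v1"})      # first build
assert not b.needs_rebuild("libcommon.so", {"libcommon.o": b"v1"})  # identical: skip
assert b.needs_rebuild("libcommon.so", {"libcommon.o": b"v2"})      # changed: rebuild
```

A timestamp-based builder would report `True` in the middle case too, which is exactly why the technique below requires content signatures.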
Similarly, since both libclient.cpp and libserver.cpp depend on libcommon.hpp, they need to be recompiled and the associated libraries rebuilt. Since the libserver and libclient support libraries were relinked, the executables are now out of date, so they also get relinked. The only work we avoided doing was recompiling client.cpp and server.cpp, since they don’t directly depend on libcommon.hpp. Ouch. Well, that’s C++ for you. Perhaps C++ modules will improve this situation, but we don’t live in that world yet.

The following diagrams demonstrate this graphically, where:

- The darkest red box is the entity which was directly changed.
- The intermediate red indicates an entity that is rebuilt because one of its direct dependencies is seen by the build system as changed.
- The lightest red is an unaltered entity that is seen as out of date by the build system due to a change in an implicit dependency like a header inclusion.

Changing just libcommon.cpp isn’t much better. We avoid needing to recompile lib{client,server}.cpp, but we still do a bunch of relinking. Here is how that looks for a static build:

Note that we have cheated a bit in our diagram: in a static build, lib{client,server}.a don’t really depend on libcommon.a. Instead, server and client depend on it directly. So the fact that libcommon.a changed doesn’t require us to re-run the archiver for libclient.a and libserver.a. But drawing it that way makes the diagram a lot messier.

The dynamic build is actually worse here, because in that case we do need to relink libclient.so and libserver.so, since their link-time dependency libcommon.so is newer/changed. Potentially, you might get away with not relinking client and server in the dynamic case, since the relink of libclient.so and libserver.so may very well produce identical results in this situation, and a content-signature-based build system would notice that.
But in practice, libcommon.so is likely also going to show up on the link line for client and server, since otherwise static builds won’t work. So unless you have written or generated varying library dependency lists for static and dynamic builds, libcommon.so is very likely to be on the link line for client and server too, making them additional link-time casualties.

An insight

Let us imagine that the change to libcommon.cpp was something small and innocuous, maybe fixing a typo in an internal string constant that gets logged. In a small example like this, it isn’t too painful that we needed to relink so many things. But in a larger project it can definitely hurt. It feels wrong to do so much linking for such a little change. Especially in the dynamic build, a small change deep in the library dependency graph can lead to a long chain of transitive relinking, even when many of the libraries are completely unaltered. Can we do better?

With static linking, no, not really. That updated string constant needs to exist in both executables, so we really need to relink them so that the new string constant is extracted from libcommon.a. With dynamic linking, it turns out that we can do better. The key observation is to consider what would happen if we rebuilt only libcommon.so, intentionally didn’t relink the other dynamic libraries or executables (even though the build system thinks we should), and then tried to run the executables. Would they work, and work as expected?

For the case of our proposed private string constant modification in libcommon.cpp, the answer is a definite yes. Changing that internal string constant didn’t alter the Application Binary Interface (ABI) of libcommon.so in any way, and when the executables are run, the updated value of the string constant will be reflected in the output, because the string constant wasn’t copied into the executables: it lives in the now-replaced libcommon.so.
We got away with this because our change didn’t alter the ABI of libcommon.so. If we had made a change that altered its ABI, rebuilt libcommon.so, and then tried running the executables without relinking them, we would probably be looking at a very subtle runtime crash. No fun. In theory, then, you could minimize relinking by individually naming targets to build when you knew that you had ABI-affecting changes. But in practice that is clearly error prone and just a terrible idea. But if we had a tool that could tell us when the ABI of a library had changed, then we could teach our build system how to invoke this tool as it did its dependency walk, and automatically skip any unnecessary relinks in the case of ABI-preserving modifications.

A solution

Fortunately, Red Hat has provided just such a tool as part of their new library libabigail. As they describe it: "the project aims at providing a library to manipulate ABI corpora, compare them, provide detailed information about their differences and help build tools to infer interesting conclusions about these differences."

The abidw tool that comes with libabigail reads a shared library, consults the associated ELF and DWARF information which together encode all information relevant to the ABI, and emits an XML document that describes the library ABI. Taking advantage of the flexibility of SCons, we can augment it to invoke abidw on a library immediately after we build it, compute a hash of the resulting ABI XML, and then store that hash in a file alongside the library. When another target declares that it links to the library, we tell SCons to record a dependency on the ABI hash file instead of a dependency on the library itself.

As a result, if the library is relinked but its ABI doesn’t change, then the ABI hash file will have the same contents. Since SCons uses content signatures to detect whether targets are out of date, the ABI hash file is seen as up to date, even if it was regenerated.
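Concretely, the post-build step can be sketched as follows (a simplified illustration, not the actual SCons tool; `abi_xml_for` assumes `abidw` is on the PATH, and the function names are ours):

```python
import hashlib
import subprocess

def abi_xml_for(library_path: str) -> bytes:
    """Extract the library's ABI description as XML (requires abidw on PATH)."""
    return subprocess.check_output(["abidw", library_path])

def update_abi_hash_file(hash_path: str, abi_xml: bytes) -> bool:
    """Store a hash of the ABI XML alongside the library. Returns True only
    when the ABI signature actually changed; an unchanged file means a
    content-signature build system will not relink dependents."""
    new_sig = hashlib.md5(abi_xml).hexdigest().encode()
    try:
        with open(hash_path, "rb") as f:
            if f.read() == new_sig:
                return False  # ABI-preserving relink: dependents stay up to date
    except FileNotFoundError:
        pass
    with open(hash_path, "wb") as f:
        f.write(new_sig)
    return True
```

Targets that link against the library then declare a dependency on the `.abihash` file produced this way rather than on the shared library itself.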
Since that dependency is seen as up to date, the depending target is also considered up to date. ABI-preserving modifications to a library no longer cause dependents to relink! The following diagram updates our original to include the associated ABI hash files and strips out some now-unhelpful sources and headers. We now also distinguish between a dependency relationship (solid line) and a links-to/requires relationship (dashed line):

Now, if we make an ABI-affecting change to libcommon.so, we see that libclient.so and libserver.so are relinked. But libclient.so and libserver.so only use libcommon.so internally, so their ABI has not changed. The client and server executables do not need to be relinked:

On the other hand, if we make a change to libcommon.so that does not affect the ABI, then nothing else gets relinked:

Correctness

Is it safe to do this? We believe it to be. We have not yet thought of any cases where it is not. Additionally, a false positive (e.g. a claim that the ABI changed when it didn’t) only costs us a missed optimization. A false negative would be harmful, but would represent a serious bug in libabigail. Additionally, we are currently only offering this facility as an opt-in for developer builds; the builds that we ship to customers do not use it.

There are a few important correctness issues to be aware of, however:

- There is a poor interaction with the -gsplit-dwarf flag for debug fission. libabigail uses the elfutils library for its DWARF processing, and elfutils as yet doesn’t know how to reach out to the .dw{o,p} files that -gsplit-dwarf and the associated tooling creates. Since libabigail relies on the DWARF info to identify the ABI, running abidw on a library built from objects built with -gsplit-dwarf gives incorrect results. So you can’t use both ABI-driven linking and debug fission at the same time. Presumably, this limitation will be lifted as support for the new DWARF 5 standard is incorporated into elfutils.
- The libabigail library is new, and we have had several instances where it crashed when working with our libraries. Dodji Seketeli, the author of libabigail, has been very helpful and responsive in addressing those crashes, but you will need to have a fairly bleeding-edge version of abidw available if you want this technique to work well in practice.

- Taking full advantage of the technique requires that symbol visibility annotations be correctly applied to type and function definitions, and typically that all code can be built with -fvisibility=hidden. Otherwise, entities which do not actually form part of the interface to the library are still exported and therefore are counted as part of the ABI by libabigail, leading to spurious relinking.

Performance

Is this solution performant? Unfortunately, the answer right now is a resounding “it depends”. The mongo::Status class is compiled into a library on which almost all other libraries and executables in the MongoDB server project depend. After making a non-ABI-altering edit to its implementation file, a rebuild of the all target on my machine is 40% faster when using abidw to skip relinks than when not. That is a fairly compelling win, but it is also the best-case scenario.

The worst-case scenario is pretty bad. There is a large cost to running abidw on each library. For some complex libraries it can be quite slow: running abidw on the SpiderMonkey JS engine takes upwards of 30 seconds. In terms of total compute time, a full relink with abidw takes about twice as much total CPU time as a full relink without. Another way to look at it is that running abidw is about as expensive as linking a second time. So using abidw may be worth it if you are working deep in the link graph and your work admits a high degree of link avoidance, but it may not be worth it if you are doing work that will cause lots of ABI changes. Unfortunately, it is hard to know up front which you are likely to do.
On the other hand, if you have subsets of the tree that change very infrequently, the cost for those subsets is amortized over many builds. Overall, further work on performance is likely required.

Or, maybe it isn’t. If proper header discipline is fully honored in a codebase, where every ABI-relevant function or object has a unique declaration in a header, it should be impossible to make a source code change that leads to an ABI variation in a library but does not cause all dependent libraries or programs to be rebuilt. In such a world, it should then be possible to weaken the build system rules for linking shared libraries to induce an order-only relationship, rather than a strict dependency. That would entirely obviate the need for using libabigail to detect ABI variation. Is it reasonable to expect such discipline? Are there ways to mechanically enforce it? Are there ways to subvert ABI compatibility despite such discipline? Would C++ modules offer that capability? Depending on the answers to those and similar questions, it might be better to invest time implementing that approach and associated tooling, rather than relying on ABI metadata.

Future directions

If using ABI metadata does prove to be the correct approach, there are some areas where the current implementation could be improved:

- Per the discussion above, general improvements to the speed of abidw are necessary. Other potential avenues to improve performance might include writing a compiler or linker plugin that could emit an ABI description analogous to that produced by abidw concurrently with executing the link step, obviating the need for a second pass by abidw.

- Our current practice of using pipes and file redirection in the Command body of the SCons tool is somewhat dangerous. We could change it to actually just emit the full XML into the .abidw file via the abidw --out-file option, and allow the SCons internal signature generation mechanisms to compute the hash.
However, this would end up writing hundreds of megabytes of information we don’t actually care about to disk, and clutter up the SCons cache. Potentially, adding a compression option to abidw would be an effective remediation. The total size of abidw data generated by a full build of MongoDB is 203 MB, but a simple gzip of each file brings it down to 14 MB.

- Another option to improve performance would be to eliminate XML generation entirely. We are currently doing a lot more work than needed, because we are generating XML that is then just fed right into MD5 to compute a signature. If the libabigail library had an abihash program that just emitted a signature directly, we could probably somewhat improve the performance of the tool.

- We could probably eliminate even more rebuilds by not just detecting whether the ABI changed, but by using other libabigail utilities like abidiff to do ABI compatibility detection and only relink when there was a non-ABI-compatible change in a dependency. This would allow new functions to be added to the library without requiring libraries that do not depend on those new functions to relink. We will investigate this in the future, but it probably would require a significantly more complex build system integration.

- The libabigail library currently only works on ELF platforms. It might be possible to make it work on macOS, because its debug info is also DWARF, but it would require significant effort to make it work with the Mach-O parts of the binary. There is definitely no support for Windows, though I’m curious as to whether the import library generated as part of DLL construction may contain enough information to identify the ABI. If you know, please reach out and let me know, or reply on the StackOverflow question on that topic.

Conclusion

Overall, has this approach to relink avoidance been successful?
Our current view is that the tool is not enough of a consistent win to deploy as the default for developer builds, but that the potential gains are compelling enough that we will continue to pursue the performance improvements and future directions outlined above. Should libabigail prove to be the right approach, we intend to invest time into the SCons integration to address some of the deficiencies and limitations identified above. If those issues can be resolved satisfactorily, we ultimately hope to see the tool merged into the SCons mainline. We also hope to work with the libabigail maintainers to further improve its feature set and performance. And, finally, even if the specific approach of using abidw to skip relinks proves non-viable, we are pleased with the insights that were incidentally developed regarding header discipline and linking, which may ultimately provide a zero-cost way to achieve the same goal.

If you are interested in experimenting with the tool, it is available as an Apache 2 licensed drop-in tool for the SCons build system. Thoughts, feedback, and bug reports are most welcome.

I’d like to thank Dodji Seketeli of Red Hat for writing libabigail, for his prompt responses to all of the issues that I have opened over the past year, and for his help reviewing this blog post. I’d also like to thank Mathias Stearn for his review of this post and thoughts about using header discipline to entirely eliminate the need for ABI metadata.

April 3, 2018
Engineering Blog

Considering the Community Effects of Introducing an Official MongoDB Go Driver

What do you do when an open-source project you rely on no longer meets your needs? When your choice affects not just you, but a larger community, what principles guide your decision? Submitting patches is often the first option, but you're at the mercy of the maintainer to accept them. If the changes you need are sweeping, substantial alterations, the odds of acceptance are low. Eventually, only a few realistic options remain: find an alternative, fork the project, or write your own replacement. Everyone who depends on open source faces this conundrum at one time or another.

After relying for years on the community-developed mgo Go driver for MongoDB, MongoDB has begun work on a brand-new, internally-developed, open-source Go driver. We know that releasing a company-sponsored alternative to a successful, community-developed project creates tension and uncertainty for users, so we did not make this decision lightly. We carefully considered how our choice would affect current and future Go users of MongoDB.

First, some history: Gustavo Niemeyer first announced the mgo community driver in March 2011 – around the same time that MongoDB released version 1.8.0 of the database. It currently has over 1,800 stars on GitHub and 32 contributors – including several current and former MongoDB employees. The incredible success of MongoDB in the Go community owes a great deal to Gustavo and mgo.

MongoDB itself is part of this community. As the Go language matured and gained in popularity, MongoDB found many uses for it internally. Some of the projects using it include:

- Our remote agents for automated deployment, for backup, and for monitoring.
- Our command-line operations tools, like mongodump (re-written in Go for the 3.0 server release).
- Our home-grown continuous integration system, Evergreen.
- Our cloud products, like MongoDB Atlas and Stitch, which have major components written in Go.
From this experience, our engineers contributed back to mgo: over half a dozen employees have commits in mgo, accounting for over 2,000 lines of changes. But the more we used mgo, the more we discovered limitations.

With our in-house drivers – covering popular languages with deep commercial adoption – we often start driver feature development in parallel with server feature development so that we can test them as soon as the server merges a feature. But as a community project, mgo's feature support generally lags MongoDB server development. More critically, our products that use mgo can't easily test against or take advantage of new server features. Even if we thought that Go didn't yet have critical mass in our user base to justify an in-house driver, our own company's products can't wait for new features. Sometimes, we patched a private copy of mgo to implement new features we critically needed. This isn't always easy.

In 2015, we announced our next generation drivers, built upon a published set of specifications for driver behavior. Because mgo predates this work, its conventions and internals don't match our specifications. When the server implements new features and the driver development team writes specs to match, these new specs assume implementation of prior specs. Developing comparable features in mgo can mean starting from a completely different base.

Not only does mgo have different internal conventions and behaviors than our in-house drivers, it encapsulates these behaviors in ways we found constraining. Usually, encapsulation is a good thing – a sign of good design – but many of our products benefit from low-level access to sockets, wire protocol models, and encoding. End-users don't need this access, but we have the knowledge to work with our own communication protocols and message formats safely and to great effect.
For example, our mongoreplay tool lets users replay a tcpdump of MongoDB server requests against a different server or cluster. When replaying the workload, we need server connection and authentication features – part of mgo's public API – but to replicate per-connection traffic we also need direct control over the number of socket connections and the socket message traffic, all of which is private. To enqueue requests and to read responses, we need access to the types representing the wire protocol messages – also private types that are never visible to end users.

Over time, we found ourselves copying-and-pasting parts of mgo source into project-specific libraries, or re-implementing parts of the wire protocol or driver behaviors directly. There is a real cost in the time it takes engineers to patch mgo or to write, fix, and extend a plethora of internal libraries, plus the opportunity costs of having our own products not being able to use our own server's latest features. We decided to consolidate and standardize on one implementation to address all these needs. We considered two alternatives:

- Fork mgo completely – developing at our pace, modifying internals as needed, and extending the APIs to suit our needs.
- Develop a new driver – building from the ground up to our specifications, putting it on par with our other officially-maintained drivers.

Forking mgo would have a handful of benefits but many challenges. In the benefits column, forking would minimize the impact on our existing products that use mgo, as well as for any user who chose to use our fork over the original. In the challenges column, we identified both technical and social considerations that gave us pause.
On the technical side, a fork wouldn't close the large gap to our common specifications, making new feature development much harder than for our internally-developed drivers. It also raises a tough question: what if we implement a new feature in our fork, only to find that mgo implements it a different way? The more we might take the internal architecture and the API in a different direction from mgo, the harder it would be to keep our fork a "drop-in" replacement, and the harder it would be to send patches upstream or to merge in upstream development. We felt a fork would quickly become an independent, backwards-incompatible product, despite a common lineage – undercutting the main benefit of forking. Forking would also mean taking on mgo's technical debt, which we wanted to avoid. On the social side, we knew that anything we released – whether a fork or a new driver – could have a disruptive effect on the existing mgo community. We didn't want to discourage anyone happy using mgo with MongoDB from continuing to use it. We wanted to invite people who wanted something more to try something new, rather than – via forking – implicitly asking people to pick sides in a project they already use. In light of these challenges, we decided instead to write a new, independently-developed Go driver to join the eleven other drivers in our officially-maintained driver ecosystem. A fresh start allows us to focus our efforts on four main benefits:

- Velocity: once complete, the new Go driver will evolve as fast as the server does. We'll be able to dog-food new features internally before each server GA release.
- Consistency: the new Go driver will follow our common specifications from the outset, so the new driver API will feel like other MongoDB drivers, shortening the learning curve for users. We'll also stay idiomatic to Go, such as supporting context objects for cancellable requests.
- Performance: a new driver gives us an opportunity to provide a new, higher-performance BSON library and to design the driver API in a way that gives users more control over memory allocations.
- Low-level API: for our own in-house products and other power users, we will provide low-level components for reuse, reducing code duplication across the company. Unlike the rest of the driver, this API will have no stability guarantee and no end-user support, but it will let us develop better products faster, and our users will benefit in turn.

Fortunately, we were able to start from a prototype driver custom-developed for our BI Connector – written by a former driver engineer – and build from that base towards the common driver specification. We're now finalizing the details of the new BSON library and the core CRUD API. What's next for the driver? In the coming months, we'll ship an "alpha" release of the Go driver and make the code repository public. At that point we'll ask members of the Go-using MongoDB community to try it out and help us improve it with their feedback. Update, 2/19/2018: The new driver is now in alpha; please read the announcement for more info about trying it out.

January 11, 2018
Engineering Blog

Investing in CS4All: One Year Later

When a couple of New York City high school teachers partnered with MongoDB to teach computer science, did they succeed? Their curriculum was untested, and they were teaching in difficult districts where most students are from poor and minority families. I talked with these two teachers, Jeremy Mellema and Timothy Chen, back in September, when they had completed a summer fellowship at MongoDB and had just started teaching their curriculum; at the end of the academic year this spring, I visited Jeremy and Tim again to find out the result. Their successes were sparse and partial. They discovered that their students' poor reading skills were a barrier to learning to code, and that teaching new coders how to solve problems is, itself, an unsolved problem. With a coarse unit of iteration – a school semester – it is painfully slow to experiment and find teaching methods that work. But even partial wins make a difference for individual kids, and the support of professional engineers at companies like MongoDB can be a powerful accelerant.

What engages students

Jeremy's main struggle was to get his students excited about code. He was assigned to teach a computer science class at Bronx Compass High School in the fall, using the curriculum he wrote during his fellowship at MongoDB last summer. In the beginning he spent too much time lecturing. “It felt weird,” he said. “It should be more like, ‘Let’s get down and dirty,’ and not, ‘Let’s have me talk to you.’” Even when his students did get their hands on computers, the first exercises were simply retyping Python scripts from a textbook. The payoff, watching a script run without throwing an exception, was hardly satisfying to them. Things started to click when he introduced Python turtle graphics, which gave the class more obvious evidence of their accomplishments. It also allowed Jeremy better opportunities to motivate and engage his students directly.
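Part of turtle's appeal is that a visible result takes only a little arithmetic. As an illustrative sketch (not Jeremy's actual classroom code, and the function name here is my own), the geometry behind the classic five-pointed turtle star comes down to one exterior angle:

```python
# A five-pointed star closes after five edges because the pen turns through
# 144 degrees at each point, and 5 * 144 = 720 -- exactly two revolutions.

def star_commands(points=5, size=100):
    """Return (forward-distance, right-turn) pairs tracing a star outline."""
    turn = 180 - 180 / points  # 144.0 for a five-pointed star
    return [(size, turn) for _ in range(points)]

# Feeding the pairs to the standard library's turtle module draws the star:
#
#   import turtle
#   for dist, angle in star_commands():
#       turtle.forward(dist)
#       turtle.right(angle)

print(star_commands())
```

The same function with `points=7` traces a seven-pointed star, which is one way a student "beats" the teacher's drawing.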
“Some days I would challenge them to see who could make the craziest drawing,” Jeremy says. He would tell his students, “That’s so cool. I only made a star. You definitely beat me today.” Jeremy teaches both history and computer science, and he finds that some of his lowest-performing history students are his best CS students. “It’s satisfying to see them in their element,” he says. In Jeremy's view, a computer science class can touch a student's intellect just as deeply as history. “People are multifaceted. You’re not only who you are when you’re in my history class.” Jeremy's computer science class was cancelled this spring; the students at Bronx Compass High School are behind on history credits and there are only three history teachers on staff. For now, computer science is merely an elective, so Jeremy is back teaching history full-time. “I really miss teaching CS,” he says. If he resumes the course, Jeremy thinks it must be livelier. He is reconsidering his use of the videos from How To Think Like a Computer Scientist, which he studied last summer on the recommendation of his mentor Shannon Bradshaw, MongoDB’s vice president of education. The content helped Jeremy train to teach CS, but when he showed the videos to his kids they were bored. Jeremy hopes to make new videos that will draw them in. His students from the fall semester say he should get their advice. Otherwise, they warned, “you might do something that you think is cool but it’s actually super corny.”

A head start

Although there is no computer science elective this semester, some students are pursuing the topic in other ways. A young woman from his class in the fall, Tatyana Camacho, now interns for the high school’s IT department. I had quoted her in my previous article, and Jeremy tells me she loved it. She commanded him to show it to her father at the next parent-teacher conference: “You need to show my dad that I’m one of the advanced students.” Jeremy still runs the afternoon Computer Club.
I visited the club to meet a student, Daniel Rodriguez, who was tinkering with an Arduino and a circuit board that the school provided. “I don't have the ability to get this equipment otherwise, in my predicament,” says Daniel. He starts his Arduino projects by copying examples. The wiring is easier than the coding for him, he told me, "especially because I'm not the best speller in the world." Once he has an example working, he modifies it to his own taste. Most recently, he wanted to show a message, but with only LEDs he can’t display much. He researched Morse Code and made a light flash the code for “HELLO”, like any programmer demonstrating a system for the first time. “Most people think that once you plug something in, that’s it, it works,” says Daniel. “But I’m the person that makes the circuit run. I tell people, ‘I made it do that.’ And seeing them fascinated by what I did, it makes me, in turn, fascinated by what I'm doing.” Daniel has to return the Arduino at the end of the year. Next year he’ll go to a trade school for electricians. Working with the Arduino will give him an advantage, he hopes, and it seems plausible to me. As he finishes school and starts work as an electrician, the world will be changing around him: smart appliances and programmable components will be everywhere. An electrician who loves to code will have a big head start.

Making anything they want

Timothy Chen teaches in Hell’s Kitchen, at Urban Assembly Gateway School for Technology. I visited his class in May to see how his students had progressed since I last saw them in September. They were involved in a multi-week project called the AP Create Task, part of a national Advanced Placement exam. “They are allowed to make literally anything they want, in any language,” says Tim. Students submit their code and a one-minute video of the program in action, and they may describe their project either in writing or in audio narration.
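Projects this small really can be made in any language. Daniel's Morse-code hello, described above, reduces to a lookup table plus timing rules; here is an illustrative Python sketch (his actual program would have been an Arduino sketch in its C-like language, and the table and timing unit here are my assumptions):

```python
# Morse code for just the letters in "HELLO"; a full table has 26 entries.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}

DOT = 1  # one time unit of LED-on for a dot; a dash is three units

def encode(word):
    """Translate a word into Morse symbols, letters separated by spaces."""
    return " ".join(MORSE[letter] for letter in word)

def flash_durations(code):
    """Turn Morse symbols into LED on-times, measured in dot units."""
    return [DOT if symbol == "." else 3 * DOT
            for symbol in code if symbol != " "]

print(encode("HELLO"))                      # .... . .-.. .-.. ---
print(flash_durations(encode("HELLO")))     # sixteen flashes in all
```

On real hardware each duration would drive a digital-write-high, delay, digital-write-low cycle; here the list of on-times stands in for the blinking LED.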
I was surprised by how 21st Century the test is, and how accommodating it could be to students with a deficit in reading and writing. It must be remarkably difficult, however, to score fairly. The Create Task is many students’ first time scoping and integrating a sizable project, and there were flameouts. One young man tried to make a maze game drawn with ASCII characters; it proved too ambitious and he ran out of time. Tim isn’t supposed to help students define the scope of their projects, but if they announce they’re going to tackle something difficult he will push them to list all the components. In the best case, they realize they don’t know how to do most of the project and choose something simpler. One of Tim’s students, Jahseem Maxwell, was building a Go Fish card game in Python, and she was having trouble integrating the pieces. “It has to be a certain order and it's hard to make that order when you don't know, really, what you’re doing. I’m struggling, putting it all together.” Another student, Cecilia Gonzalez, was writing a Choose Your Own Adventure game. She says the AP Create Task encourages students to work in pairs. “We work sort of together but not exactly.” Each must create at least one significant part of the program independently. Cecilia’s game is based on a monster of urban legend called The Rake, which comes closer when you think about it. The game begins by asking questions such as the player’s name and height. She told me the player’s answers will determine “some things that are going to happen,” but she didn’t give away any spoilers. When Tim began teaching the class in September, he hadn’t written the ending yet, either. His greatest fear was his students would learn the curriculum faster than he could write it. By May it was clear that wasn’t a problem. “Some of the students can’t read very well, and that was a big barrier because all the things I made were text,” he says. 
“Everything just took longer than expected.”

Problem solving

How do you teach problem solving? This is Tim’s great unanswered question from the year. Perhaps if high school computer science were taught like math, as a series of small problems with only one right answer each, then how to solve those problems wouldn't be such a mystery. But high school CS is taught like art class. Tim’s students invent new projects and somehow solve the unpredictable problems that arise in them. Tim speculates that he would learn problem solving himself by watching an experienced programmer solve a new problem, hit roadblocks, and overcome them. Indeed, that is how I have taught problem solving to MongoDB interns. Together, we attack problems without knowing the answers beforehand. It requires an entire summer of one-on-one collaboration. “I don't think that model works very well with the kids,” says Tim, “especially if they are not very good with sitting still for an extended period. I'm not sure how to reach them.”

Falling in love

Tim, like Jeremy, wants to make more multimedia to reach students despite their poor reading skills. “I want to rethink how it should be done before I start this time. I kind of jumped into it too quickly.” Tim’s main goal is to give kids the chance to fall in love with programming and continue on their own. Many other goals are still out of reach: students at his school score low on the AP test, and few of them are likely to get a college degree in CS or become professional coders. Still, Tim hopes that a more varied course, with audio and video, could bring students farther. “The big hurdle for everyone is teaching problem solving. If I can get that, everything else is easy. I'm still trying to figure out how to do that.”

October 3, 2017
Engineering Blog

Farewell, Solaris

Solaris was the first “real operating system” I ever used. The Brown University Computer Science Department was a Sun Microsystems shop when I was an undergraduate there in the late 90s. When I took the operating systems lab class, CS-169, we implemented a toy version of Sun’s research operating system, Spring OS. Several of my contemporaries in the CS Department went on to work at Sun, and developed or advanced many of the technologies that made Solaris great, like ZFS, dtrace, libumem, mdb, doors, and zones. The Solaris Linkers and Libraries Guide remains one of the best ways to develop an understanding of shared library internals. The first startup I worked for developed on Solaris x86, because the team knew Solaris well. Today, many of my co-workers on the server engineering team here at MongoDB share that formative experience with Solaris. We have a great deal of collective nostalgia and appreciation for Solaris and the amazing engineering effort that went into its development. So it is, for many of us at MongoDB, bittersweet to announce that MongoDB is terminating support for Solaris. Effective immediately, we plan to cease production of new builds of MongoDB for Solaris, across all supported versions of MongoDB. Existing release artifacts for Solaris will continue to be made available, but no new releases will be issued, barring a critical issue raised under an existing support contract covering MongoDB versions 3.0 through 3.4 running on Solaris. We will continue to fix critical flaws for the community, regardless of where they are found or how they are reported. Anyone can report a security vulnerability by using our Security project to create an account, then a ticket describing the vulnerability. This was not an easy decision for us to make, and we feel it is important to provide some background on why we have made what may at first seem to be a capricious decision.
The principal reason for us to drop Solaris support is simply a lack of adoption among our user base. Of our commercial users, we knew of only a handful who had ever been running on Solaris, and all confirmed that they had migrated away, or were in the process of doing so. Our download numbers for our Solaris builds confirmed this lack of interest, as did stats gathered from our managed operations tools — we find about 0.06% (and decreasing) of MongoDB users are running on Solaris. Additionally, we found that the cost to continue to support Solaris was very high, and that it was increasingly becoming an obstacle to developer productivity. Among the difficulties we experienced:

- The ecosystem is fragmented: You say you want to run on “Solaris”. OK, fine, but which one? Illumos? OpenSolaris? OpenIndiana? SmartOS? Oracle Solaris? OmniOS? Do we release for one of those? If so, which one? All? How do versions relate across them? What is the level of binary compatibility? Which versions do we need to support, on which variants? How do we certify that we support all of them? Linux suffers from (or even benefits from) a similar profusion of flavors, but we have a significant population of users on each of the major flavors, so it makes sense to explicitly support them all. Not so for the numerous Solaris variants.
- Our development tools work poorly on Solaris: Clang doesn’t support Solaris at all, as far as we can tell. Golang doesn’t consider it a first-class platform. GDB seems unable to handle the simplest of tasks when confronted with threads on Solaris. GCC appears not to fully support important C++11 features like the thread_local keyword, due to missing support in the Solaris C library, at least on the versions we used. Obviously, all of these could be fixed if there were a pressing commercial upside to doing so, but we don’t see that to be the case.
- Lack of developer familiarity: While several of our senior developers know their way around Solaris well, our junior devs have never touched it. Investing in teaching them is of questionable value. Sometimes, though we try hard to scrub them out, we find that some tests are flaky. Sometimes a failure exhibits on Solaris, but the issue isn’t specific to that platform. Should we be investing the time of our most seasoned engineers to track these down to prove that they aren’t Solaris-specific?
- Operational difficulties: Most of our CI testing is done in AWS. On multiple occasions, our Solaris images have simply stopped working, and repairing them took significant engineering work. The most recent outage was particularly troublesome and time-consuming. The engineering effort required to sustain the platform does not seem warranted.
- The future of Oracle Solaris, perhaps the one true Solaris if you had to pick one, is murky at best.

While no single one of these issues seems sufficient on its own to argue for terminating support for Solaris, when combined with the observed lack of interest or use, it makes for a compelling case. We would rather invest our time and effort developing for the platforms that our users actually use. So, with some real sadness and fond memories, we have decided to say goodbye to Solaris. We will miss you, Solaris, but it is time we parted ways.

August 29, 2017
Engineering Blog
