Evaluating MongoDB against some of industry’s toughest data safety, correctness, and consistency tests
Since debuting in 2013, Jepsen has established itself as one of the most rigorous distributed systems testing suites available in the industry today. It evaluates data correctness and safety in the face of extreme failure scenarios – including simultaneous network partitions, drifting systems clocks, and repeated node crashes. It is regarded by many in the database community as the gold standard in evaluating the behavior of distributed systems under critical and cascading infrastructure outages.
MongoDB has worked with Jepsen since 2015 to publicly evaluate database behavior in the face of multiple system failure scenarios.
MongoDB’s most recent joint testing with Jepsen was against MongoDB 3.6.4, with the Jepsen analysis concluding:
"After weeks of testing both insert-only and update-heavy workloads against sharded clusters, we’ve found that MongoDB’s v1 replication protocol appears to provide linearizable single-document reads, writes, and compare-and-set, through shard rebalances and network partitions."
"Thus far, causal consistency has generally been limited to research projects...MongoDB is one of the first commercial databases we know of which provides an implementation."
- To achieve linearizable reads, writes, and compare-and-set operations, the Jepsen analysis states users must configure read concern linearizable and write concern majority. This is also noted in the MongoDB read concern and write concern documentation.
- The outcome of the 3.6.4 tests are the result of significant multi-year engineering investments in MongoDB's distributed systems design including consistency, isolation, and durability controls; the implementation of a cluster-wide global logical clock; and an enhanced RAFT-based replication consensus protocol.
- We have complemented these investments by integrating Jepsen tests into the MongoDB Evergreen continuous integration suite. This ensures adherence to these specific Jepsen tests in MongoDB’s ongoing development work and future releases.
In May 2020, Jepsen tested MongoDB 4.2.6, introducing a new transaction analysis tool called Elle into the testing suite. The analysis observed anomalies in transactional behavior in the presence of multiple network failures on a sharded cluster.
The testing uncovered a bug that can lead to a previously committed write being incorrectly retried in the presence of a primary failover and a subsequent transaction commit retry. This bug has been fixed, and will be available to users in MongoDB 4.2.8 onwards. The MongoDB test suite has also been updated to ensure that this specific phenomenon is detected in future releases.
Resources to Learn More
- Implementation of Cluster-wide Logical Clock and Causal Consistency in MongoDB (pdf): SIGMOD 2019 paper.
- TPC-C benchmarking (pdf). Presented at the 2019 VLDB Conference, the paper discusses how MongoDB adapted TPC-C for a document database, and the performance results we measured.
- MongoDB Evolved: Summary of the key MongoDB product milestones over the past 7+ years.
- MongoDB Architecture Guide: Provides an overview of MongoDB technology.
- MongoDB Server Documentation