Documents Are Everywhere

Eliot Horowitz

#Technical

Over the past decade, following MongoDB’s lead, a raft of new document databases has been introduced and legacy databases have added document capabilities. In 2017, Microsoft layered an API for MongoDB on top of Cosmos DB (at the time called “DocumentDB”, but no longer), and recently Amazon released DocumentDB, which presents a subset of the MongoDB query language atop its Aurora technology. The document model, and the MongoDB API in particular, is flourishing.

As MongoDB CEO Dev Ittycheria shows in his post "The Future Will Be Documented", it's easy to see why. Documents are a superset of the popular data models. Using both embedding and references, they can represent key-value, relational, and graph data sets, as well as parent-child, list/array, and other hierarchical relationships. Because documents map more naturally to in-memory data structures, developers can work with them more easily and focus on building their application the way that makes the most sense, not on how to accommodate the database. Instead, documents accommodate developers, dramatically increasing productivity and speeding innovation.
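As a minimal sketch of embedding and references, the Python snippet below models a blog post as a plain dictionary, which is roughly how a document appears to application code. The field names (author_id, comments, tags) and the in-memory users lookup are hypothetical, chosen only for illustration; they are not taken from any MongoDB schema.

```python
# Hypothetical "users" collection, keyed by _id.
users = {
    "u1": {"_id": "u1", "name": "Ada"},
    "u2": {"_id": "u2", "name": "Grace"},
}

# One document combining several modeling styles.
post = {
    "_id": "p1",                          # key-value style lookup key
    "title": "Documents Are Everywhere",
    "author_id": "u1",                    # reference to another document
    "tags": ["databases", "documents"],   # list/array
    "comments": [                         # embedded parent-child data
        {"author_id": "u2", "text": "Nice post"},
        {"author_id": "u1", "text": "Thanks"},
    ],
}


def resolve_author(doc, users_by_id):
    """Follow a reference by looking up the document's author_id."""
    return users_by_id[doc["author_id"]]


author_name = resolve_author(post, users)["name"]
commenters = [resolve_author(c, users)["name"] for c in post["comments"]]
print(author_name, commenters)  # Ada ['Grace', 'Ada']
```

The embedded comments travel with the post and need no second lookup, while the author is stored once and referenced by id, so each relationship can use whichever style fits it.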

Comparing Managed Services

Amazon DocumentDB is a managed database service like MongoDB Atlas, the MongoDB service we introduced nearly three years ago. How does it compare?

Six Years Behind

We found in our correctness testing (detailed below) that DocumentDB most resembles MongoDB 2.4, a six-year-old version of MongoDB. Atlas, of course, runs the current version, MongoDB 4.0, which introduced notable features like multi-document ACID transactions, change stream cursors for acting on data changes in real time, and new type conversion operators for the aggregation framework.

Unambitious Distributed Architecture

DocumentDB takes a narrower approach to distributed systems than Atlas does. To achieve its high durability and availability goals, DocumentDB relies on Aurora, a storage-layer technology that replicates data to six storage nodes, two per availability zone within a region. This simplifies operations and lets DocumentDB separate compute and storage concerns, but it also comes with trade-offs.

  1. All DocumentDB clusters are limited to residing in a single region. This is a severe restriction compared with Atlas, which allows replica sets to span the globe, providing low-latency reads to application nodes wherever they are.
  2. DocumentDB does not implement sharding, limiting its ability to scale out.
  3. DocumentDB lacks advanced features like Global Clusters, which intelligently route locale-aware documents to specific shards around the world. Global Clusters ensure optimal latency to a document by automatically storing it close to its point of use, and enforce geo-location of documents, making GDPR compliance easy with MongoDB Atlas.
  4. DocumentDB does not implement the MongoDB API's tunable consistency options. Even in cases that call for higher throughput and can accept reduced durability guarantees, like streaming IoT sensor data, user tracking, or large-scale social media platforms, clients must wait for every write to reach a majority of the storage nodes.
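To make the cost of the last point concrete, here is a rough sketch of quorum acknowledgement in Python. It is a simplified model, not DocumentDB's or MongoDB's actual replication logic: it treats all nodes as interchangeable, and the per-node latencies are made-up numbers. The function name ack_latency_ms is hypothetical.

```python
def ack_latency_ms(node_latencies_ms, w):
    """Return the time until a write is acknowledged.

    node_latencies_ms: time each node takes to persist the write.
    w: how many nodes must confirm, either an integer or "majority".
    """
    n = len(node_latencies_ms)
    required = n // 2 + 1 if w == "majority" else int(w)
    if not 1 <= required <= n:
        raise ValueError("w must be between 1 and the number of nodes")
    # The write is acknowledged once the required-th fastest node confirms.
    return sorted(node_latencies_ms)[required - 1]


# Hypothetical persistence times for six storage nodes, in milliseconds.
latencies = [2, 3, 4, 9, 15, 40]

print(ack_latency_ms(latencies, 1))           # 2
print(ack_latency_ms(latencies, "majority"))  # 9 (4 of 6 nodes)
```

In this model, a client that could tolerate acknowledgement from a single node is done after 2 ms, while a majority requirement makes every write wait for the fourth-fastest node. A tunable setting lets the application choose per workload; a fixed majority setting does not.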

Segregated

DocumentDB lacks integration with real-time event, code execution, or analytics facilities. Atlas integrates with all of these via MongoDB Stitch, our serverless application platform. Stitch provides database triggers to act on real-time data changes, which in turn invoke lightweight serverless functions, another feature of Stitch. Within those functions, developers can integrate directly with third-party services, or use the AWS SDK to leverage the full capabilities of the AWS platform. Atlas also incorporates a built-in data explorer, a document-native business intelligence tool called MongoDB Charts, and the BI Connector, an SQL proxy that lets teams leverage the large ecosystem of legacy BI tools. DocumentDB is essentially stranded -- if you want to use its data, you have to build a custom application.

Challenges For Development

Before an application can be deployed to a managed database service, it has to be developed. DocumentDB makes that very difficult. There is no downloadable option, and the cheapest instance costs $200/month before even adding the cost of IO. Applications can't be developed locally against MongoDB because of compatibility issues, so it's not clear how a team is supposed to develop applications for DocumentDB.

Evaluating Compatibility and Performance

The DocumentDB documentation states that application migration is "as easy as changing the database endpoint to the new Amazon DocumentDB cluster", and that it offers "twice the throughput of currently available MongoDB managed services." Let's put that to the test.

Correctness

Compatibility Report Card

Based on our tests, DocumentDB most resembles MongoDB 2.4, a six-year-old version of MongoDB. We established that by running the MongoDB API test suite against DocumentDB; it passed only 35% of our correctness tests. In most cases, the tests failed because a MongoDB feature simply does not exist in DocumentDB.

  • In the query language, 18 out of 25 aggregation stages are missing, along with over 80 operators (including the entire set of date-related operators), so DocumentDB will have a problem handling analytics workloads.
  • The join and graph operators are missing, so modeling relational or graph data is off the table, and full-text and geospatial indexes are also absent.
  • DocumentDB does support most of the BSON document standard, but does not include the Decimal number type, which will significantly complicate use of DocumentDB in financial and scientific applications.
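The last point is easy to demonstrate outside any database. The sketch below uses Python's standard-library decimal module as a stand-in for a decimal number type; it is not the BSON Decimal type itself, and the amounts are arbitrary examples.

```python
from decimal import Decimal

# Binary floating point cannot represent 0.10 or 0.20 exactly,
# so the sum carries a small rounding error.
float_total = 0.10 + 0.20

# A decimal type stores the digits as written, so the sum is exact.
decimal_total = Decimal("0.10") + Decimal("0.20")

print(float_total == 0.30)               # False
print(decimal_total == Decimal("0.30"))  # True
```

Without a decimal type in the database, an application has to work around this itself, for example by storing amounts as integer cents or as strings, and then convert on every read and write.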

Another significant gap in DocumentDB is role-based access control. DocumentDB users always have access to all the databases in a cluster, as per its documentation.

The full list of failures is beyond the scope of this article, but you can find it linked from the GitHub repository where we've posted our test results.

Given these glaring feature gaps, DocumentDB is infeasible for the more sophisticated use cases most people care about, and it's out of the question as a drop-in replacement for MongoDB.

Performance

We compared the performance of DocumentDB against Atlas using two benchmarks: YCSB and Socialite. The DocumentDB cluster used three r4.4xlarge instances, and the Atlas cluster used three M60 instances. This produced clusters with nearly identical costs. All writes in these tests were performed with w:majority, even those which would ordinarily use w:1 on Atlas, to normalize the test results.

YCSB

YCSB is a "lowest common denominator" type benchmark, and only uses primary key queries. We ran three YCSB workloads, each on two data sets. One data set was small enough to fit entirely in RAM, while the other was much larger than RAM. Based on our knowledge of how customers use MongoDB, all data sets used 2.5 KB documents containing 25 fields.
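For a sense of scale, the following back-of-the-envelope Python calculation estimates the raw size of the two data sets from the document size above and the document counts reported with the results (4 million and 81 million). It assumes 1 KB = 1000 bytes and ignores index, storage-engine, and compression effects, so treat the figures as rough estimates only.

```python
DOC_SIZE_BYTES = 2.5 * 1000   # assumption: 1 KB = 1000 bytes
FIELDS_PER_DOC = 25


def dataset_size_gb(doc_count):
    """Raw data size in gigabytes, ignoring indexes and compression."""
    return doc_count * DOC_SIZE_BYTES / 1e9


bytes_per_field = DOC_SIZE_BYTES / FIELDS_PER_DOC
small_gb = dataset_size_gb(4_000_000)
large_gb = dataset_size_gb(81_000_000)

print(bytes_per_field)  # 100.0 bytes per field on average
print(small_gb)         # 10.0 GB
print(large_gb)         # 202.5 GB
```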

YCSB Results with 4MM Documents

YCSB Results with 81MM Documents

Atlas outperformed DocumentDB on all workloads tested except the 95% reads / 5% writes workloads.

During this testing, we found that DocumentDB crashed frequently during YCSB's load phase when we tried to run it on datasets containing more than 200 million documents. We were unable to determine the root cause of these crashes, but we measured failover times of between two and four minutes. In a real-world scenario this could lead to significant, repeated outages.

Socialite

Socialite is a benchmark we developed years ago to test MongoDB performance as part of our regression testing. Its workload simulates a social networking application, so it uses a more real-world access pattern that includes complex querying. Unlike YCSB, it can only be run against the MongoDB API, and until now has not been used to compare MongoDB to other databases, so it's not optimized in favor of Atlas.

Socialite exposed serious difficulties DocumentDB has with sophisticated queries. In multiple scenarios, DocumentDB's query optimizer ignored indexes and used collection scans, leading to very weak performance:

Socialite benchmark results
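To illustrate why ignoring an index is so costly, here is a toy in-memory comparison in Python. It is only a sketch of the general idea: the collection, the user field, and the dictionary-based index are all hypothetical, real databases typically use B-tree indexes, and nothing here models DocumentDB's actual query planner.

```python
# Hypothetical collection: 10,000 documents spread over 1,000 users.
docs = [{"_id": i, "user": f"user{i % 1000}"} for i in range(10_000)]


def collection_scan(docs, user):
    """Examine every document and keep the ones that match."""
    examined = 0
    matches = []
    for doc in docs:
        examined += 1
        if doc["user"] == user:
            matches.append(doc)
    return matches, examined


# Build a simple index once: user value -> positions of matching documents.
index = {}
for pos, doc in enumerate(docs):
    index.setdefault(doc["user"], []).append(pos)


def index_lookup(docs, index, user):
    """Jump straight to the matching documents via the index."""
    positions = index.get(user, [])
    return [docs[p] for p in positions], len(positions)


scan_matches, scan_examined = collection_scan(docs, "user42")
idx_matches, idx_examined = index_lookup(docs, index, "user42")

print(scan_examined, idx_examined)  # 10000 10
```

Both paths return the same ten documents, but the scan touches every document in the collection to find them. The gap widens as the collection grows, which is why a planner that falls back to collection scans hurts most on large data sets.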

The testing harness we used to obtain these results is publicly available. You can use it to vet our results, or as a jumping-off point for any tests you want to conduct. We'd love to know what kind of results you see.

Overall, our results show that DocumentDB performs well with extremely simple find() statements, either for single documents or for ranges, using only primary keys. It begins to suffer, however, when we introduce writes into the mix, falls behind badly on write-heavy loads, and has serious difficulties when we use anything beyond rudimentary query language operations.

Conclusions

Because it passed only 35% of our correctness tests, DocumentDB is infeasible as a drop-in replacement for MongoDB. It is a reasonable first-pass implementation of a document database, suitable for read-heavy workloads that stick to simple queries and do not need to support large-scale, distributed applications.

Documents have the ability to power better, general-purpose databases that are more suited to the full range of distributed, large-scale, real-time applications that are now the norm. MongoDB Atlas is the only service that gives developers the full power of MongoDB, the global reach and scale of real distributed systems, and the ability to run on every major public cloud.