Benchmarking: Do it right or don't do it at all

If you don't know how to run a particular database, don't bother trying to run benchmarks on that database. For example, when EnterpriseDB sponsored OnGres to benchmark MongoDB against PostgreSQL, OnGres made a range of basic errors in their use of MongoDB. These errors meant that a few minutes of applying best practices to one of the benchmarks resulted in MongoDB execution times being orders of magnitude better.

OLAP Benchmark

| Queries  | PostgreSQL execution time | OnGres MongoDB execution time | Corrected MongoDB execution time | MongoDB vs PostgreSQL percentage difference |
|----------|---------------------------|-------------------------------|----------------------------------|---------------------------------------------|
| Query A  | 1h 28m 15s                | 1h 8m 44s                     | 22s                              | 98.73%                                      |
| Query B  | 41m 3s                    | 1h 13m 3s                     | 3m 30s                           | 91.47%                                      |
| Query C  | 48m 37s                   | 1h 14m 25s                    | 8m 37s                           | 82.28%                                      |
| Query D**| 1h 7m 23s                 | 2h 23m 44s                    | 41m 0s                           | 39.15%                                      |

** We'll come back to Query D later.

As the MongoDB engineering team discovered when it investigated, that kind of sloppy methodology was repeated throughout OnGres's report.

  • OnGres used an unsupported, experimental driver without connection pooling for MongoDB while using production-level drivers and third-party connection pooling for PostgreSQL
  • OnGres explicitly said they did not tune MongoDB while extensively tuning PostgreSQL
  • OnGres did not follow documented MongoDB best practices
  • OnGres created custom, synthetic benchmarks with unrealistic workloads
  • OnGres created indexes and then did not use them, or created different indexes on the two databases under test
  • OnGres's custom benchmarks contained defective queries

When our team applied best practices and corrected the bad indexing, they found MongoDB running faster than PostgreSQL on the same benchmarks. In fact, so did OnGres:

For this test, if [Postgres] connection pooling were totally discarded, MongoDB would significantly excel in all cases except for the optimal performance point

A note in the OnGres Report

OnGres ran three benchmarks, testing transactions, OLTP, and OLAP performance, but made errors in the execution of each that led to each benchmark being fatally flawed. The pattern of errors reinforces our belief that vendors who want to benchmark should only benchmark the product they know, using industry standard benchmarks. They should then make those benchmarks reproducible and publish their results in full. Then, and only then, can users, customers, and independent analysts make a comparison.

Here's what else we found wrong in OnGres's benchmarks:

Unsupported Drivers in Play

We start with the transactions test. Production MongoDB drivers have connection pooling, which makes OnGres' choice of driver particularly strange. They used an experimental, unsupported, non-production Lua driver for the sysbench implementation they created to perform one of their three tests. That driver has no connection pooling at all and was last updated two years ago. Sysbench requires Lua, so any reasonable tester would have looked for an alternative benchmark rather than start from such an unlevel playing field.

But OnGres went further, putting the PgBouncer connection pool in front of their PostgreSQL instances, giving PostgreSQL the ability to reuse connections and achieve much higher performance than they could with MongoDB. Sysbench was, according to OnGres, the only option for OLTP performance testing, yet our experts found in OnGres's benchmark repository that they had also run YCSB tests with production drivers on both databases. They chose not to publish those results. YCSB benchmarks for MongoDB were published in 2015 and the client code is available.
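The impact of pooling is easy to see in miniature. The sketch below is a deliberately simplified toy, not any driver's actual implementation: the `FakeConnection` and `ConnectionPool` classes are invented for illustration. The point is that a pool reuses established connections instead of paying the setup cost on every request, which is exactly the advantage PostgreSQL was given here and MongoDB was denied.

```python
from collections import deque

class FakeConnection:
    """Stand-in for a database connection; opening one is the expensive part."""
    created = 0  # counts how many connections were ever opened
    def __init__(self):
        FakeConnection.created += 1

class ConnectionPool:
    """Minimal pool: hand out an idle connection if one exists, else open one."""
    def __init__(self):
        self._idle = deque()
    def acquire(self):
        return self._idle.popleft() if self._idle else FakeConnection()
    def release(self, conn):
        self._idle.append(conn)

pool = ConnectionPool()
for _ in range(1000):          # 1000 sequential requests...
    conn = pool.acquire()
    pool.release(conn)         # ...all served by one reused connection
print(FakeConnection.created)  # 1 with pooling; a pool-less client would open 1000
```

A benchmark where one side amortizes connection setup this way and the other side does not is measuring the pooling strategy, not the database.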

OnGres leaned heavily on these sysbench results for their headline numbers. But given a non-production, experimental MongoDB driver with no connection pooling on one side, and production PostgreSQL drivers with PgBouncer connection pooling on the other, there is no reasonable basis on which the results can even be compared.

Off-the-shelf versus fully-tuned

For all the tests, OnGres took MongoDB off the shelf and compared it with their own heavily tuned PostgreSQL, while also ignoring a number of MongoDB best practices. Benchmarking should apply the same level of configuration to every product under test; any asymmetry in configuration introduces bias into the results. OnGres also made the remarkable claim that "In general, MongoDB does not require or benefit from significant tuning". All software benefits from tuning for its workload, as evidenced by the page of tuned PostgreSQL parameters that followed that claim.

In general, MongoDB does not require or benefit from significant tuning

A claim in the OnGres report

We asked our experts to tune the database and queries to an equivalent level so that the comparison would no longer carry that asymmetry. This kind of tuning is documented in our production notes, part of the MongoDB documentation. Once applied, MongoDB performed up to 240x faster than OnGres could achieve with PostgreSQL, with query times reduced from hours to seconds.
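As one small example of the kind of documented tuning that applies to MongoDB just as it does to PostgreSQL, the production notes cover knobs such as the WiredTiger cache size. A minimal mongod.conf fragment might look like the following; the 8 GB figure is an arbitrary illustration for this sketch, not a recommendation for any particular workload.

```yaml
# mongod.conf (fragment): cap the WiredTiger internal cache
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8
```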

Custom Benchmarking

Benchmarks should be as close as possible to real-world workloads to have any meaning. Custom synthetic benchmarks can amplify quirks in a system, or be written to favor one system over another. In OnGres' case, two of the benchmarks were their own creation. The OLTP benchmark was based on a teaching example for Python users written by a MongoDB developer advocate; it was written to show a way to move relational transactional code to MongoDB and was optimized not for performance but for readability.

OnGres took this, ported it to Java, and built their benchmark on top of it. This led to unnecessary uses of the $lookup (JOIN) aggregation stage and other relational patterns that are known to hurt MongoDB performance, simply because MongoDB is not a relational database. The power and flexibility of the document model mean that MongoDB developers would not model the data as separate tables in the first place, and so would not be subject to the performance limitations OnGres reported.
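The modeling difference can be sketched in plain Python; the order/line-item schema below is invented for illustration, not the benchmark's actual data. A relational-style port keeps the two entity types separate and joins them per query (the client-side equivalent of a $lookup), while an idiomatic document model embeds the line items so a single lookup returns the whole order.

```python
# Relational-style port: two separate "collections", joined at query time.
orders = [{"_id": i, "customer": f"c{i}"} for i in range(3)]
items = [{"order_id": i, "sku": f"sku-{i}-{j}"} for i in range(3) for j in range(2)]

def order_with_items_join(order_id):
    """Fetch an order, then scan the items collection to attach its line items."""
    order = dict(next(o for o in orders if o["_id"] == order_id))
    order["items"] = [it for it in items if it["order_id"] == order_id]  # the "$lookup"
    return order

# Document-style model: line items embedded, one document holds the whole order.
orders_embedded = [
    {"_id": i, "customer": f"c{i}",
     "items": [{"sku": f"sku-{i}-{j}"} for j in range(2)]}
    for i in range(3)
]

def order_with_items_embedded(order_id):
    """One lookup returns the order and its line items together."""
    return next(o for o in orders_embedded if o["_id"] == order_id)
```

Both functions return the same logical answer, but the join version pays an extra pass over a second collection on every query, which is exactly the cost the ported benchmark imposed on MongoDB.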

Indexing is Essential

Now we come to the OLTP benchmark. There should be parity between the indexes created on each database under test; indexes are what drive performance in a database. The original code the OLTP benchmark was built on had no indexes, as it was not optimized. OnGres added indexes of their own, but of the four tables, only one was indexed on both MongoDB and PostgreSQL.

On MongoDB, some collections had no indexes at all, while on PostgreSQL a range of additional indexes was added to optimize joins. Without effective indexes, any database has to scan each table or collection record by record, which massively degrades performance. It was another case of heavily asymmetrical tuning by OnGres.

Defective Benchmarking

The OLAP benchmark ran only four queries over JSON data and it apparently showed PostgreSQL being faster than MongoDB. Although there were - this time - indexes created on both databases, the queries that were run on MongoDB did not use those indexes.

With the addition of a simple hint to direct the query planner to use the indexes, the MongoDB queries ran an order of magnitude faster than PostgreSQL. MongoDB also recommends the use of compound indexes, something the PostgreSQL documentation argues against. For MongoDB, adding some compound indexes got one query to run 98% faster than PostgreSQL.
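A compound index keys on several fields at once, so a query filtering on both fields resolves with a single lookup instead of scanning or intersecting. The sketch below is again a dict-based toy with invented field names, not a real B-tree; it also shows how an index answers a query for an absent value instantly with an empty result.

```python
records = [
    {"_id": i, "region": ["eu", "us"][i % 2], "year": 2015 + ((i // 2) % 4)}
    for i in range(1_000)
]

def build_compound_index(recs, fields):
    """Compound index: tuple of the indexed field values -> matching _ids."""
    idx = {}
    for r in recs:
        key = tuple(r.get(f) for f in fields)
        idx.setdefault(key, []).append(r["_id"])
    return idx

idx = build_compound_index(records, ("region", "year"))
hits = idx.get(("eu", 2016), [])     # both predicates answered by one lookup
missing = idx.get(("eu", 1999), [])  # value absent from the data: instant empty answer
print(len(hits), len(missing))
```

That second lookup is the interesting one: an indexed database can report "no matches" without touching a single record, which matters for the next query we examined.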

And then there was the query that made no sense and was actually different between MongoDB and PostgreSQL. The team realized this when further index optimization had Query D returning in 20 milliseconds, rather than the 2 hours, 23 minutes, and 44 seconds reported by OnGres or the 41 minutes we report. It turned out that, on top of the other errors, the field being queried in Query D didn't exist in the database records. Once we added a compound index covering that field, both MongoDB and PostgreSQL could answer instantly with "there's nothing here to search".

TPC-C: A Recognized Benchmark

After we built transactions for MongoDB, our own Asya Kamsky adapted TPC-C to provide a performance baseline. Unlike OnGres’ work, Asya showed how following MongoDB best practices leads to high performance on a more realistic transactional workload. You can watch Asya’s talk from MongoDB World, or look for the upcoming paper she’s presenting at VLDB in August.

In Conclusion

At MongoDB we love being tested by experts. It makes us do better and that pays off for our users and customers. Unfortunately, we aren't always tested by experts. Sometimes we're tested by people who don't know MongoDB, and worse still, people who apparently don't care that they don't know MongoDB. We hate having to work through badly built, badly run benchmarks. It burns up time our engineers could spend making MongoDB even better, but it has to be done to counteract the confusion that such inexpertly executed benchmarks generate.

17 Jul: Heading on opening graph changed to clarify what the percentages refer to.