MongoDB’s JavaScript Fuzzer: Creating Chaos (1/2)
As MongoDB becomes more feature-rich and complex with time, our need for more sophisticated bug-finding methods grows as well. We recently added a homegrown JavaScript fuzzer to our toolkit, and it is now our most prolific bug-finding tool, responsible for finding almost 200 bugs over the course of two release cycles. These bugs span a range of MongoDB components from sharding to the storage engine, with symptoms ranging from deadlocks to data inconsistency. We run the fuzzer as part of our continuous integration system, Evergreen, where it frequently catches bugs in newly committed code.
In part one of two, we examine how our fuzzer hybridizes the two main types of fuzzing to achieve greater coverage than either method alone could accomplish. Part two will focus on the pragmatics of running the fuzzer in a production setting and distilling a root cause from the complex output fuzz tests often produce.
What's a fuzzer?
Fuzzing, or fuzz testing, is a technique of generating randomized, unexpected, and invalid inputs to a program to trigger untested code paths. Fuzzing was originally developed in the 1980s and has since proven effective at ensuring the stability of a wide range of systems, from filesystems to distributed clusters to browsers. As people attempt to make fuzzing more effective, two philosophies have emerged: smart and dumb fuzzing. And as the state of the art evolves, the techniques used to implement fuzzers are being partitioned into categories, chief among them “generational” and “mutational.” In many popular fuzzing tools, smart fuzzing corresponds to generational techniques, and dumb fuzzing to mutational techniques, but as we will see, this is not an intrinsic relationship. Indeed, in our case, the situation is precisely reversed.
Smart fuzzing
A smart fuzzer is one that has a good understanding of the valid input surface of the program being tested. With this understanding, a smart fuzzer can avoid getting hung up on input validation and focus on testing a program’s behavior. Testing that a program properly validates its input is important, but isn’t the goal of fuzz testing.
Many fuzzers rely on an explicit grammar to generate tests, and it is that grammar that makes those tests smart. But our command language is young, and we did not want to delay our fuzzer’s delivery by taking the time to distill a formal grammar. Instead, we borrow our knowledge of the MongoDB command grammar from our corpus of JavaScript integration tests, mutating them randomly to create novel test-cases. Thus, our mutational strategy results in a smart fuzzer.
This corpus of JavaScript integration tests has been a mainstay of our testing for many years. Evergreen feeds each test file to a mongo shell, which executes the commands within those files against MongoDB servers, shard routers, and other components we want to test. When the fuzzer runs, it takes in a random subset of these JS tests and converts them to abstract syntax trees (ASTs) of the form understood by JavaScript interpreters. It then wreaks (controlled) havoc on the trees by selectively replacing nodes, shuffling them around, and replacing their values. This way we generate commands with parameters that wouldn't be encountered during normal testing, but preserve the overall structure of valid JavaScript objects.
For example, this code:
db.coll.find({x:1})

...which finds a document in collection coll with a field x having the value 1, becomes:
[Image: the abstract syntax tree for db.coll.find({x:1})]
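In code terms, a minimal sketch of inspecting that tree with acorn.js (the parser we mention below) looks like this; the node shape is standard ESTree, trimmed down here to the fields we care about:

const acorn = require("acorn");

const ast = acorn.parse("db.coll.find({x:1})", { ecmaVersion: "latest" });

// The call's single argument is an ObjectExpression with one property,
// whose key is the Identifier `x` and whose value is the Literal 1.
const objExpr = ast.body[0].expression.arguments[0];
console.log(objExpr.type);                      // "ObjectExpression"
console.log(objExpr.properties[0].key.name);    // "x"
console.log(objExpr.properties[0].value.value); // 1
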
To begin fuzzing that AST, the fuzzer first traverses the tree to mark nodes that should be replaced. In this case, let's assume it has decided to replace the value of the ObjectExpression, a 1. This node is then replaced with a PLACEHOLDER node, as follows:
[Image: the same AST with the ObjectExpression's value replaced by a PLACEHOLDER node]
As the fuzzer traverses the tree, it also picks up values that it thinks are interesting, which are usually primitive values like strings and numbers. These values are harvested and used to construct the final value of the placeholder nodes.
In this example, the placeholder is replaced with another ObjectExpression containing the key and value that it harvested elsewhere from the corpus:
[Image: the AST with the placeholder replaced by a new ObjectExpression containing a harvested key and value]
When this tree is converted into code, it becomes a new test case:
db.coll.find({x:{$regex:'a\0b'}})

...which finds a document with a field x that matches the regular expression a\0b.
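Mechanically, that substitution can be pictured in a few lines of JavaScript. This is a toy sketch rather than the fuzzer's actual code: it parses with acorn, finds targets with acorn-walk, hardcodes the replacement ObjectExpression instead of assembling it from harvested corpus values, and regenerates source with escodegen.

const acorn = require("acorn");
const walk = require("acorn-walk");
const escodegen = require("escodegen");

const ast = acorn.parse("db.coll.find({x: 1})", { ecmaVersion: "latest" });

// Collect candidate nodes first and mutate afterwards, so the walk never
// revisits our own replacements.
const targets = [];
walk.simple(ast, {
  ObjectExpression(node) { targets.push(node); },
});

// Stand-in for the PLACEHOLDER step: swap the value of the first property
// for an ObjectExpression built from values harvested elsewhere.
targets[0].properties[0].value = {
  type: "ObjectExpression",
  properties: [{
    type: "Property",
    kind: "init",
    computed: false,
    key: { type: "Literal", value: "$regex" },
    value: { type: "Literal", value: "a\0b" },
  }],
};

console.log(escodegen.generate(ast)); // prints the mutated test source
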
A test very much like this one was actually run by our fuzzer, and it turned out that MongoDB did not properly handle regex strings containing null bytes, so this test case caused the server to crash.
Lessons from our experience
AST > regex
Using an abstract syntax tree is a great strategy for fuzz testing. Previously, we had tried using a regex-based approach. This involved stringifying the tests and finding specific tokens to replace or shuffle. But regexes are opaque and fragile -- they are a nightmare to maintain, and it's very easy to introduce subtle mistakes that make the mutations less effective. Syntax trees, on the other hand, actually model the test code as objects with named properties, so the fuzzer code for performing substitutions clearly embodies the logic for doing so. Thus, ASTs are far easier to reason about and less error-prone.
There are open source libraries that turn code into ASTs for most languages; we used acorn.js.
Heuristic > random
When implementing the mutational aspect of a fuzzer, noting what types of mutations are provoking the most bugs can yield benefits. Our initial implementation randomly chose which nodes to replace, but we observed that modified ObjectExpressions contributed to finding more new bugs, so we tweaked the probabilities to make more mutations happen on them.
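Concretely, that bias can be as simple as a per-node-type probability table consulted while traversing the tree. A sketch, with made-up weights rather than our production values:

// Bias mutation toward node types that have historically produced bugs.
const MUTATION_WEIGHT = {
  ObjectExpression: 0.5, // historically productive, so favored
  Literal: 0.2,
  default: 0.05,
};

function shouldMutate(node) {
  const weight = MUTATION_WEIGHT[node.type] ?? MUTATION_WEIGHT.default;
  return Math.random() < weight;
}
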
Dumb fuzzing
Smart, AST-based mutation gives our fuzzer a familiarity with the input format, but it also guarantees blind spots, because the corpus is a finite list harvested from human-written tests. The school of dumb fuzzing proposes an answer to this shortcoming, advocating fuzzers that generate input randomly, without regard to validity, thereby covering areas the developer may have overlooked.
This is a bit of a balancing act. With no knowledge of the target program at all, the best a fuzzer could do would be to feed in a random stream of 0s and 1s. That would generally do nothing but trigger input validation code at some intervening layer before ever reaching the logic under test, and triggering only input validation code is the hallmark of a bad fuzzer.
To put some dumb in our fuzzer without resorting to random binary, we generate values from a seed list. Since our test inputs are JavaScript objects comprised of MongoDB commands and primitive values, our seed list is composed of MongoDB commands and primitive types that we know from experience are edge cases. We keep these seed values in a file, and generate JavaScript objects using them as the keys and values. Here's a little excerpt:
var defaultTokens = [
    Infinity, -Infinity,
    NaN, -NaN,
    'a\0b', 'A\0B',
    '\u0000', '\u0000\u0000',
    '$',
    '$all',
    '$and',
    '$bitsAllClear',
    // etc.
];

These strings are drawn from our experience with testing MongoDB, but as far as the fuzzer is concerned, they are just strings, and it composes the test input from them without regard to what would be valid. Thus, our generational method produces dumbness.
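Generating a candidate input from that seed list can be as simple as the following sketch (a simplified illustration, not our actual generator), which draws on the defaultTokens list excerpted above:

// Random seeds become both keys and values, with no regard for whether the
// result is a meaningful MongoDB command.
function pick(seeds) {
  return seeds[Math.floor(Math.random() * seeds.length)];
}

function generateObject(seeds, numFields) {
  const obj = {};
  for (let i = 0; i < numFields; i++) {
    obj[pick(seeds)] = pick(seeds);
  }
  return obj;
}

// generateObject(defaultTokens, 3) might yield something like
// { '$all': NaN, 'a\u0000b': '$', '$and': -Infinity }
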
It doesn't work like this
We’re trying to balance coverage with validation avoidance. To generate test input that has a chance of passing input validation, we could start with a template of a valid JavaScript object. The letters in this template represent placeholders:
{a:X, b:Y, c:Z}

We could then replace the capital letters with seed primitive values:
{a: 4294967296, b: '\0ab', c: NumberDecimal(-NaN)}

...and replace the lower case letters with seed MongoDB command parameters:
{create: 4294967296, $add: '\0ab', $max: NumberDecimal(-NaN)}

But this isn't a valid MongoDB command. Despite filling in a well-formatted template from a list of valid MongoDB primitives, this generated input still only triggers the validation code.
Hybrid fuzzing
Mutational fuzzing leaves blind spots, and generational fuzzing on its own won't test interesting logic at all. But when combined, both techniques become much more powerful. This is how our fuzzer actually works.
As it mutates existing tests, every once in a while, instead of pulling a replacement from the corpus, it generates an AST node from its list of seeds. This generational substitution reduces blind spots by producing a value not present in the corpus, while the mutational basis means the resulting command retains the structure of valid input, making it likely to pass validation. Only after it is deep in the stack does the program realize that something has gone horribly wrong. Mission accomplished.
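In simplified JavaScript, that occasional generational substitution looks roughly like this (the 10% chance is illustrative, not our tuned value):

// Fill a PLACEHOLDER node: usually reuse something harvested from the corpus
// (mutational), occasionally build a node from the seed list (generational).
function fillPlaceholder(harvestedNodes, seedTokens) {
  if (Math.random() < 0.1) {
    const seed = seedTokens[Math.floor(Math.random() * seedTokens.length)];
    return { type: "Literal", value: seed };
  }
  return harvestedNodes[Math.floor(Math.random() * harvestedNodes.length)];
}
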
Here’s an example of hybrid fuzzing in action, using a simplified version of a test that actually exposed a bug.
The fuzzer starts with the following corpus...
db.test.insert({x:1});
db.test.update({some: "object"}, ...);

...the first line of which becomes the following AST:
[Image: the abstract syntax tree for db.test.insert({x:1})]
The ObjectExpression is converted into a PLACEHOLDER node, in the same manner as mutational fuzzing.
Then the fuzzer decides that instead of replacing the placeholder with a value from elsewhere in the corpus, it will replace it with a generated object: in this case, a NewExpression with a large NumberLong as the argument.
[Image: the AST with the placeholder replaced by a NewExpression containing a large NumberLong]
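As an ESTree object, that generated replacement looks roughly like this (a sketch; the fuzzer's internal bookkeeping is omitted):

const generatedNode = {
  type: "NewExpression",
  callee: { type: "Identifier", name: "NumberLong" },
  arguments: [{ type: "Literal", value: "9223372036854775808" }],
};
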
This yields the following test:
db.test.insert({a: new NumberLong("9223372036854775808")});
db.test.update({}, {$inc: {a: 13.0}});

The result is that we insert a large 64-bit integer into MongoDB and then later update that value. When the actual test ran, it turned out that the new value would still be a large number, but not the correct one. The bug was that MongoDB stored the integer as a double internally, which only has 53 bits of precision. The fuzzer was able to find this by generating a large NumberLong that did not appear in any existing test.
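The precision loss itself is easy to reproduce in plain JavaScript, whose numbers are the same 64-bit doubles; this illustrates the arithmetic, not MongoDB's internal code path:

// With 53 bits of mantissa, adjacent doubles at this magnitude are roughly
// a couple of thousand apart, so a small increment is simply rounded away.
const asDouble = 9223372036854775808; // 2^63, silently coerced to a double
console.log(asDouble + 13 === asDouble); // true: the increment is lost
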
The combination of mutational fuzzing with the edge cases we seed into the generational fuzzer is an order of magnitude more powerful than writing tests for these edge cases explicitly. In fact, a significant portion of the bugs the fuzzer found were triggered by values generated in this way.
Up next: Detection and Triage
These examples use cleaned-up, minimal fuzz tests with all the complexity removed. In actuality, fuzz tests are hundreds of lines long and contain plenty of branching logic, and all that randomization produces uninteresting errors in the test code itself. Furthermore, because it tests random components of the target application, a fuzzer can't tell you what failed even when the error is legitimate. For a fuzzer to be useful, it needs a way to filter out all the noise and to help a tester isolate a root cause. That's what we're going to talk about in part two.