MongoDB’s JavaScript Fuzzer: Harnessing the Havoc (2/2)
Fuzz testing is a method for subjecting a codebase to a tide of hostile input to supplement the test cases engineers create on their own. In part one of this pair, we looked at the hybrid nature of our fuzzer -- how it combines “smart” and “dumb” fuzzing to produce input random enough to provoke bugs, but structured enough to pass input validation and test meaningful codepaths. To wrap up, I’ll discuss how we isolate signal from the noise a fuzzer intrinsically produces, and the tooling that augments the root cause analyses we do when the fuzzer finds an issue.
An unbridled fuzzer creates too much noise
Fuzz testing is a game of random numbers. That randomness makes the fuzzer powerful... too powerful. Without some careful harnessing, it would just blow itself up all the time by creating errors within the testing code itself. Take the following block of code, which is something you would see in one of MongoDB's JavaScript tests:
while (coll.count() < 654321)
    assert(coll.update({a: 1}, {$set: {...}}))
This code does a large number of updates to a document stored in MongoDB. If we were to put it through the fuzzer, a possible test-case that the fuzzer could produce is this:
while (true)
    assert(coll.update({}, {$set: {"a.654321": 1}}))
The new code now tests something completely different. It tries to set the 654321st element of an array stored in all documents of some MongoDB collection.
Now, this is an interesting test-case. Using the $set operator with such a large array index may not be something we thought of testing explicitly, and it could trigger a bug (in fact it does). But the interaction between the fuzzed true condition and the residual while loop is going to hang the test! Unless, that is, the assert call in the while loop fails, which could happen if the line defining coll in the original test (not shown here) is mutated or deleted by the fuzzer, leaving coll undefined. If the assert call fails, it would be caught by the Mongo shell and cause it to terminate.
But neither the hang nor the assertion failure is caused by a bug in MongoDB. They are just byproducts of a randomly generated test-case, and they represent the two classes of noise we have to filter out of our fuzz testing: branch logic and assertion failures.
Branch logic
To guard against accidental hangs, our fuzzer simply takes out all branching logic via AST manipulation. In addition to while loops, we remove do/while, for, for/in, and for/of loops, along with try/catch, break, and continue statements. These language constructs are defined in a static list.
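To make that concrete, here is a minimal sketch of such a pass written against the esprima, estraverse, and escodegen npm packages. The libraries and the exact rewrite rules are assumptions for illustration, not necessarily the fuzzer's actual tooling.

// A minimal sketch of stripping branching constructs from a test.
// Assumes the esprima, estraverse, and escodegen npm packages.
const esprima = require('esprima');
const estraverse = require('estraverse');
const escodegen = require('escodegen');

// The static list of loop constructs we never want in a generated test.
const LOOPS = new Set([
    'WhileStatement', 'DoWhileStatement', 'ForStatement',
    'ForInStatement', 'ForOfStatement',
]);

function stripBranching(source) {
    const ast = esprima.parseScript(source);

    // Pass 1: drop break and continue statements outright, so they can't
    // be left stranded once their enclosing loops are removed.
    estraverse.replace(ast, {
        enter(node) {
            if (node.type === 'BreakStatement' || node.type === 'ContinueStatement') {
                return estraverse.VisitorOption.Remove;
            }
        },
    });

    // Pass 2: replace loops and try/catch with their bodies on the way back
    // up the tree, so nested constructs are unwrapped from the inside out.
    // The inner statements still run once, but can no longer hang the test
    // or swallow errors.
    estraverse.replace(ast, {
        leave(node) {
            if (LOOPS.has(node.type)) {
                return node.body;   // a loop body is itself a statement
            }
            if (node.type === 'TryStatement') {
                return node.block;  // keep the statements from the try block
            }
        },
    });

    return escodegen.generate(ast);
}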
Assertion failures
For the assertion failures, we wrap every single line of generated test code with a try/catch statement. All the logic will still be executed, but no client-side errors will propagate up and cause a failure.
After passing through this sanitizing phase, our earlier example now looks like this:
try {
    assert(coll.update({}, {$set: {"a.654321": 1}}))
} catch (e) {}
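The wrapping itself is another small AST pass. Here is a minimal, self-contained sketch of the idea, again assuming esprima and escodegen rather than the fuzzer's actual tooling:

// A minimal sketch: wrap each top-level statement of a fuzzed test in a
// try/catch so client-side errors cannot terminate the shell.
const esprima = require('esprima');
const escodegen = require('escodegen');

function wrapStatements(source) {
    const ast = esprima.parseScript(source);
    ast.body = ast.body.map((stmt) => ({
        type: 'TryStatement',
        block: { type: 'BlockStatement', body: [stmt] },
        handler: {
            type: 'CatchClause',
            param: { type: 'Identifier', name: 'e' },
            body: { type: 'BlockStatement', body: [] },
        },
        finalizer: null,
    }));
    return escodegen.generate(ast);
}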
So how does the fuzzer catch bugs?
Wrapping everything in a try/catch block keeps fuzzer-generated noise from overwhelming us with false positives, but it also prevents any bugs from surfacing through the client-side assertions our typical tests rely on! Instead, a fuzzer has to rely on other mechanisms to detect the errors it provokes.
Tools for generic errors
The first set of tools consists of ones we’re using anyway for finding segmentation faults, memory leaks, and undefined behavior. Even without a fuzz tester we would still be using these language runtime tools, such as LLVM’s address sanitizer and undefined behavior sanitizer, but they become far more useful when a fuzzer is bombarding the test target with all its random input.
These tools are good at catching generic coding errors, but they don’t validate that your program behaves the way end users expect. To catch issues with business logic, our fuzzer relies on assertions within the testing target that check for conditions the system shouldn’t be in.
Assertions within the system under test
Many applications make liberal use of asserts to guard against illegal conditions, and fuzz testing relies on those asserts to catch application logic errors. The fuzzer wreaks havoc in your codebase and assumes you've instrumented your application's components so that the havoc is noticed.
For example, when acting as a secondary in a MongoDB replica set, mongod has an assertion to halt if it fails to write an operation. If a primary node logs a write for its secondaries, they had better be able to perform the write as well, or we’ll wind up with serious data loss when failovers happen. Since these assertions are fatal errors, the testing framework immediately notices when fuzz tests trigger them.
A brief aside regarding the limitations of randomized testing
This is really the only way that assertions can be used to catch errors provoked by randomly generated tests. Assertions in the target program can be oblivious to the tests being run; indeed, they must hold true under all circumstances (including when the program is being run by a user). By contrast, assertions within tests must be specific to the test scenario, but we have already shown that fuzzer-generated tests, by their nature, must not include fatal assertions. So under truly random conditions, a fuzzer will trigger no tailored assertions. This is a limitation of all randomized testing techniques, and it is why any good testing framework must not rely solely on randomized testing.
Triaging a fuzzer failure
Tests that perform random code execution and rely on target-system assertions have some downsides: the tests have no predefined purpose, many of the operations within them might be innocuous noise, and the failures they produce are often convoluted. A failure observed at a particular line of the test might depend on state set up by previous operations, so parts of the codebase that may seem unrelated have to be examined and understood.
Thus fuzzer failures require triage to find the smallest set of operations that trigger the problem. This can take significant human intervention, as with this known issue where calling cursor.explain() with concurrent clients causes a segmentation fault. The test that provoked this issue used a dozen clients performing different operations concurrently, so besides understanding what state the operations in the test set up, log messages from all the client and server threads had to be inspected manually and correlated with each other.
All this work is typical of triaging a fuzzer test failure, so we built a set of features that help developers sift through the chaos. These are specific to testing a MongoDB cluster across the network using JavaScript, but can be used as inspiration for all fuzzing projects.
We're only interested in lines of code that send commands to a MongoDB server, so the first step is to isolate those. Using our trusty AST manipulator, we add a print statement after every line of fuzzer code to record the time it took to run. Lines that take a non-trivial amount of time to run have typically issued a command and communicated with the MongoDB server. With those timers in place, our fuzz tests look like this:
var $startTime = Date.now();
try {
    // a fuzzer generated line of code
} catch (e) {
}
var $endTime = Date.now();
print('Top-level statement 0 completed in', $endTime - $startTime, 'ms');

var $startTime = Date.now();
try {
    // a fuzzer generated line of code
} catch (e) {
}
var $endTime = Date.now();
print('Top-level statement 1 completed in', $endTime - $startTime, 'ms');

// etc.
When we get a failure, we find the last statement that completed successfully from the log messages, and we begin our triage at the next statement that actually ran a command.
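As a small illustration, a helper along these lines can pull that statement number out of a shell log. The log format matches the print statements shown above; the file path and helper name are hypothetical.

// A minimal sketch: scan a shell log for the timing lines inserted above
// and report the last top-level statement that finished.
const fs = require('fs');

function lastCompletedStatement(logPath) {
    let last = -1;
    for (const line of fs.readFileSync(logPath, 'utf8').split('\n')) {
        const match = line.match(/Top-level statement (\d+) completed in (\d+) ms/);
        if (match) {
            last = Math.max(last, Number(match[1]));
        }
    }
    // Triage starts at statement last + 1, the first one that never finished.
    return last;
}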
This technique is sufficient for identifying trivial bugs that can crash the server with one or two lines of test code. More complicated bugs require programmatic assistance to find exactly which lines of test code are causing the problem. We bisect our way towards that with a breadth-first binary search over each fuzzer-generated file: our script recursively generates new tests containing each half of the failing code until any further removal no longer causes the test to fail.
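A stripped-down version of that search might look like the following sketch, assuming a hypothetical runTest(statements) helper that writes the statements to a file, runs it against a fresh cluster, and reports whether the failure reproduces.

// A minimal sketch of the halving search over a failing test's statements.
// runTest() is a hypothetical helper; the real script also handles flaky
// reproductions and keeps narrowing within each half.
function minimize(statements) {
    if (statements.length <= 1) {
        return statements;
    }
    const mid = Math.floor(statements.length / 2);
    const firstHalf = statements.slice(0, mid);
    const secondHalf = statements.slice(mid);

    // If either half reproduces the failure on its own, recurse into it.
    if (runTest(firstHalf)) {
        return minimize(firstHalf);
    }
    if (runTest(secondHalf)) {
        return minimize(secondHalf);
    }
    // The failure needs statements from both halves; stop shrinking here.
    return statements;
}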
The binary search script is not a cure-all, though. Some bugs do not reproduce consistently, or cause hangs, and they require a different set of tools. The particular tools will depend entirely on your product, but one simple way to identify hangs is to use a timer. We record the runtime of a test suite, and if a run takes an order of magnitude longer than the average runtime, we assume it has hung, attach a debugger, and generate a core dump.
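For example, a simple watchdog in the test runner could look like this sketch, where historicalAverageMs and dumpCore() are hypothetical stand-ins for whatever timing data and debugger tooling you already have.

// A minimal sketch of a hang watchdog around a test suite process.
const { spawn } = require('child_process');

function runWithWatchdog(cmd, args, historicalAverageMs) {
    const child = spawn(cmd, args, { stdio: 'inherit' });

    // Treat anything an order of magnitude beyond the average as a hang.
    const timer = setTimeout(() => {
        console.error('Suspected hang after ' + 10 * historicalAverageMs + ' ms');
        dumpCore(child.pid);    // hypothetical: attach a debugger, dump core
        child.kill('SIGKILL');
    }, 10 * historicalAverageMs);

    child.on('exit', () => clearTimeout(timer));
}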
Through the use of timers, print statements, and the binary search script, we're able to triage the majority of our failures quickly and correctly. There's no panacea for debugging... every problem is new, and most require a bit of trial and error to get right. We are continuously investing in this area to speed up and simplify failure isolation.
Running the fuzzer in our continuous integration system
Fuzz testing is traditionally done in dedicated clusters that run periodically on selected commits, but we decided to include it as a test suite in our continuous integration framework, Evergreen. This saved us the effort of building out a new automated testing environment and spared us from dedicating resources to figuring out which commit introduced a bug.
When a fuzzer is invoked periodically, finding the offending commit requires using a tool like git-bisect. With our approach of a mutational fuzzer that runs in a CI framework, we always include newly committed tests in the corpus. Every time the fuzzer runs, we pick 150 sets of a few dozen files each from the corpus at random, and run each set through the fuzzer to generate 150 fuzzed files. Each set of corpus files always includes new logic added to the codebase, which means the fuzzed tests are likely testing new code as well. This is a simple and elegant way for the fuzzer to "understand" changes to the codebase without the need for significant work to parse source files or read code coverage data.
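In rough JavaScript, the selection step amounts to something like the sketch below, where corpusFiles and newlyCommittedFiles are hypothetical inputs listing every JavaScript test in the repository and the tests touched since the last run.

// A minimal sketch of corpus selection for one fuzzer invocation.
function pickCorpusSets(corpusFiles, newlyCommittedFiles,
                        numSets = 150, filesPerSet = 30) {
    const sets = [];
    for (let i = 0; i < numSets; i++) {
        // Fisher-Yates shuffle of a copy, then take the first filesPerSet.
        const pool = corpusFiles.slice();
        for (let j = pool.length - 1; j > 0; j--) {
            const k = Math.floor(Math.random() * (j + 1));
            [pool[j], pool[k]] = [pool[k], pool[j]];
        }
        const sample = pool.slice(0, filesPerSet);
        // Always mix in the newly committed tests so each fuzzed file is
        // likely to exercise the latest changes to the codebase.
        sets.push([...new Set([...newlyCommittedFiles, ...sample])]);
    }
    return sets;
}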
When a fuzz test causes a failure, the downstream effect is the same as any other kind of test failure, only with the extra requirement of triage.
The fuzzer: a tester's best friend
Overall, the fuzzer has turned out to be one of the most rewarding tools in our test infrastructure. Building off our existing suite of JavaScript tests, we were able to significantly increase our coverage with relatively little effort. Getting everything right takes time, but to get a barebones system started, all you need is a set of existing tests as the corpus, a syntax-tree parser for the language of your choice, and a way to add the framework to a CI system. The bottom line is that no matter how much effort is put into testing a feature, there will inevitably be that one edge case that wasn’t handled. In those face-palm moments, the fuzzer is there for you.
November 1, 2016