MongoDB Engineering Reports

Posts on engineering efforts, achievements, and culture at MongoDB.

Getting Storage Engines Ready for Fast Storage Devices

Over the past two decades, the performance of storage hardware has increased by two orders of magnitude: first with the introduction of solid state drives (SSDs), then with the transition from SATA to PCIe, and most recently with innovation in non-volatile memory technology and its manufacturing process [1, 7]. In April 2019, Intel released the first commercial Storage Class Memory (SCM). Its Optane DC Persistent Memory, built with 3D XPoint technology, sits on the memory bus and further reduces I/O latency [2].

While device access used to dominate I/O latency, the cost of navigating the software stack of a storage system becomes more prominent as device access times shrink. This has produced a flurry of academic research and changes to commercially used operating systems (OS) and file systems. Despite these efforts, mainstream system software is failing to keep up with rapidly evolving hardware. Studies [4, 5, 6] have shown that file system and other OS overhead still dominates the cost of I/O on very fast storage devices such as SCM. In response to these challenges, academics proposed a new user-level file system, SplitFS [6], that substantially reduces these overheads. Unfortunately, adopting a user-level file system is not a viable option for many commercial products. Apart from concerns about correctness, stability, and maintenance, adopting SplitFS would restrict portability: it runs only on Linux and only on top of the ext4-DAX file system.

Fortunately, there is something that storage engines that care about I/O performance can do in software. Within MongoDB's storage engine, WiredTiger, we were able to essentially remove the brakes that the file system applied to our performance, without sacrificing the convenience it provides or losing portability. Our changes rely on using memory-mapped files for I/O and on batching expensive file system operations. They yielded performance improvements of up to 63% for 19 out of 65 benchmarks on mainstream SSDs.

Streamlining I/O in WiredTiger

Our changes to WiredTiger were inspired by a study from UCSD [4], in which the authors demonstrated that by using memory-mapped files for I/O and by pre-allocating some extra space whenever a file needed to grow, they could achieve almost the same performance as if the file system were completely absent.

Memory-mapped files

Memory-mapped files work as follows. The application makes an mmap system call, requesting that the operating system "map" a chunk of its virtual address space to a same-sized chunk of the file of its choice (step 1 in Figure 1). When the application accesses memory in that portion of the virtual address space for the first time (e.g., virtual page 0xABCD in Figure 1), the following events take place:

1. Since this virtual address has not been accessed before, the hardware generates a trap and transfers control to the operating system.
2. The operating system determines that this is a valid virtual address and asks the file system to read the corresponding page-sized part of the file into its buffer cache.
3. The operating system creates a page table entry mapping the user virtual page to the physical page in the buffer cache where that part of the file resides (e.g., physical page 0xFEDC in Figure 1, step 2).
4. Finally, the virtual-to-physical translation is inserted into the Translation Lookaside Buffer (TLB, a hardware cache for these translations), and the application proceeds with the data access.

[Figure 1: Memory-mapped files: (1) a virtual memory area is established for the mapped file, (2) the virtual-to-physical address translation is placed into the page table, (3) the translation is cached in the Translation Lookaside Buffer (TLB).]
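To make the mechanism concrete, here is a minimal sketch of memory-mapped file access using Python's standard mmap module (WiredTiger itself does this in C through its POSIX layer; the file name here is made up for illustration, and the POSIX-only prot argument assumes a Unix system):

    import mmap
    import os

    # Open a file and map it into our virtual address space (step 1 above).
    fd = os.open("example.db", os.O_RDONLY)
    size = os.fstat(fd).st_size
    buf = mmap.mmap(fd, size, prot=mmap.PROT_READ)

    # The first access to a page triggers a fault; the OS reads that part of
    # the file into its buffer cache and installs a page table entry.
    header = buf[0:4096]

    # Later accesses to the same page are plain memory loads, with no system call.
    header_again = buf[0:4096]

    buf.close()
    os.close(fd)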
Subsequent accesses to the same virtual page may or may not require operating system involvement, depending on the circumstances:

1. If the physical page containing the file data is still in the buffer cache and the page table entry is in the TLB, no operating system involvement is necessary, and the data is accessed using regular load or store instructions.
2. If the page containing the file data is still in the buffer cache but the TLB entry was evicted, the hardware transitions into kernel mode, walks the page table to find the entry (assuming the x86 architecture), installs it into the TLB, and then lets the software access the data using regular load or store instructions.
3. If the page containing the file data is not in the buffer cache, the hardware traps into the OS, which asks the file system to fetch the page, sets up the page table entry, and proceeds as in scenario 2.

In contrast, system calls cross the user/kernel boundary every time we access a file. Even though memory-mapped I/O also crosses the user/kernel boundary in the second and third scenarios described above, the path it takes through the system stack is more efficient than the one taken by system calls. Dispatching and returning from a system call adds CPU overhead that memory-mapped I/O does not have [8]. Furthermore, if the data is copied from the memory-mapped file area to another application buffer, the copy typically uses a highly optimized AVX-based implementation of memcpy. When data is copied from kernel space into user space via a system call, the kernel has to use a less efficient implementation, because the kernel does not use AVX registers [8].

Pre-allocating file space

Memory-mapped files allow us to substantially reduce the involvement of the OS and the file system when accessing a fixed-sized file. If the file grows, however, we do need to involve the file system. The file system updates the file metadata to indicate the new size and ensures that these updates survive crashes. Ensuring crash consistency is especially expensive, because each journal record must be persisted to storage to make sure it is not lost in the event of a crash. If we grow a file piecemeal, we incur that overhead quite often. That is why the authors of SplitFS [6] and the authors of the UCSD study [4] both pre-allocate a large chunk of the file when an application extends it. In essence, this strategy batches file system operations to reduce their overhead.
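The sketch below illustrates the pre-allocation idea in Python; the chunk size and helper function are made up for illustration, and os.posix_fallocate is the Python binding for the same fallocate mechanism discussed here:

    import os

    CHUNK = 64 * 1024 * 1024  # pre-allocate 64 MB at a time (illustrative size)

    def ensure_capacity(fd, needed, allocated):
        """Grow the file in large chunks, so the file system updates its
        metadata (and persists journal records) once per chunk rather than
        once per small append."""
        if needed <= allocated:
            return allocated
        new_size = ((needed // CHUNK) + 1) * CHUNK
        os.posix_fallocate(fd, 0, new_size)  # one expensive operation, amortized
        return new_size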
Our Implementation

The team applied these ideas to WiredTiger in two phases. First, we implemented a design in which the size of the mapped file area never changes. Then, after making sure that this simple design worked and yielded performance improvements, we added the ability to remap files as they grow or shrink. That feature required efficient inter-thread synchronization and was the trickiest part of the whole design; we highlight it later in this section. Our changes have been in testing in the develop branch of WiredTiger as of January 2020. As of the time of writing, these changes are POSIX-only; a Windows port is planned for the future.

Assuming a fixed-size mapped file area

Implementing this part required few code changes. WiredTiger provides wrappers for all file-related operations, so we only needed to modify those wrappers. Upon opening a file, we issue the mmap system call to also map it into the virtual address space. Subsequent calls to wrappers that read or write the file copy the desired part of the file from the mapped area into the supplied buffer.

WiredTiger allows three ways to grow or shrink a file: the file can grow explicitly via an fallocate system call (or its equivalent), it can grow implicitly if the engine writes beyond its end, or it can shrink via the truncate system call. In our preliminary design we disallowed explicitly growing or shrinking the file, which did not affect the correctness of the engine. If the engine writes to the file beyond the mapped area, our wrapper functions simply fall back to system calls; if the engine then reads a part of the file that has not been mapped, we also resort to a system call. While this implementation was decent as an early prototype, it was too limiting for a production system.

Resizing the mapped file area

The trickiest part of this feature is synchronization. Imagine a scenario involving two threads, one reading the file and the other truncating it. Prior to reading, the first thread checks that the offset it reads from is within the mapped buffer's boundaries and, assuming it is, proceeds to copy the data from the mapped buffer. However, if the second thread intervenes just before the copy and truncates the file so that its new size is smaller than that offset, the first thread's attempt to copy the data results in a crash: the mapped buffer is now larger than the file after truncation, and attempting to copy data from the part of the buffer that extends beyond the end of the file generates a segmentation fault.

An obvious way to prevent this problem is to acquire a lock every time we access the file or change its size. Unfortunately, this would serialize I/O and could severely limit performance. Instead, we use a lock-free synchronization protocol inspired by read-copy-update (RCU) [9]. We refer to any thread that might change the size of the file as a writer: that is, any thread that writes beyond the end of the file, extends it via an fallocate system call, or truncates it. A reader is any thread that reads the file.

Our solution works as follows. A writer first performs the operation that changes the size of the file and then remaps the file into the virtual address space. During this time we want nobody else accessing the mapped buffer, neither readers nor writers. However, it is not necessary to prevent all I/O from occurring at this time; we can simply route I/O to system calls while the writer is manipulating the mapped buffer, since system calls are properly synchronized in the kernel with other file operations. To achieve these goals without locking, we rely on two variables:

- mmap_resizing: when a writer wants to indicate to others that it is about to exclusively manipulate the mapped buffer, it atomically sets this flag.
- mmap_use_count: a reader increments this counter prior to using the mapped buffer and decrements it when done, so this counter tells us whether anyone is currently using the buffer. The writer waits until the counter drops to zero before proceeding.
Before resizing the file and the mapped buffer, writers execute the function prepare_remap_resize_file; its pseudocode is shown below. Essentially, the writer efficiently waits until no one else is resizing the buffer, then sets the resizing flag to claim exclusive rights to the operation. Then it waits until all readers are done using the buffer.

    prepare_remap_resize_file:
    wait:
        /* Wait until no one else is resizing the file. */
        while (mmap_resizing != 0)
            spin_backoff(...);

        /* Atomically set the resizing flag; if this fails, retry. */
        result = cas(mmap_resizing, 1, ...);
        if (result) goto wait;

        /* Now that we have set the resizing flag, wait for all
         * readers to finish using the buffer. */
        while (mmap_use_count > 0)
            spin_backoff(...);

After executing prepare_remap_resize_file, the writer performs the file-resizing operation, unmaps the buffer, remaps it with the new size, and resets the resizing flag. The synchronization performed by readers is shown in the pseudocode of the function read_mmap:

    read_mmap:
        /* Atomically increment the reference counter,
         * so no one unmaps the buffer while we use it. */
        atomic_add(mmap_use_count, 1);

        /* If the buffer is being resized, use a system call
         * instead of the mapped buffer. */
        if (mmap_resizing) {
            atomic_decr(mmap_use_count, 1);
            read_syscall(...);
        } else {
            memcpy(dst_buffer, mapped_buffer, ...);
            atomic_decr(mmap_use_count, 1);
        }

As a side note, threads writing the file must perform both the reader synchronization, as in read_mmap, to see whether they can use the memory-mapped buffer for I/O, and the writer synchronization in the case where they are writing past the end of the file (hence extending its size). Please refer to the WiredTiger develop branch for the complete source code.

Batching file system operations

As we mentioned earlier, a crucial finding of the UCSD study that inspired our design [4] was the need to batch expensive file system operations by pre-allocating file space in large chunks. Our experiments with WiredTiger showed that it already uses this strategy to some extent. We ran experiments comparing two configurations: (1) a default configuration, in which WiredTiger uses the fallocate system call to grow files, and (2) a restricted configuration, in which WiredTiger is not allowed to use fallocate and thus resorts to growing files implicitly by writing past their end. We measured the number of file system invocations in both cases and found it at least an order of magnitude smaller in the default configuration than in the restricted one. This tells us that WiredTiger already batches file system operations. Investigating whether batching can be optimized for further performance gains is planned for the future.

Performance

To measure the impact of our changes, we compared the performance of the mmap branch and the develop branch on the WiredTiger benchmark suite WTPERF. WTPERF is a configurable benchmarking tool that can emulate various data layouts, schemas, and access patterns while supporting all kinds of database configurations. Out of 65 workloads, the mmap branch improved performance for 19. Performance of the remaining workloads either remained unchanged or showed insignificant changes (within two standard deviations of the average). Variance in the performance of two workloads (those that update a log-structured merge tree) increased by a few percent, but apart from these, we did not observe any downsides to using mmap.
The figures below show the performance improvement, in percent, of the mmap branch relative to develop for the 19 benchmarks where mmap made a difference. The experiments were run on a system with an Intel Xeon E5-2620 v4 processor (eight cores), 64GB of RAM, and an Intel Pro 6000p series 512GB SSD. We used default settings for all the benchmarks and ran each at least three times to ensure the results were statistically significant.

[Figure: All but 2 of the benchmarks where mmap made a difference show significant improvements.]

Overall, there are substantial performance improvements for these workloads, but there are a couple of interesting exceptions. For 500m-btree-50r50u and update-btree, some operations (e.g., updates or inserts) are a bit slower with mmap, while others (typically reads) are substantially faster. It appears that some operations benefit from mmap at the expense of others; we are still investigating why this is happening.

One of the variables that explains the improved performance with mmap is an increased rate of I/O. For example, for the 500m-btree-50r50u workload (which simulates a typical MongoDB load), the read I/O rate is about 30% higher with mmap than with system calls. This statistic does not explain everything: after all, read throughput for this workload is 63% better with mmap than with system calls. Most likely, the rest of the difference is due to the more efficient code paths of memory-mapped I/O (as opposed to going through system calls), as observed in earlier work [8]. Indeed, we typically observe higher CPU utilization when using mmap.

Conclusion

Throughput and latency of storage devices improve at a higher rate than CPU speed thanks to radical innovations in storage technology and in the placement of devices in the system. Faster storage devices reveal inefficiencies in the software stack. In our work we focused on the overhead of system calls and file system access and showed how it can be avoided by employing memory-mapped I/O. Our changes to the WiredTiger storage engine yielded up to a 63% improvement in read throughput. For more information on our implementation, take a look at the files os_fs.c and os_fallocate.c in the os_posix directory of the WiredTiger develop branch.

References

[1] List of Intel SSDs. https://en.wikipedia.org/wiki/List_of_Intel_SSDs
[2] Optane DC Persistent Memory. https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-dc-persistent-memory.html
[3] Linux Storage System Analysis for e.MMC with Command Queuing. https://www.micron.com/-/media/client/global/documents/products/white-paper/linux_storage_system_analysis_emmc_command_queuing.pdf?la=en
[4] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). http://cseweb.ucsd.edu/~juk146/papers/ASPLOS2019-APP.pdf
[5] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX Conference on File and Storage Technologies (FAST '16). https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf
[6] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19).
https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf
[7] SSD vs HDD. https://www.enterprisestorageforum.com/storage-hardware/ssd-vs-hdd.html
[8] Why mmap is faster than system calls. https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37
[9] Paul McKenney. What is RCU, fundamentally? https://lwn.net/Articles/262464/

March 16, 2020
Engineering Reports

Transpiling Between Any Programming Languages

[Figure: Input is converted via an ANTLR parse tree and code generation to output.]

MongoDB Compass, the UI for MongoDB, recently introduced a pair of features to make it easier for developers to work in their chosen language. Users can now export the queries and aggregations they build in the UI to their preferred language, and soon they will also be able to input them in their preferred language. Giving developers the flexibility to choose between multiple input languages and multiple output languages in Compass required us to build a custom solution in the form of a many-to-many transpiler. Most compilers are one-to-one; less commonly, they are one-to-many or many-to-one. Many-to-many transpilers are rare. To avoid starting from scratch, we leveraged the open source parsing tool ANTLR, which provided us with a set of compiler tools along with preexisting grammars for the languages we needed. We minimized the additional complexity by devising a set of class hierarchies that reduced the amount of work needed from O(n²) to O(n).

Motivation

MongoDB Compass is an application that provides a UI for the database and helps developers iteratively develop aggregations and queries. When building queries, the application currently requires input in a JavaScript-based query language called MongoDB Shell.

[Figure: Compass aggregation pipeline builder.]

To enable developers to use their preferred programming language when developing aggregation pipelines and queries, we wanted to add functionality in two parts. First, we wanted to allow developers familiar with the MongoDB Shell to export the queries they create into the language they need (Python, Java, etc.). Second, we wanted to allow developers to use their language of choice while building a query. To achieve both and give users maximum flexibility, our system needed to accept multiple input languages and generate multiple output languages in an efficient way.

[Figure: Compass Export to Language allows you to export a pipeline in the language of your choice.]

At the basis of these features is sophisticated compiler technology in the form of a transpiler: a source-to-source compiler that takes the source code of a program written in one programming language as its input and produces equivalent source code in another programming language. Since our transpiler currently supports extended JSON, also called BSON, we call it the BSON transpiler. While we currently support only a subset of each programming language, the transpiler is designed in a way that would allow us to extend support to the entire language syntax.
Design Approach

The Compass application is designed with an extensible plugin architecture, which allowed us to build the transpiler as a standalone plugin. To work with the Electron framework our application is based on, the plugin needed to be executable in JavaScript. We considered the many existing transpilers written in JavaScript, but our use case, any-language-to-any-language transformation with support for BSON, called for a custom solution. Compass queries always take the form of either an array of BSON documents (stages of the aggregation pipeline) or a single BSON document (for other queries) containing the MongoDB query language.

While this constraint reduces the scope of the problem for the BSON transpiler, the language subset we need to support is large and complex enough that we decided to treat the problem as if we were adding full-language support. The naive approach to building a source-to-source compiler supporting multiple languages requires a polynomial amount of effort, since the number of language combinations is the product of the number of input and output languages. We needed a sustainable design in which adding a new input or output language requires building only O(1) components per language, reducing the entire problem to O(n) for n languages. We achieved this by splitting the problem into independent input and output stages that are loosely coupled through a shared, intermediate, in-memory data structure: a parse tree. The input-language stage just needs to build the tree, and the output-language stage just needs to read it.

Most compilers have two primary stages: parsing and code generation. The parsing stage is responsible for turning the literal text of a program into a tree representing an abstraction of its meaning, and the code generation stage walks that tree and produces an output that can be executed, generally a binary of machine or virtual machine instructions. A key observation is that a source-to-source compiler can be seen as a specialized compiler whose code generation stage emits program text in another user-friendly language instead of machine code. The design of our transpiler stems from that concept.

Parsing

In order to process some source code, such as the string new NumberDecimal(5), a lexical analyzer, or lexer, takes the raw code and splits it into tokens (a process known as lexical analysis). A token is an object representing a block of text that corresponds to one of the primitive components of the language syntax: a number, a label, punctuation, an operator, and so on. In the parsing stage, these tokens are transformed into a tree structure that describes not only isolated pieces of the input code but also their relationship to each other. At this point the compiler is able to recognize language constructs such as variable declarations, statements, and expressions. The leaves of this tree are the tokens found by the lexical analysis; read from left to right, they appear in the same sequence as in the input text.

[Figure: Stages of compiler processing: input is transformed via lexical analysis into tokens, tokens are transformed via syntax analysis into an AST, which is used to generate the output code.]

We did not want to write our own parser and lexer, since doing so is incredibly time consuming even for a single language, and we have to support several. Luckily, there are many "parser-generator" tools that efficiently generate syntax trees from a set of rules, called a grammar. These tools take an input grammar, which is hierarchical and highly structured, parse an input string based on that grammar, and convert it to a tree structure. The tricky part of using parser-generators is the tedious and error-prone process of writing the grammars: writing a grammar from scratch requires detailed knowledge of the input language and all of its edge cases. If the transpiler needs to support many programming languages, writing grammars for each input language would be a huge task.
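As a small illustration of the lexing stage (in Python rather than the plugin's JavaScript), the standard tokenize module plays the role that a generated ANTLR lexer plays for each grammar:

    import io
    import tokenize

    # Split a tiny snippet into tokens; each token records its type and text.
    src = "NumberDecimal(5)"
    for tok in tokenize.generate_tokens(io.StringIO(src).readline):
        print(tokenize.tok_name[tok.type], repr(tok.string))
    # NAME 'NumberDecimal', OP '(', NUMBER '5', OP ')', ...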
[Figure: Source-to-source transformation with ANTLR.]

The burden of writing grammars is why we decided to use ANTLR, a powerful parser-generator that, most importantly, already has grammars for almost all programming languages of interest. ANTLR also has a JavaScript runtime, allowing us to use it in our Node.js project. We also considered LLVM-IR, a different set of compiler technologies that compiles to an intermediate high-level representation; that approach would then need a separate step to compile the intermediate representation into the target language. This is a common pattern for multi-platform compilers, like the Clang/LLVM project. Unfortunately, there are not many existing compilers that go from the intermediate representation back to user programming languages. We would have had to write those compilers ourselves, so ultimately using LLVM would not have saved us much effort.

The code snippet below illustrates the basic structure of a program that builds a parse tree for ECMAScript (JavaScript) input source code. It imports the auxiliary lexer and parser files and lets ANTLR pull characters from the input string, create a character stream, convert it to a token stream, and finally build a parse tree.

    // It takes only a few lines to go from a string input to a fully parsed traversable tree!
    const antlr4 = require('antlr4');
    const ECMAScriptLexer = require('./lib/antlr/ECMAScriptLexer.js');
    const ECMAScriptParser = require('./lib/antlr/ECMAScriptParser.js');

    const input = 'new NumberDecimal(5)';
    const chars = new antlr4.InputStream(input);
    const lexer = new ECMAScriptLexer.ECMAScriptLexer(chars);
    const tokens = new antlr4.CommonTokenStream(lexer);
    const parser = new ECMAScriptParser.ECMAScriptParser(tokens);
    const tree = parser.program();

The resulting parse tree inherits from the ANTLR-defined ParseTree class, giving it a uniform way to be traversed. Note that the parsing stage and the resulting parse tree are determined by the input language; they are completely independent of the stage where we generate the output language into which we seek to translate the source code. This independence is what reduces the number of components we need to write to cover our input and output languages from O(n²) to O(n).

Code Generation

Tree types

Using ANTLR for its library of pre-built grammars requires a slight compromise in our design. To understand why, it is necessary to understand the difference between two related terms that are sometimes used interchangeably: a parse tree and an abstract syntax tree (AST). Conceptually these trees are similar, because they both represent the syntax of a snippet of source code; the difference is the level of abstraction. An AST has been fully abstracted to the point that no information about the input tokens themselves remains. Because of this, ASTs representing the same instructions are indistinguishable, regardless of what language produced them. By contrast, a parse tree contains the information about the low-level input tokens, so different languages produce different parse trees, even if the programs do the same thing.

[Figure: Abstract syntax tree and parse tree comparison, given an input of new NumberDecimal(5).]

Ideally, our code generating stage would operate on an AST, not a parse tree, because having to account for language-specific parse trees introduces complexity we'd rather avoid. ANTLR4, however, only produces read-only parse trees. But the advantages of using ANTLR and its ready-made grammars are well worth that trade-off.
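To see how abstract an AST is, compare Python's own AST for a similar call (again Python, for illustration only): none of the concrete tokens, parentheses or commas, survive, only the abstract structure.

    import ast

    # Parse an expression and dump its abstract syntax tree.
    tree = ast.parse("NumberDecimal(5)", mode="eval")
    print(ast.dump(tree))
    # Expression(body=Call(func=Name(id='NumberDecimal', ctx=Load()),
    #                      args=[Constant(value=5)], keywords=[]))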
Visitors

[Figure: Parse tree traversal.]

Like most compilers, the BSON transpiler uses a visitor pattern to traverse parse trees. ANTLR not only builds a parse tree but also programmatically generates a skeleton visitor class. This visitor class contains one method for each type of node in the tree. All of these methods begin with visit and end with the name of the node they visit, e.g. visitFuncCall() or visitAdditiveExpression(). The node names are taken directly from the input-language grammar file, so each visitor class and its methods are tailored to the input-language grammar. In the ANTLR-generated visitor class, these methods do nothing except recurse on the child nodes. In order for our visitor to transpile code, we subclass the generated visitor class and override each visit method to define what to do with each type of node.

Since the BSON transpiler supports multiple input languages, and each language produces a different kind of parse tree, we need to create one custom visitor for each input language supported by Compass. However, as long as we avoid building a custom visitor for each combination of input and output language, we are still only building O(n) components. With this design, each visitor is responsible for traversing a single language's parse tree. The visitor calls functions as it visits each node and can return the original text of the node or transform it as needed. Starting from the root, the visitor calls its visit methods recursively, descending to the leaves in depth-first order. On the way down, the visitor decorates nodes with metadata, such as type information; on the way up, it returns the transpiled code.

Generators

In a brute-force solution, the visit* methods of the visitor would contain the code for generating the output-language text, and to generate multiple output languages we would have to specialize each method for the current output language. Overall, this approach would subclass every language-specific visitor class once for every output language or, worse yet, put a giant switch statement in each of the visit* methods with a case for each output language. Both of those options are brittle and require O(n²) development effort. We therefore chose to decouple the code for traversing the language-specific trees from the code for generating output. We accomplished this by encapsulating code generation for each output language in a set of classes called generators, which implement a family of emit* methods, like emitDate and emitNumber, used to produce the output code.

Class composition

[Figure: Class dependency diagram.]

Our design was informed by the need for the visitor to call generator methods without knowing which generator it is using. Since code generation has a lot in common regardless of the output language, we wanted a system in which we could abstract the default behavior as much as possible and leave the generator to handle only edge cases. We chose to make use of JavaScript's dynamic mechanics for inheritance and method dispatch by having the generator class inherit from the visitor class. Because JavaScript does not require that methods be defined before they are called, the visitor can call emit methods on itself that are actually defined in the generator, and the generator can call visitor methods to continue the tree traversal.
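The same composition trick can be sketched in a few lines of Python (illustrative only; the real plugin does this in JavaScript over ANTLR parse trees, and all names below are made up). The visitor walks a tree and calls emit hooks it does not define; a factory builds a generator class that inherits from the visitor and supplies them:

    import ast

    class JSObjectVisitor(ast.NodeVisitor):
        """Input-language side: walks the tree and delegates all output
        formatting to self.emit_* hooks defined in some generator."""
        def visit_Expression(self, node):
            return self.visit(node.body)
        def visit_Dict(self, node):
            pairs = [(self.visit(k), self.visit(v))
                     for k, v in zip(node.keys, node.values)]
            return self.emit_dict(pairs)
        def visit_Constant(self, node):
            return repr(node.value)

    def get_python_generator(visitor_cls):
        class Generator(visitor_cls):
            # Output-language side: only formatting decisions live here.
            def emit_dict(self, pairs):
                return '{' + ', '.join(f'{k}: {v}' for k, v in pairs) + '}'
        return Generator

    transpiler = get_python_generator(JSObjectVisitor)()
    tree = ast.parse("{'hello': 'world'}", mode="eval")
    print(transpiler.visit(tree))  # -> {'hello': 'world'}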
Using a generator class determined by the output language and a visitor class determined by the input language, we compose a transpiler on the fly as it is exported. Generators are similar to an abstract interface, except that there are no classic interfaces in JavaScript. As illustrated in the code snippet below, for each language combination our application creates a specialized transpiler instance composed of the corresponding visitor and generator classes. When our application receives a piece of code from the user, it creates a parse tree. The transpiler then visits the parse tree, using the ParseTreeVisitor's visit method inherited through our custom visitor subclass and the language-specific, ANTLR-generated visitor class (such as ECMAScriptVisitor).

    // Each composite transpiler instance has the ability to traverse the parse tree
    // for a specific language with its 'visit*' methods, and generate output code for
    // another language with its 'emit*' methods.
    const getJavascriptVisitor = require('./codegeneration/javascript/Visitor');
    const getJavaGenerator = require('./codegeneration/java/Generator');
    const getPythonGenerator = require('./codegeneration/python/Generator');
    ...
    const loadJSTree = (input) => { /* Lexing and parsing the user input */ ... };

    /**
     * Compose a transpiler and return a compile method that will use that transpiler
     * to visit the tree and return generated code.
     *
     * @param {function} loadTree - the method takes in user input and returns a tree.
     * @param {Visitor} visitor - the input-language visitor.
     * @param {function} generator - returns a generator that inherits from its arg.
     *
     * @returns {function} the compile function to be exported
     */
    const composeTranspiler = (loadTree, visitor, generator) => {
      const Transpiler = generator(visitor);
      const transpiler = new Transpiler();
      return {
        compile: (input) => {
          const tree = loadTree(input);
          return transpiler.start(tree);
        }
      };
    };

    module.exports = {
      javascript: {
        java: composeTranspiler(
          loadJSTree,
          getJavascriptVisitor(JavascriptANTLRVisitor), // Visitor + ANTLR visitor
          getJavaGenerator // Method that takes in a superclass, i.e. the visitor
        ),
        python: composeTranspiler(
          loadJSTree,
          getJavascriptVisitor(JavascriptANTLRVisitor),
          getPythonGenerator
        ),
        ...
      },
      ...
    }

Tree Traversal Example

Simple Nodes

In the most straightforward case, take the JavaScript snippet "hello world". The first thing the custom visitor class needs to do is specify the entry point for the tree traversal. Since the entry nodes in different languages have different names (file_input in Python, but program in JavaScript), we define a method in each visitor called start that calls the visit method for the root node of that input language. That way our compiler can simply call start on each visitor without having to worry about what the root node is called.

    // Entry point for the tree traversal
    class Visitor extends ECMAScriptVisitor {
      start(ctx) {
        return this.visitProgram(ctx);
      }
    }

The default behavior of the ANTLR visit methods is to recurse on each child node and return the results in an array. If the node doesn't have any children, the visit method returns the node itself. So if we did not overwrite any of the ANTLR methods, the return value of our start method would be an array of nodes. To go from returning nodes to returning strings in our simple "hello world" example, we first overwrite the visitTerminal method so that leaf nodes return the raw text stored in the node.
We then modify the visitChildren method so that, instead of putting the results of visiting each child node into an array, it concatenates the results into a string. Those two changes are enough for our "hello world" example to be fully translated into a language that uses the same string representation, like Python.

    // Overwriting of 'visitTerminal()' method
    class Visitor extends ECMAScriptVisitor {
      start(ctx) {
        return this.visitProgram(ctx);
      }
      // Visits a leaf node and returns a string
      visitTerminal(ctx) {
        return ctx.getText();
      }
      // Concatenate the results of recurring on child nodes
      visitChildren(ctx) {
        return ctx.children.reduce(
          (code, child) => `${code} ${this.visit(child)}`,
          ''
        );
      }
    }

Transformations

However, we cannot always just concatenate the text values of the terminal nodes to form the result. We need to transform floating point numbers, as well as numbers in different numeral systems, without losing precision. For string literals we need to think about single and double quotes, escape sequences, comments, spaces, and empty lines. This type of transformation logic can be applied to any type of node. Let's look at a concrete example: in Python, an object property name must be enclosed in quotes ({'hello': 'world'}), while in JavaScript this is optional ({hello: 'world'}). In this particular case, this is the only modification we need in order to transform a fragment of JavaScript code into Python code.

    // Transformation of JavaScript code into Python code
    class Visitor extends ECMAScriptVisitor {
      ...
      visitPropertyExpressionAssignment(ctx) {
        const key = this.visit(ctx.propertyName());
        const value = this.visit(ctx.singleExpression());
        if ('emitPropertyExpressionAssignment' in this) {
          return this['emitPropertyExpressionAssignment'](key, value);
        }
        return `${key}: ${value}`;
      }
    }

The propertyExpressionAssignment node has two child nodes (propertyName and singleExpression). To get the values of these two child nodes, we traverse them separately as left-hand-side and right-hand-side subtrees. Traversing the subtrees returns the original or transformed values of the child nodes, and we can then build a new string from the retrieved values to make up the transformed code fragment. Instead of doing this in the visitor directly, we check whether a corresponding emit method exists. If the visitor finds a proper emit method, it delegates the transformation process to the generator class; by doing this we free our visitors from knowing anything about the output language. We just assume that some generator class knows how to handle the output language. If the method doesn't exist, the visitor returns the original string without any transformation. In our case we assume an emitPropertyExpressionAssignment was supplied, and it will return the transformed JavaScript string.

Processing

In more complex cases, we must do some preprocessing in the visitor before we can call any emit methods. For example, date representations are a complex case, because dates have a wide range of acceptable argument formats across different programming languages. We need to do some preprocessing in the visitor so that all the emit methods receive the same information, regardless of the input language. For a date node, the easiest way to represent the date information is to construct a JavaScript Date object and pass it to the generator. Node types that need preprocessing have a process* method defined in the visitor to handle it.
For a date node, this method is called processDate.

    // 'processDate()' creates a date object and passes it to the emit method
    processDate(node) {
      let text = node.getText(); // Original input text for this node
      let date;
      try {
        date = this.executeJavascript(text); // Construct a date object in a sandbox
      } catch (error) {
        throw new BsonTranspilersError(error.message);
      }
      if ('emitDate' in this) {
        return this.emitDate(node, date);
      }
      ...
    }

For this processDate method, since we are compiling JavaScript and the transpiler is written in JavaScript, we took a shortcut: executing the user's input to construct the Date. Because the input has already been tokenized, we know exactly what the code contains, so it is safe to execute in a sandbox. For processing dates in other languages we would instead parse the arguments and construct the date object from them. Upon completion, the process method calls the respective emit* method, emitDate, and passes it the constructed Date as an argument.

    // This is a generator that generates code for Python.
    // The 'emitDate()' method is defined in the Generator and called from the Visitor.
    module.exports = (superClass) => class Generator extends superClass {
      emitDate(node, date) {
        const dateStr = [
          date.getUTCFullYear(),
          date.getUTCMonth() + 1,
          date.getUTCDate(),
          date.getUTCHours(),
          date.getUTCMinutes(),
          date.getUTCSeconds()
        ].join(', ');
        return `datetime.datetime(${dateStr}, tzinfo=datetime.timezone.utc)`;
      }
    };

Given the input string Date('2019-02-12T11:31:14.828Z'), the root of the parse tree will be a FuncCallExpression node, and the visit method for this node is called visitFuncCallExpression().

    // Example of a Visitor class that extends the ECMAScript grammar
    class Visitor extends ECMAScriptVisitor {
      /**
       * Visits a node that represents a function call.
       *
       * @param {FuncCallExpression} node - The tree node
       * @return {String} - The generated code
       */
      visitFuncCallExpression(node) {
        const lhs = this.visit(node.functionName());
        const rhs = this.visit(node.arguments());
        if (`process${lhs}` in this) {
          return this[`process${lhs}`](node);
        }
        if (`emit${lhs}` in this) {
          return this[`emit${lhs}`](node);
        }
        return `${lhs}${rhs}`;
      }
      ...
    }

The first thing the visit method does is recurse on its two child nodes. The left-hand child represents the function name node, i.e. Date. The right-hand child represents the arguments node, i.e. 2019-02-12T11:31:14.828Z. Once the method retrieves the name of the function, it checks whether that function requires any preprocessing: first whether a processDate method is defined and, failing that, whether an emitDate method is defined. Even though emitDate is defined in the generator, since the visitor and generator are composed into one class, the visitor treats emit methods as if they were its own class methods. If neither method exists, the visit* method returns a concatenation of the results of recursing on the child nodes.

Every input language has its own visitor that can contain processing logic, and every output language has its own generator that contains the transformation logic required for that specific language. As a rule, transformations required by all output languages happen as processing logic, while all other transformations happen in the generator. With this design, different transpilers based on different visitors can use the same generator methods.
That way, for every input language we add, we only need to define a single visitor, and for every output language we add, we only need to define a single generator. For n languages we want to support, we now have O(n) work instead of writing one visitor-generator for every language combination.

Conclusion

The Compass BSON transpiler plugin has the potential to parse and generate MongoDB queries and aggregations in any programming language. The current version supports several input languages (MongoDB Shell, JavaScript, and Python) and output languages (Java, C#, Python, MongoDB Shell, and JavaScript). The BSON transpiler plugin is built as a standalone Node.js module and can be used in any browser-based or Node.js application with npm install bson-transpilers. Like many other MongoDB projects, the BSON transpiler plugin is open source, and we welcome contributions; if you want to contribute, please check the contributing section of the repo on GitHub.

When writing the BSON transpiler, we were guided by general compiler design principles (lexical analysis, syntax analysis, tree traversal). We used ANTLR to reduce the amount of manual work required to parse the input languages of interest, which let us focus mostly on modularizing the code generation process. A major benefit of modularizing the language definitions is that a user can contribute a new output language without needing to know anything about the input languages that are currently supported. The same rule applies to adding a new input language: you should be able to define your visitor without needing to care about the existing generators.

The latest version of the BSON transpiler plugin is more complex and powerful than what has been covered in this post. It supports a wider range of syntax through the use of a symbol table. It also includes the entire BSON library, function calls with arguments and type validation, and informative error messages. On top of that, we have added a high level of optimization by using string templates to abstract much of the code generation. All of these developments will be described in a future blog post.

Written by Anna Herlihy, Alena Khineika, and Irina Shestak. Illustrations by Irina Shestak.

Further Reading

Compiler in JavaScript using ANTLR by Alena Khineika;
Compiler Construction by Niklaus Wirth;
Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi, and Jeffrey D. Ullman;
The Elements of Computing Systems by Noam Nisan and Shimon Schocken.

July 1, 2019
Engineering Reports

Repeatable Performance Tests: CPU Options Are Best Disabled

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise in several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure the noise itself and see whether configuration changes could minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup and found no evidence of particularly good or bad EC2 instances. In the next step, we investigated IO and found that EBS instances are the stable option for us. Having found very stable behavior as far as disks were concerned, this third and final experiment turns to tuning CPU-related knobs to minimize noise from that part of the system.

Investigate CPU Options

We had already built up knowledge around fine-tuning CPU options when setting up another class of performance benchmarks (single-node benchmarks). That work showed us that CPU options can have a large impact on performance, and it left us familiar with a number of knobs we could adjust:

- Idle Strategy (set at kernel boot): idle=poll puts Linux into a loop when idle, checking for work.
- Max sleep state (c4 only; set at kernel boot): intel_idle.max_cstate=1 intel_pstate=disable disables the use of advanced processor sleep states.
- CPU Frequency (set on the command line): sudo cpupower frequency-set -d 2.901GHz sets a fixed frequency and doesn't allow the CPU to vary the frequency for power saving.
- Hyperthreading (set on the command line): echo 0 > /sys/devices/system/cpu/cpu$i/online disables hyperthreading. Hyperthreading allows two software threads of execution to share one physical CPU; they compete against each other for resources.

We added some CPU-specific tests to measure CPU variability. These tests let us see whether CPU performance itself is noisy, independently of whether that noise makes MongoDB performance noisy. For our previous work on CPU options, we had written some simple tests in our C++ harness that would, for example:

- multiply numbers in a loop (CPU bound)
- sleep 1 or 10 ms in a loop
- do nothing (no-op) in the basic test loop

We added these tests to our System Performance project and were able to run them on the client only as well as across the network. We ran our tests 5x5 times, changing one configuration at a time, and compared the results. The first two graphs below contain results for the CPU-focused benchmarks; the third contains the MongoDB-focused benchmarks. In all the graphs below, we graph the "noise" metric as a percentage computed from (max-min)/median; lower is better.

We start with our focused CPU tests, first on the client only and then connecting to the server. We've omitted the sleep tests from the client graphs for readability, as they were essentially 0.

[Figure: Results for CPU-focused benchmarks with different CPU options enabled.]

The no-op test is the noisiest test all around, which is reasonable because it does nothing in the inner loop. The CPU-bound loop is more interesting: it is low on noise for many cases but has occasional spikes for each case, except for the c3.8xlarge with all the controls on (pinned to one socket, hyperthreading off, no frequency scaling, idle=poll).
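To make the noise metric concrete, here is a toy version of such a measurement in Python (our actual tests lived in a C++ harness; the names and iteration counts here are made up for illustration):

    import statistics
    import time

    def cpu_loop(n=2_000_000):
        # CPU-bound inner loop: multiply numbers, similar in spirit to our tests.
        x = 1.0
        for _ in range(n):
            x *= 1.0000001
        return x

    samples = []
    for _ in range(10):
        start = time.perf_counter()
        cpu_loop()
        samples.append(time.perf_counter() - start)

    # The "noise" metric used in the graphs: (max - min) / median; lower is better.
    noise = (max(samples) - min(samples)) / statistics.median(samples)
    print(f"noise: {noise:.1%}")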
[Figure: Results for tests run on server with different CPU options enabled.]

When we connect to an actual server, the tests become more realistic but also introduce the network as a possible source of noise. In the cases in which we multiply numbers in a loop (cpuloop) or sleep in a loop (sleep), the final c3.8xlarge configuration with all controls enabled is consistently among the lowest noise and doesn't do badly in the ping case (no-op on the server). Do those results hold when we run our actual tests?

[Figure: Results for tests run on server with different CPU options enabled.]

Yes, they do. The right-most blue bar is consistently around 5%, which is a great result! Perhaps unsurprisingly, this is the configuration where we used all of the tuning options: idle=poll, hyperthreading disabled, and only a single socket in use.

We continued to compare c4 and c3 instances against each other in these tests. We expected that the c4, being a newer architecture with more tuning options, would achieve better results. This was not the case; the c3.8xlarge continued to have the smallest range of noise. Another assumption proven wrong!

We had also expected that write-heavy tests, such as batched inserts, would mostly benefit from the more stable IOPS of our new EBS disks, and that CPU tuning would mostly affect CPU-bound benchmarks such as map-reduce or index builds. This turned out to be wrong too: for our write-heavy tests, the noise did not predominantly come from disk. The tuning available for CPUs has a huge effect on threads that are waiting or sleeping. The performance of threads that are running at full speed is less affected, because in those cases the CPU runs at full speed as well. Therefore, IO-heavy tests are affected a lot by CPU tuning!

Disabling CPU options in production

Deploying these configurations into production made insert tests even more stable from day to day:

[Figure: Improvements in daily performance measurements through changing to EBS and disabling CPU options.]

Note that the absolute performance of some tests actually dropped, because the number of available physical CPUs dropped by half due to using only a single socket; disabling hyperthreading caused a further drop, though not quite a full half, of course.

Conclusion

Drawing upon prior knowledge, we decided to fine-tune CPU options. We had previously assumed that IO-heavy tests would get most of their noise from disk and that CPU tuning would mostly affect CPU-bound tests. As it turns out, the tuning available for CPUs has a huge effect on threads that are waiting or sleeping, and therefore a huge effect on IO-heavy tests. Through CPU tuning, we achieved very repeatable results. The overall measured performance of the tests decreased, but this is less important to us: we care about stable, repeatable results more than maximum performance.

This is the third and last of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EC2 instances are neither good nor bad and that EBS instances are the stable option.

April 30, 2019
Engineering Reports
