MongoDB Engineering Blog

Posts on Engineering efforts, achievements, and culture at MongoDB.

We Replaced an SSD with Storage Class Memory; Here is What We Learned

On April 2, 2019, Intel Optane Persistent Memory became the first commercially available storage class memory (SCM) product. Like an SSD, this memory is persistent, and like DRAM it sits on the memory bus. Long before its commercial release, system architects pondered exactly how SCM fits into the storage hierarchy, and now came an opportunity to perform concrete measurements. One question we wanted to answer is whether a storage device sitting on the memory bus can deliver better throughput than an equivalent storage device sitting on the PCIe bus.

There are two ways to access SCM: a byte interface and a block interface. The byte interface allows using load and store instructions, just as with DRAM. The block interface exposes SCM as a block device, optionally with a file system on top; this way it can be accessed just like a conventional SSD. The load/store API is streamlined, because nothing stands between the application and the hardware, but it is also tricky to use, because it does not provide features like crash consistency, the way a file system API usually does. Accessing SCM via the block or file system API comes with OS overhead, but there is no need to rewrite applications.

WiredTiger, the MongoDB storage engine we evaluated in this article, reads and writes data to and from storage in sizeable blocks (typically 4KB or larger). Besides being a necessity on conventional storage hardware today, using the block API has other practical advantages. For example, compression and encryption, features that customers covet, are optimized to work with blocks. Similarly, the checksums that safeguard against data corruption are computed on blocks of data. Furthermore, WiredTiger caches blocks of data in its DRAM-resident cache, which, together with the OS buffer cache, is a boon to performance. Block orientation and reliance on caching positioned WiredTiger, like many other storage engines, to effectively hide the latency of slow storage devices.
As a result, our experiments revealed that the modest latency advantage SCM provides over a technology-equivalent SSD does not translate into performance advantages for realistic workloads: the storage engine effectively masks these latency differences. SCM will shine when it is used for latency-sensitive operations that cannot be hidden with batching and caching, such as logging. In the rest of the article we detail the experiments that led us to this conclusion.

Experimental Platform

We experimented with two storage devices: Intel Optane DC Persistent Memory (PM) and the Intel Optane P4800X SSD. Both are built with Intel Optane 3D XPoint non-volatile memory, but the former is an SCM that sits on the memory bus while the latter is a PCIe-attached SSD.

Microbenchmarks

To begin with, we gauged raw device bandwidth with microbenchmarks that read or write a 32GB file using 8KB blocks. We vary the number of threads simultaneously accessing the file, each in its own chunk. A file can be accessed either via system calls (read/write) or mmap; the latter method usually has less overhead.

The SSD drive's raw performance meets the spec. According to the spec, our P4800X drive is capable of up to 2.5GB/s sequential read bandwidth and up to 2.2GB/s sequential write bandwidth. Here are the numbers we observed via the Ubuntu Linux (5.3 kernel) raw device API, meaning that the data bypasses the OS buffer cache.

Raw SSD performance, sequential reads and writes

The read bandwidth behaves according to the specification as long as we use at least two threads. The write bandwidth, unexpectedly, exceeds its specified upper bound when using multiple threads. We suspect this could be due to buffering of writes either in the OS or on the device. The Optane P4800X SSD is faster than a typical SSD at the time of this writing, but not to the point of being incomparable.
While the Optane SSD offers up to 2.5GB/s of sequential read bandwidth, a typical NAND SSD (e.g., the Intel SSD Pro 6000p series) offers up to 1.8GB/s. The difference is more noticeable with writes: the Optane drive can deliver up to 2.2GB/s, while the NAND drive can do no more than 0.56GB/s. Random reads and writes on the Optane SSD are not much worse than sequential ones. We are able to achieve (using mmap) close to peak sequential throughput with reads, and only 10% short of peak sequential throughput with writes. Near-identical performance of sequential and random access is a known feature of these drives.

SCM offers high-bandwidth reads, but low-bandwidth writes

Now let us look at the raw performance of SCM. Intel systems supporting Optane PM can fit up to six DIMMs; our experimental system had only two. We measured the throughput on a single DIMM and on two DIMMs used together, to extrapolate scaling as the number of DIMMs increases. We also relied on data from other researchers to confirm our extrapolation.

There are two ways to obtain direct access to PM: (1) devdax, where a PM module is exposed as a character device, and (2) fsdax, where a file system sits on top of a PM module masquerading as a block device, but file accesses bypass the buffer cache via the Direct Access (DAX) mode. In our experiments we used the ext4 file system. The following charts show the throughput of sequential reads and writes obtained via these access methods. In all cases we use mmap, because that is the only method supported by devdax. Sequential read bandwidth of a single PM module reaches about 6.4GB/s; that matches the observations of other researchers. Random access behaves almost identically to sequential access, so we omit those charts.

Storage class memory, sequential reads, single PMEM device.

Storage class memory, sequential writes, single PMEM device.

Write experiments tell a different story: single-module write throughput reaches a mere 0.6GB/s.
This measurement does not agree with the data from the UCSD researchers, who observed around 2.3GB/s write bandwidth on a single device. Further investigation led us to believe that the difference is due to hardware versions. That said, our observations reveal that a single PM module achieves write throughput comparable only to a NAND SSD.

Next, let's look at scaling across two devices. The following figures show the measurements for sequential reads and writes, using mmap over fsdax. We used the Linux striped device mapper to spread the load across two DIMMs. For reads, two DIMMs almost double the peak read bandwidth, from 6.4GB/s with one DIMM to 12.4GB/s with two. Similarly, researchers at UCSD observed nearly linear scaling across six DIMMs.

Storage class memory, sequential reads, comparison between one and two PMEM devices.

Storage class memory, sequential writes, comparison between one and two PMEM devices.

For writes, we achieve nearly 1GB/s of write throughput with two DIMMs relative to 0.6GB/s with one, so the scaling is less than linear, if we can extrapolate from a single data point. The UCSD researchers observed that bandwidth with six DIMMs improved by 5.6x relative to a single DIMM, which is in line with our observation. Extrapolating from these data points, if our system had six DIMMs, we would observe around 3.4GB/s of peak write bandwidth, about 50% better than the Optane SSD.

In summary, with bare device access we see about 2.5GB/s of peak read bandwidth on the Optane SSD and about 6.4GB/s on a single Optane PM module; with six modules, the read throughput would be roughly 38GB/s. Write throughput was only 0.6GB/s on a single PM module, projected to reach 3.4GB/s with six, while the Optane SSD reached 2.2GB/s.
Optane SCM has a significant edge over the SSD for reads, and a small advantage for writes, provided you can afford six PM modules; otherwise, an SSD will deliver higher write throughput.

Software caching attenuates SCM performance advantage

While SCM is closer to the speed of DRAM than traditional storage media, DRAM is still faster, so the advantages of DRAM caching are difficult to overlook. The following charts show that with the buffer cache on (here we are using ext4 without the DAX option), all devices perform roughly the same, regardless of whether we are doing reads or writes, random or sequential access. These experiments were done with a warm buffer cache, i.e., the file was already in the buffer cache before the experiment began, so here we are measuring pure DRAM performance. With access to data taking less time, software overheads become more evident, which is why mmap is much faster than system calls when we use eight or more threads.

Sequential reads on SSD and SCM with a warm buffer cache.

Random reads on SSD and SCM with a warm buffer cache.

Sequential writes on SSD and SCM with a warm buffer cache.

Random writes on SSD and SCM with a warm buffer cache.

If we begin each experiment with a cold buffer cache, the difference between the devices is still visible, but less apparent than if we bypass the buffer cache altogether. With a cold buffer cache, on the read path the OS has to copy the data from the storage device into the buffer cache before making it available to the application, adding software overhead. Furthermore, with the buffer cache on, the OS does not use huge pages. These factors dampen the raw read advantage of SCM. For writes, whereas SCM delivered lower bandwidth than the SSD with raw access, SCM now outpaces the SSD, likely because the buffer cache absorbs and batches some writes instead of flushing each one to the device immediately.

Sequential reads on SSD and SCM with a cold buffer cache.
Random reads on SSD and SCM with a cold buffer cache.

Sequential writes on SSD and SCM with a cold buffer cache.

Random writes on SSD and SCM with a cold buffer cache.

Experiments with the storage engine

Like most storage engines, WiredTiger was designed and tuned to leverage a DRAM cache. Both WiredTiger's internal cache and the OS buffer cache are crucial for performance in all workloads we measured. Running WiredTiger without the OS buffer cache (fsdax mode) reduced its performance by up to 30x in our experiments; hence, we did not use the direct-access mode.

We ran WiredTiger's wtperf benchmark suite, which was designed to stress various parts of the system and emulate typical workloads observed in practice. The WiredTiger internal cache size varies between a few and a hundred gigabytes across the benchmarks, and most benchmarks use at least a dozen threads. (Detailed configuration parameters for all benchmarks can be viewed here.) There is no locking in WiredTiger on the common path, so thread-level concurrency usually translates into high CPU utilization and concurrent I/O. As described in our previous blog post, we added a feature to WiredTiger to use mmap for I/O instead of system calls.

The following charts show the performance of the wtperf suite on Intel Optane SCM (one and two modules) and on the Optane SSD. Apart from one write-intensive benchmark, evict-btree-1, which is faster on SSD, there are no statistically significant differences between the two. Using a dual-module SCM over a single-module SCM gives no performance advantage either. While SCM has higher bandwidth than a technology-equivalent SSD, the advantage is within an order of magnitude, and, as it turns out, effective DRAM caching hides that difference. Latency, and not bandwidth, is where SCM can shine.
In contrast to bandwidth, the latency of reading a block of data from Optane PM is two orders of magnitude shorter than reading it from the Optane SSD: about 1 microsecond versus 100-200 microseconds. The most obvious place in a storage engine where latency could be the bottleneck is logging, and the academic literature is rife with successful examples of using Optane PM for logging. In the meantime, stay tuned for our next report on exploring persistent memory.

WiredTiger benchmarks on SSD and SCM. Group 1.

WiredTiger benchmarks on SSD and SCM. Group 2.

WiredTiger benchmarks on SSD and SCM. Group 3.

If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

August 27, 2020
Engineering Blog

Getting Storage Engines Ready for Fast Storage Devices

Over the past two decades, the performance of storage hardware has increased by two orders of magnitude: first with the introduction of solid state drives (SSDs), then with the transition from SATA to PCIe, and finally with innovation in non-volatile memory technology and the manufacturing process [1, 7]. More recently, in April 2019, Intel released the first commercial Storage Class Memory (SCM). Its Optane DC Persistent Memory, built with 3D XPoint technology, sits on the memory bus and further reduces I/O latency [2].

While device access used to dominate I/O latency, the cost of navigating the software stack of a storage system becomes more prominent as device access times shrink. This has resulted in a flurry of academic research and in changes to commercially used operating systems (OS) and file systems. Despite these efforts, mainstream system software is failing to keep up with rapidly evolving hardware. Studies [4, 5, 6] have shown that file system and other OS overheads still dominate the cost of I/O on very fast storage devices, such as SCMs. In response to these challenges, academics proposed a new user-level file system, SplitFS [6], that substantially reduces these overheads. Unfortunately, adopting a user-level file system is not a viable option for many commercial products. Apart from concerns about correctness, stability, and maintenance, adopting SplitFS would restrict portability, as it runs only on Linux and only on top of the ext4-DAX file system.

Fortunately, there is something that can be done in software storage engines that care about I/O performance. Within MongoDB's storage engine, WiredTiger, we were able to essentially remove the brakes that the file system applied to our performance without sacrificing the convenience it provides or losing portability. Our changes rely on using memory-mapped files for I/O and batching expensive file system operations.
These changes resulted in up to 63% performance improvements for 19 out of 65 benchmarks on mainstream SSDs.

Streamlining I/O in WiredTiger

Our changes to WiredTiger were inspired by a study from UCSD [4], in which the authors demonstrated that by using memory-mapped files for I/O and by pre-allocating some extra space in a file whenever it needed to grow, they could achieve almost the same performance as if the file system were completely absent.

Memory-mapped files

Memory-mapped files work as follows. The application makes an mmap system call, whereby it requests the operating system to "map" a chunk of its virtual address space to a same-sized chunk in the file of its choice (Step 1 in Fig. 1). When it accesses memory in that portion of the virtual address space for the first time (e.g., virtual page 0xABCD in Fig. 1), the following events take place:

1. Since this virtual address has not been accessed before, the hardware generates a trap and transfers control to the operating system.
2. The operating system determines that this is a valid virtual address and asks the file system to read the corresponding page-sized part of the file into its buffer cache.
3. The operating system creates a page table entry mapping the user virtual page to the physical page in the buffer cache (e.g., physical page 0xFEDC in Fig. 1), where that part of the file resides (Step 2 in Fig. 1).
4. Finally, the virtual-to-physical translation is inserted into the Translation Lookaside Buffer (TLB, a hardware cache for these translations), and the application proceeds with the data access.
Fig. 1. Memory-mapped files work as follows: (1) establish a virtual memory area for the mapped file, (2) place the virtual-to-physical address translation into the page table, (3) cache the translation in the Translation Lookaside Buffer (TLB).

Subsequent accesses to the same virtual page may or may not require operating system involvement, depending on the following:

1. If the physical page containing the file data is still in the buffer cache and the page table entry is in the TLB, operating system involvement is not necessary, and the data is accessed using regular load or store instructions.
2. If the page containing the file data is still in the buffer cache but the TLB entry was evicted, the hardware transitions into kernel mode, walks the page table to find the entry (assuming the x86 architecture), installs it into the TLB, and then lets the software access the data using regular load or store instructions.
3. If the page containing the file data is not in the buffer cache, the hardware traps into the OS, which asks the file system to fetch the page, sets up the page table entry, and proceeds as in scenario 2.

In contrast, system calls cross the user/kernel boundary every time we need to access a file. Even though memory-mapped I/O also crosses the user/kernel boundary in the second and third scenarios described above, the path it takes through the system stack is more efficient than the one taken by system calls. Dispatching and returning from a system call adds CPU overhead that memory-mapped I/O does not have [8]. Furthermore, if the data is copied from the memory-mapped file area to another application buffer, it typically uses a highly optimized AVX-based implementation of memcpy. When the data is copied from kernel space into user space via a system call, the kernel has to use a less efficient implementation, because the kernel does not use AVX registers [8].
Pre-allocating file space

Memory-mapped files allow us to substantially reduce the involvement of the OS and the file system when accessing a fixed-size file. If the file grows, however, we do need to involve the file system. The file system will update the file metadata to indicate its new size and ensure that these updates survive crashes. Ensuring crash consistency is especially expensive, because each journal record must be persisted to storage to make sure it is not lost in the event of a crash. If we grow a file piecemeal, we incur that overhead quite often. That is why the authors of SplitFS [6] and the authors of the UCSD study [4] both pre-allocate a large chunk of the file when an application extends it. In essence, this strategy batches file system operations to reduce their overhead.

Our Implementation

The team applied these ideas to WiredTiger in two phases. First, we implemented a design in which the size of the mapped file area never changes. Then, after making sure that this simple design works and yields performance improvements, we added the ability to remap files as they grow or shrink. That feature required efficient inter-thread synchronization and was the trickiest part of the whole design; we highlight it later in this section. Our changes have been in testing in the develop branch of WiredTiger since January 2020. As of this writing, the changes are only for POSIX systems; a Windows port is planned for the future.

Assuming a fixed-size mapped file area

Implementing this part required few code changes. WiredTiger provides wrappers for all file-related operations, so we only needed to modify those wrappers. Upon opening a file, we issue the mmap system call to also map it into the virtual address space. Subsequent calls to wrappers that read or write the file copy the desired part of the file from the mapped area into the supplied buffer. WiredTiger allows three ways to grow or shrink the size of a file.
The file can grow explicitly via the fallocate system call (or its equivalent), it can grow implicitly if the engine writes to the file beyond its end, or it can shrink via the truncate system call. In our preliminary design we disallowed explicitly growing or shrinking the file, which did not affect the correctness of the engine. If the engine writes to the file beyond the mapped area, our wrapper functions simply fall back to system calls; if the engine then reads a part of the file that has not been mapped, we also resort to a system call. While this implementation was decent as an early prototype, it was too limiting for a production system.

Resizing the mapped file area

The trickiest part of this feature is synchronization. Imagine a scenario involving two threads, one reading the file and the other truncating it. Prior to reading, the first thread checks that the offset from which it reads is within the mapped buffer's boundaries. Assuming it is, it proceeds to copy the data from the mapped buffer. However, if the second thread intervenes just before the copy and truncates the file so that its new size is smaller than the offset from which the first thread reads, the first thread's attempt to copy the data would result in a crash: the mapped buffer is now larger than the file, and copying data from the part of the buffer that extends beyond the end of the file generates a segmentation fault.

An obvious way to prevent this problem is to acquire a lock every time we need to access the file or change its size. Unfortunately, this would serialize I/O and could severely limit performance. Instead, we use a lock-free synchronization protocol inspired by read-copy-update (RCU) [9]. We will refer to all threads that might change the size of the file as writers.
A writer, therefore, is any thread that writes beyond the end of the file, extends it via the fallocate system call, or truncates it. A reader is any thread that reads the file.

Our solution works as follows. A writer first performs the operation that changes the size of the file and then remaps the file into the virtual address space. During this time we want nobody else accessing the mapped buffer, neither readers nor writers. However, it is not necessary to prevent all I/O from occurring at this time; we can simply route I/O through system calls while the writer is manipulating the mapped buffer, since system calls are properly synchronized in the kernel with other file operations. To achieve these goals without locking, we rely on two variables:

mmap_resizing: when a writer wants to indicate to others that it is about to exclusively manipulate the mapped buffer, it atomically sets this flag.

mmap_use_count: a reader increments this counter prior to using the mapped buffer and decrements it when done, so this counter tells us whether anyone is currently using the buffer. The writer waits until this counter drops to zero before proceeding.

Before resizing the file and the mapped buffer, writers execute the function prepare_remap_resize_file; its pseudocode is shown below. Essentially, the writer efficiently waits until no one else is resizing the buffer, then sets the resizing flag to claim exclusive rights to the operation. Then it waits until all readers are done using the buffer.

prepare_remap_resize_file:
  wait:
    /* wait until no one else is resizing the file */
    while (mmap_resizing != 0)
        spin_backoff(...);

    /* Atomically set the resizing flag; if this fails, retry. */
    result = cas(mmap_resizing, 1, ...);
    if (result) goto wait;

    /* Now that we have set the resizing flag, wait for
     * all readers to finish using the buffer. */
    while (mmap_use_count > 0)
        spin_backoff(...);

After executing prepare_remap_resize_file, the writer performs the file-resizing operation, unmaps the buffer, remaps it with the new size, and resets the resizing flag.

The synchronization performed by the readers is shown in the pseudocode of the function read_mmap:

read_mmap:
  /* Atomically increment the reference counter,
   * so no one unmaps the buffer while we use it. */
  atomic_add(mmap_use_count, 1);

  /* If the buffer is being resized, use the system
   * call instead of the mapped buffer. */
  if (mmap_resizing) {
      atomic_decr(mmap_use_count, 1);
      read_syscall(...);
  } else {
      memcpy(dst_buffer, mapped_buffer, ...);
      atomic_decr(mmap_use_count, 1);
  }

As a side note, threads writing the file must perform both the reader synchronization, as in read_mmap, to see if they can use the memory-mapped buffer for I/O, and the writer synchronization in case they are writing past the end of the file (hence extending its size). Please refer to the WiredTiger develop branch for the complete source code.

Batching file system operations

As we mentioned earlier, a crucial finding of the UCSD study that inspired our design [4] was the need to batch expensive file system operations by pre-allocating file space in large chunks. Our experiments with WiredTiger showed that it already uses this strategy to some extent. We ran experiments comparing two configurations: (1) in the default configuration, WiredTiger uses the fallocate system call to grow files; (2) in the restricted configuration, WiredTiger is not allowed to use fallocate and thus resorts to implicitly growing files by writing past their end. We measured the number of file system invocations in both cases and found that it was at least an order of magnitude smaller in the default configuration than in the restricted one.
This tells us that WiredTiger already batches file system operations. Investigating whether batching can be optimized for further performance gains is planned for the future.

Performance

To measure the impact of our changes, we compared the performance of the mmap branch and the develop branch on the WiredTiger benchmark suite WTPERF. WTPERF is a configurable benchmarking tool that can emulate various data layouts, schemas, and access patterns while supporting all kinds of database configurations. Out of 65 workloads, the mmap branch improved performance for 19. Performance of the remaining workloads either remained unchanged or showed insignificant changes (within two standard deviations of the average). Variance in the performance of two workloads (those that update a log-structured merge tree) increased by a few percent, but apart from these, we did not observe any downsides to using mmap.

The figures below show the performance improvement, in percent, of the mmap branch relative to develop for the 19 benchmarks where mmap made a difference. The experiments were run on a system with an Intel Xeon E5-2620 v4 processor (eight cores), 64GB of RAM, and an Intel Pro 6000p series 512GB SSD drive. We used default settings for all the benchmarks and ran each at least three times to ensure the results were statistically significant.

All but 2 of the benchmarks where mmap made a difference show significant improvements

Overall, there are substantial performance improvements for these workloads, but there are a couple of interesting exceptions. For 500m-btree-50r50u and update-btree, some operations (e.g., updates or inserts) are a bit slower with mmap, but others (typically reads) are substantially faster. It appears that some operations benefit from mmap at the expense of others; we are still investigating why this happens. One of the variables that explains improved performance with mmap is an increased rate of I/O.
For example, for the 500m-btree-50r50u workload (which simulates a typical MongoDB load), the read I/O rate is about 30% higher with mmap than with system calls. This statistic does not explain everything: after all, read throughput for this workload is 63% better with mmap than with system calls. Most likely, the rest of the difference is due to the more efficient code paths of memory-mapped I/O (as opposed to going through system calls), as observed in earlier work [8]. Indeed, we typically observe higher CPU utilization when using mmap.

Conclusion

Throughput and latency of storage devices improve at a higher rate than CPU speed thanks to radical innovations in storage technology and in the placement of devices in the system. Faster storage devices reveal inefficiencies in the software stack. In our work we focused on the overhead related to system calls and file system access and showed how it can be navigated by employing memory-mapped I/O. Our changes in the WiredTiger storage engine yielded up to a 63% improvement in read throughput. For more information on our implementation, we encourage you to take a look at the files os_fs.c and os_fallocate.c in the os_posix directory of the WiredTiger develop branch.

References

[1] List of Intel SSDs. https://en.wikipedia.org/wiki/List_of_Intel_SSDs
[2] Optane DC Persistent Memory. https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-dc-persistent-memory.html
[3] Linux Storage System Analysis for e.MMC with Command Queuing. https://www.micron.com/-/media/client/global/documents/products/white-paper/linux_storage_system_analysis_emmc_command_queuing.pdf?la=en
[4] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). http://cseweb.ucsd.edu/~juk146/papers/ASPLOS2019-APP.pdf
[5] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. 14th USENIX Conference on File and Storage Technologies (FAST '16). https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf
[6] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf
[7] SSD vs HDD. https://www.enterprisestorageforum.com/storage-hardware/ssd-vs-hdd.html
[8] Why mmap is faster than system calls. https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37
[9] Paul McKenney. What is RCU, Fundamentally? https://lwn.net/Articles/262464/

March 16, 2020
Engineering Blog

Transpiling Between Any Programming Languages

Input is converted via an ANTLR parse tree and code generation to output

MongoDB Compass, the UI for MongoDB, recently introduced a pair of features to make it easier for developers to work in their chosen language. Users can now export the queries and aggregations they build in the UI to their preferred language, and soon they will also be able to input them in their preferred language. Allowing developers the flexibility to choose among multiple input and output languages in Compass required us to build a custom solution in the form of a many-to-many transpiler. Most compilers are one-to-one or, less commonly, one-to-many or many-to-one; there are hardly any many-to-many transpilers. To avoid starting from scratch, we leveraged the open source parsing tool ANTLR, which provided us with a set of compiler tools along with preexisting grammars for the languages we needed. We minimized the additional complexity by devising a creative set of class hierarchies that reduced the amount of work needed from n² to n.

Motivation

MongoDB Compass is an application that provides a UI for the database and helps developers iteratively develop aggregations and queries. When building queries, the application currently requires input in a JavaScript-based query language called the MongoDB Shell.

Compass aggregation pipeline builder

To enable developers to use their preferred programming language when developing aggregation pipelines and queries, we wanted to add functionality in two parts. First, we wanted to allow developers familiar with the MongoDB Shell to export queries they create into the language they need (Python, Java, etc.). Second, we wanted to allow developers to use their language of choice while building a query. To achieve both and give users maximum flexibility, our system needed to accept multiple input languages as well as generate multiple output languages in an efficient way.
Compass Export to Language allows you to export a pipeline in the language of your choice

At the basis of these features is sophisticated compiler technology in the form of a transpiler. A transpiler is a source-to-source compiler: it takes the source code of a program written in one programming language as its input and produces the equivalent source code in another programming language. Since our transpiler currently supports MongoDB Extended JSON, the textual representation of BSON types, we call it a BSON transpiler. While we currently support only a subset of each programming language, the transpiler is designed in a way that would allow us to extend support to the entire language syntax.

Design Approach

The Compass application is designed with an extensible plugin architecture, allowing us to build the transpiler as a standalone plugin. To work with the Electron framework our application is based on, our plugin needed to be executable in JavaScript. We considered many existing transpilers written in JavaScript. However, our use case required any-language-to-any-language transformation with support for BSON, which meant we needed a custom solution. Compass queries always take the form of either an array of BSON documents (stages for the aggregation pipeline) or a single BSON document (for other queries) containing the MongoDB query language. While this constraint reduces the scope of the problem for the BSON transpiler, the language subset we need to support is large and complex enough that we decided to treat the problem as if we were adding full-language support. The naive approach to building a source-to-source compiler supporting multiple languages would require a polynomial amount of effort, since the number of language combinations is the product of the number of input and output languages. We needed a sustainable design in which adding a new input or output language only requires building O(1) components per language.
This reduces the entire problem to O(n) for n languages. We achieved this by abstracting the problem into independent input and output stages that are loosely coupled by their interface to a shared, intermediate, in-memory data structure: a parse tree. The input language stage just needs to build the tree, and the output language stage just needs to read it. Most compilers have two primary stages: parsing and code generation. The parsing stage is responsible for turning the literal text of a program into a tree representing an abstraction of its meaning, and the code generation stage walks that tree and produces an output that can be executed -- generally a binary of machine or virtual machine instructions. A key observation is that a source-to-source compiler can be seen as a specialized compiler in which the code generation stage generates program text, in another user-friendly language, instead of machine code. The design of our transpiler stems from that concept.

Parsing

In order to process some source code, such as the string new NumberDecimal(5), a lexical analyzer, or lexer, takes the raw code and splits it into tokens (this process is known as lexical analysis). A token is an object that represents a block of text corresponding to one of the primitive components of the language syntax: a number, a label, punctuation, an operator, and so on. In the parsing stage these tokens are then transformed into a tree structure that describes not only isolated pieces of the input code but also their relationship to each other. At this point the compiler is able to recognise language constructs such as variable declarations, statements, and expressions. The leaves of this tree are the tokens found by the lexical analysis. When the leaves are read from left to right, the sequence is the same as in the input text.
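As an intuition for what the lexer does with new NumberDecimal(5), here is a toy hand-rolled lexer. It is a sketch only: the token names and rules are made up and far simpler than what ANTLR's generated lexers do:

```javascript
// A toy lexer: scans the input left to right, emitting a token for each
// primitive syntax element and skipping whitespace.
function lex(input) {
  const tokenSpecs = [
    ['NUMBER', /^\d+(\.\d+)?/],
    ['IDENT', /^[A-Za-z_$][A-Za-z0-9_$]*/],
    ['LPAREN', /^\(/],
    ['RPAREN', /^\)/],
    ['WS', /^\s+/]
  ];
  const tokens = [];
  let rest = input;
  while (rest.length > 0) {
    const match = tokenSpecs
      .map(([type, re]) => [type, rest.match(re)])
      .find(([, m]) => m !== null);
    if (!match) throw new Error(`Unexpected character: ${rest[0]}`);
    const [type, m] = match;
    if (type !== 'WS') tokens.push({ type, text: m[0] });
    rest = rest.slice(m[0].length);
  }
  return tokens;
}

lex('new NumberDecimal(5)');
// [ IDENT 'new', IDENT 'NumberDecimal', LPAREN '(', NUMBER '5', RPAREN ')' ]
```

The parser's job is then to arrange this flat token sequence into the tree described above.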
Stages of compiler processing: input is transformed via lexical analysis into tokens, tokens are transformed via syntax analysis into an AST, which is used to generate the output code

We did not want to write our own parser and lexer, since doing so is incredibly time consuming even for a single language, and we have to support several. Luckily, there are many "parser-generator" tools that efficiently generate syntax trees from a set of rules, called a grammar. These tools take an input grammar, which is hierarchical and highly structured, parse an input string based on that grammar, and convert it to a tree structure. The tricky part of using parser-generators is the tedious and error-prone process of writing the grammars. Writing a grammar from scratch requires detailed knowledge of the input language, with all of its edge cases. If the transpiler needs to support many programming languages, writing grammars for each of the input languages would be a huge task.

Source-to-source transformation with ANTLR

This is why we decided to use ANTLR, a powerful parser-generator that, most importantly, already has grammars for almost all programming languages of interest. ANTLR also has a JavaScript runtime, allowing us to use it in our Node.js project. We considered using LLVM IR, a different set of compiler technologies that compile to an intermediate high-level representation. This approach would then need a separate step to compile the intermediate representation into the target language. This is a common pattern for multi-platform compilers, like the Clang/LLVM project. Unfortunately, there are not many existing compilers that go from the intermediate representation back to user programming languages. We would have had to write those compilers ourselves, so ultimately using LLVM would not have saved us much effort.
The code snippet below illustrates the basic structure of a program that builds a parse tree for ECMAScript (JavaScript) input source code. This code imports auxiliary lexer and parser files and lets ANTLR pull characters from the input string, create a character stream, convert it to a token stream, and finally build a parse tree.

```javascript
// It takes only a few lines to go from a string input to a fully parsed traversable tree!
const antlr4 = require('antlr4');
const ECMAScriptLexer = require('./lib/antlr/ECMAScriptLexer.js');
const ECMAScriptParser = require('./lib/antlr/ECMAScriptParser.js');

const input = 'new NumberDecimal(5)';
const chars = new antlr4.InputStream(input);
const lexer = new ECMAScriptLexer.ECMAScriptLexer(chars);
const tokens = new antlr4.CommonTokenStream(lexer);
const parser = new ECMAScriptParser.ECMAScriptParser(tokens);
const tree = parser.program();
```

The resulting parse tree inherits from the ANTLR-defined ParseTree class, giving it a uniform way to be traversed. Note that the parsing stage and the resulting parse tree are determined by the input language; they are completely independent of the stage where we generate the output language into which we seek to translate the source code. This independence in our design allows us to reduce the number of parts we need to write to cover our input and output languages from O(n²) to O(n).

Code Generation

Tree types

Using ANTLR for its library of pre-built grammars requires a slight compromise in our design. To understand why, it is necessary to understand the difference between two terms that are related and sometimes used interchangeably: a parse tree and an abstract syntax tree (AST). Conceptually these trees are similar, because they both represent the syntax of a snippet of source code; the difference is the level of abstraction. An AST has been fully abstracted to the point that no information about the input tokens themselves remains.
Because of this, ASTs representing the same instructions are indistinguishable, regardless of what language produced them. By contrast, a parse tree contains the information about the low-level input tokens, so different languages will produce different parse trees, even if they do the same thing.

Abstract syntax tree and parse tree comparison given an input of "new NumberDecimal(5)"

Ideally, our code generating stage would operate on an AST, not a parse tree, because having to account for language-specific parse trees introduces complexity we’d rather avoid. ANTLR4, however, only produces read-only parse trees. But the advantages of using ANTLR and its ready-made grammars are well worth that trade-off.

Visitors

Parse tree traversal

Like most compilers, the BSON transpiler uses a visitor pattern to traverse parse trees. ANTLR not only builds a parse tree but also programmatically generates a skeleton visitor class. This visitor class contains methods for traversing parse trees (one visit method for each type of node in the tree). All of these methods begin with visit and end with the name of the node that they visit, e.g. visitFuncCall() or visitAdditiveExpression(). The node names are taken directly from the input language grammar file, so each visitor class and its methods are tailored to the input language grammar. In the ANTLR-generated visitor class, these methods do nothing except recurse on the child nodes. In order for our visitor to be able to transpile code, we need to subclass the generated visitor class and override each visit method to define what to do with each type of node. Since the BSON transpiler is built to support multiple input languages, and each language produces a different kind of parse tree, we need to create one custom visitor for each input language supported by Compass.
However, as long as we avoid building a custom visitor for each combination of input and output language, we are still only building O(n) components. With this design, each visitor is responsible for traversing a single language's parse tree. The visitor calls functions as it visits each node and can return the original text of the node or transform it as needed. Starting from the root, the visitor calls its visit method recursively, descending to the leaves in depth-first order. On the way down, the visitor decorates nodes with metadata, such as type information. On the way up, it returns the transpiled code.

Generators

With a brute-force solution, the visit* methods of the visitor would contain the code for generating the output language text. To generate multiple output languages, we would have to specialize each method depending on the current output language. Overall, this approach would subclass every language-specific visitor class once for every output language, or worse yet, put a giant switch statement in each of the visit* methods with a case for each output language. Both of those options are brittle and require O(n²) development effort. Therefore we chose to decouple the code for traversing the language-specific trees from the code for generating output. We accomplished this by encapsulating code generation for each language into a set of classes called Generators, which implement a family of emit* methods, like emitDate and emitNumber, used to produce the output code.

Class composition

Class dependency diagram

Our design was informed by the need for the visitor to be able to call generator methods without knowing which generator it is using. Since code generation actually has a lot in common regardless of the output language, we wanted to implement a system where we could abstract the default behavior as much as possible and leave the generator to handle only edge cases.
We chose to make use of JavaScript’s dynamic mechanisms for inheritance and method dispatch by having the generator class inherit from the visitor class. Because JavaScript does not require that methods be defined before they are called, the visitor can call emit methods on itself that are actually defined in the generator, and the generator can call visitor methods to continue the tree traversal. Using a generator class determined by the output language and a visitor class determined by the input language, we are able to compose a transpiler on the fly as it is exported. Generators are similar to an abstract interface, except that there are no classic interfaces in JavaScript. As illustrated in the code snippet below, for each language combination our application creates a specialized transpiler instance composed of the corresponding visitor and generator classes. When our application receives a piece of code from the user, it creates a parse tree. The transpiler then visits the parse tree, using the ParseTreeVisitor’s visit method inherited from our custom Visitor subclass and the language-specific, ANTLR-generated visitor class (such as ECMAScriptVisitor).

```javascript
// Each composite transpiler instance has the ability to traverse the parse tree
// for a specific language with its 'visit*' methods, and generate output code for
// another language with its 'emit*' methods.
const getJavascriptVisitor = require('./codegeneration/javascript/Visitor');
const getJavaGenerator = require('./codegeneration/java/Generator');
const getPythonGenerator = require('./codegeneration/python/Generator');
...
const loadJSTree = (input) => { /* Lexing and parsing the user input */ ... };

/**
 * Compose a transpiler and return a compile method that will use that transpiler
 * to visit the tree and return generated code.
 *
 * @param {function} loadTree - the method that takes in user input and returns a tree.
 * @param {Visitor} visitor - the input-language visitor.
 * @param {function} generator - returns a generator that inherits from its arg.
 *
 * @returns {function} the compile function to be exported
 */
const composeTranspiler = (loadTree, visitor, generator) => {
  const Transpiler = generator(visitor);
  const transpiler = new Transpiler();
  return {
    compile: (input) => {
      const tree = loadTree(input);
      return transpiler.start(tree);
    }
  };
};

module.exports = {
  javascript: {
    java: composeTranspiler(
      loadJSTree,
      getJavascriptVisitor(JavascriptANTLRVisitor), // Visitor + ANTLR visitor
      getJavaGenerator // Method that takes in a superclass, i.e. the visitor
    ),
    python: composeTranspiler(
      loadJSTree,
      getJavascriptVisitor(JavascriptANTLRVisitor),
      getPythonGenerator
    ),
    ...
  },
  ...
}
```

Tree Traversal Example

Simple Nodes

For the most straightforward case, consider the JavaScript snippet "hello world". The first thing the custom visitor class needs to do is specify the entry point for the tree traversal. Since the entry nodes in different languages have different names (i.e. file_input in Python, but program in JavaScript), we define a method in each visitor called start that calls the visit method for the root node of that input language. That way our compiler can simply call start on each visitor without having to worry what the root node is called.

```javascript
// Entry point for the tree traversal
class Visitor extends ECMAScriptVisitor {
  start(ctx) {
    return this.visitProgram(ctx);
  }
}
```

The default behavior of the ANTLR visit methods is to recurse on each child node and return the results in an array. If the node doesn’t have any children, then the visit method returns the node itself. So if we did not overwrite any of the ANTLR methods, the return value of our start method would be an array of nodes. To go from returning nodes to returning strings in our simple "hello world" example, we first overwrite the visitTerminal method so that the leaf nodes return the raw text stored in the node.
We then modify the visitChildren method so that instead of putting the results of visiting each child node into an array, the results get concatenated into a string. Those two changes are enough for our "hello world" example to be fully translated into a language that uses the same string representation, like Python.

```javascript
// Overwriting of 'visitTerminal()' method
class Visitor extends ECMAScriptVisitor {
  start(ctx) {
    return this.visitProgram(ctx);
  }
  // Visits a leaf node and returns a string
  visitTerminal(ctx) {
    return ctx.getText();
  }
  // Concatenate the results of recurring on child nodes
  visitChildren(ctx) {
    return ctx.children.reduce(
      (code, child) => `${code} ${this.visit(child)}`,
      ''
    );
  }
}
```

Transformations

However, we cannot always just concatenate the text values of the terminal nodes to form the result. We need to transform floating point numbers, as well as numbers in different numeral systems, without losing any precision. For string literals we need to think about single and double quotes, escape sequences, comments, spaces, and empty lines. This type of transformation logic can be applied to any type of node. Let’s look at a concrete example: in Python, an object property name must be enclosed in quotes ( {'hello': 'world'} ); in JavaScript this is optional ( {hello: 'world'} ). In this particular case, this is the only modification we need in order to transform a fragment of JavaScript code into Python code.

```javascript
// Transformation of JavaScript code into Python code
class Visitor extends ECMAScriptVisitor {
  ...
  visitPropertyExpressionAssignment(ctx) {
    const key = this.visit(ctx.propertyName());
    const value = this.visit(ctx.singleExpression());
    if ('emitPropertyExpressionAssignment' in this) {
      return this.emitPropertyExpressionAssignment(ctx);
    }
    return `${key}: ${value}`;
  }
}
```

The propertyExpressionAssignment node has two child nodes (propertyName and singleExpression).
To get the values of these two child nodes, we need to traverse them separately as left-hand-side and right-hand-side subtrees. Traversing the subtrees returns the original or transformed values of the child nodes. We can then build a new string from the retrieved values to make up the transformed code fragment. Instead of doing this in the visitor directly, we check whether a corresponding emit method exists. If the visitor finds a proper emit method, it delegates the transformation process to the generator class. By doing this we free our visitors from knowing anything about the output language: we just assume that there is some generator class that knows how to handle the output language. If the emit method doesn’t exist, the visitor returns the original string without any transformation. In our case we assume an emitPropertyExpressionAssignment was supplied, and it will return the transformed string.

Processing

In more complex cases, we must do some preprocessing in the visitor before we can call any emit methods. For example, date representations are a complex case, because dates have a wide range of acceptable argument formats across different programming languages. We need to do some preprocessing in the visitor so that all the emit methods are sent the same information, regardless of input language. In the case of a date node, the easiest way to represent date information is to construct a JavaScript Date object and pass it to the generator. Node types that need preprocessing have a process* method defined in the visitor to handle it; for this example it is called processDate.
```javascript
// 'processDate()' creates a date object and passes it to the emit method
processDate(node) {
  let text = node.getText(); // Original input text for this node
  let date;
  try {
    date = this.executeJavascript(text); // Construct a date object in a sandbox
  } catch (error) {
    throw new BsonTranspilersError(error.message);
  }
  if ('emitDate' in this) {
    return this.emitDate(node, date);
  }
  ...
}
```

For this processDate method, since we are compiling JavaScript and the transpiler is written in JavaScript, we took a shortcut: executing the user's input to construct the Date. Because the input has already been tokenized, we know exactly what the code contains, so it is safe to execute in a sandbox. For processing dates in other languages we would instead parse the arguments and construct the date object from them. Upon completion, the process method calls the respective emit* method, emitDate, and passes it the constructed Date as an argument. Now we can call the required process and emit methods from the visitor’s appropriate visit method.

```javascript
// This is a generator that generates code for Python.
// The 'emitDate()' method is defined in the Generator and called from the Visitor.
module.exports = (superClass) => class Generator extends superClass {
  emitDate(node, date) {
    const dateStr = [
      date.getUTCFullYear(),
      date.getUTCMonth() + 1,
      date.getUTCDate(),
      date.getUTCHours(),
      date.getUTCMinutes(),
      date.getUTCSeconds()
    ].join(', ');
    return `datetime.datetime(${dateStr}, tzinfo=datetime.timezone.utc)`;
  }
};
```

Given the input string Date('2019-02-12T11:31:14.828Z'), the root of the parse tree will be a FuncCallExpression node. The visit method for this node is called visitFuncCallExpression().

```javascript
// Example of a Visitor class that extends the ECMAScript grammar
class Visitor extends ECMAScriptVisitor {
  /**
   * Visits a node that represents a function call.
   *
   * @param {FuncCallExpression} node - The tree node
   * @return {String} - The generated code
   */
  visitFuncCallExpression(node) {
    const lhs = this.visit(node.functionName());
    const rhs = this.visit(node.arguments());
    if (`process${lhs}` in this) {
      return this[`process${lhs}`](node);
    }
    if (`emit${lhs}` in this) {
      return this[`emit${lhs}`](node);
    }
    return `${lhs}${rhs}`;
  }
  ...
}
```

The first thing the visit method does is recurse on its two child nodes. The left-hand child represents the function name node, i.e. Date. The right-hand child represents the arguments node, i.e. ('2019-02-12T11:31:14.828Z'). Once the method retrieves the name of the function, it can check whether that function requires any preprocessing: it checks if the processDate method is defined and, failing that check, whether an emitDate method is defined. Even though the emitDate method is defined in the generator, since the visitor and generator are composed into one class, the visitor treats emit methods as if they were its own class methods. If neither method exists, the visit* method returns a concatenation of the results of the recursion on the child nodes. Every input language has its own visitor that can contain processing logic, and every output language has its own generator that contains the required transformation logic for that specific language. As a rule, transformations required by all output languages happen as processing logic, while all other transformations happen in the generator. With this design, different transpilers based on different visitors can use the same generator methods. For every input language we add, we only need to define a single visitor; similarly, for every output language we add, we only need to define a single generator. For n languages we want to support, we now have O(n) amount of work instead of having to write one visitor-generator for every language combination.
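The whole pattern in this section can be condensed into a runnable toy. The classes below are simplified stand-ins for the real ANTLR-generated classes, not the actual bson-transpilers implementation: the "node" is a plain object, and the emit method receives the already-visited key and value rather than a parse-tree context. It shows the composition trick, where the visitor checks for an emit* method that only exists once a generator mixin has been applied:

```javascript
// Toy visitor over a single-property object literal such as {hello: 'world'}.
class Visitor {
  start(node) {
    return `{${this.visitPropertyExpressionAssignment(node)}}`;
  }
  visitPropertyExpressionAssignment(node) {
    const key = node.propertyName;
    const value = node.singleExpression;
    // Delegate to the generator if it defines a matching emit* method;
    // 'in' finds it through the prototype chain of the composed class.
    if ('emitPropertyExpressionAssignment' in this) {
      return this.emitPropertyExpressionAssignment(key, value);
    }
    // Fall through: echo the input (JavaScript) syntax unchanged.
    return `${key}: ${value}`;
  }
}

// Generator mixin for Python output: property names must be quoted.
const getPythonGenerator = (superClass) => class extends superClass {
  emitPropertyExpressionAssignment(key, value) {
    return `'${key}': ${value}`;
  }
};

// Compose a transpiler on the fly, as composeTranspiler does above.
const PythonTranspiler = getPythonGenerator(Visitor);
const node = { propertyName: 'hello', singleExpression: "'world'" };

new Visitor().start(node);           // "{hello: 'world'}"
new PythonTranspiler().start(node);  // "{'hello': 'world'}"
```

A generator for another output language would be just one more small mixin; the visitor never changes.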
Conclusion

The Compass BSON transpiler plugin has the potential to parse and generate MongoDB queries and aggregations in any programming language. The current version supports several input (MongoDB Shell, JavaScript, and Python) and output (Java, C#, Python, MongoDB Shell, and JavaScript) languages. The BSON transpiler plugin is built as a standalone Node.js module and can be used in any browser-based or Node.js application with npm install bson-transpilers. Like many other MongoDB projects, the BSON transpiler plugin is open source: you can go to the repo, and we welcome contributions. If you want to contribute to the Compass BSON transpiler, please check our contributing section on GitHub. When writing the BSON transpiler, we were guided by general compiler design principles (lexical analysis, syntax analysis, tree traversal). We used ANTLR to reduce the amount of manual work required to parse the input languages of interest, which allowed us to focus mostly on modularizing the code generation process. A major benefit of modularizing the language definitions is that a user can contribute a new output language without needing to know anything about the input languages that are currently supported. The same rule applies for adding a new input language: you should be able to define your visitor without needing to care about the existing generators. The latest version of the BSON transpiler plugin is more complex and powerful than what has been covered in this blog post. It supports a wider range of syntax through the use of a symbol table. It also includes the entire BSON library, function calls with arguments and type validation, and informative error messages. On top of that, we have added a high level of optimization by using string templates to abstract a lot of the code generation. All of these developments will be described in a future blog post.

Written by Anna Herlihy, Alena Khineika, & Irina Shestak.
Illustrations by Irina Shestak

Further Reading

- Compiler in JavaScript using ANTLR by Alena Khineika;
- Compiler Construction by Niklaus Wirth;
- Compilers: Principles, Techniques, and Tools by Alfred V. Aho, Monica S. Lam, Ravi Sethi and Jeffrey D. Ullman;
- The Elements of Computing Systems by Noam Nisan and Shimon Schocken.

If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

July 1, 2019
Engineering Blog

Repeatable Performance Tests: CPU Options Are Best Disabled

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup and found no evidence of particularly good or bad EC2 instances. In the next step, we investigated IO and found that EBS instances are the stable option for us. Having found very stable behavior as far as disks were concerned, this third and final experiment turns to tuning CPU-related knobs to minimize noise from this part of the system.

Investigate CPU Options

We had already built up knowledge around fine-tuning CPU options when setting up another class of performance benchmarks (single node benchmarks). That work had shown us that CPU options can have a large impact on performance, and it left us familiar with a number of knobs and options we could adjust:

| Knob | Where to set | Setting | What it does |
| --- | --- | --- | --- |
| Idle Strategy | Kernel Boot | idle=poll | Puts Linux into a loop when idle, checking for work. |
| Max sleep state (c4 only) | Kernel Boot | intel_idle.max_cstate=1 intel_pstate=disable | Disables the use of advanced processor sleep states. |
| CPU Frequency | Command Line | sudo cpupower frequency-set -d 2.901GHz | Sets a fixed frequency. Doesn't allow the CPU to vary the frequency for power saving. |
| Hyperthreading | Command Line | echo 0 > /sys/devices/system/cpu/cpu$i/online | Disables hyperthreading. Hyperthreading allows two software threads of execution to share one physical CPU. They compete against each other for resources. |

We added some CPU-specific tests to measure CPU variability.
These tests allow us to see if the CPU performance is noisy, independently of whether that noise makes MongoDB performance noisy. For our previous work on CPU options, we wrote some simple tests in our C++ harness that would, for example:

- multiply numbers in a loop (cpu bound)
- sleep 1 or 10 ms in a loop
- do nothing (no-op) in the basic test loop

We added these tests to our System Performance project. We were able to run the tests on the client only, and going across the network. We ran our tests 5x5 times, changing one configuration at a time, and compared the results. The first two graphs below contain results for the CPU-focused benchmarks; the third contains the MongoDB-focused benchmarks. In all the graphs below, we are graphing the "noise" metric as a percentage computed from (max-min)/median, and lower is better. We start with our focused CPU tests, first on the client only, and then connecting to the server. We’ve omitted the sleep tests from the client graphs for readability, as they were essentially 0.

Results for CPU-focused benchmarks with different CPU options enabled

The nop test is the noisiest test all around, which is reasonable because it’s doing nothing in the inner loop. The cpu-bound loop is more interesting. It is low on noise in many cases, but has occasional spikes for each case, except for the case of the c3.8xlarge with all the controls on (pinned to one socket, hyperthreading off, no frequency scaling, idle=poll).

Results for tests run on server with different CPU options enabled

When we connect to an actual server, the tests become more realistic, but also introduce the network as a possible source of noise. In the cases in which we multiply numbers in a loop (cpuloop) or sleep in a loop (sleep), the final c3.8xlarge with all controls enabled is consistently among the lowest noise and doesn’t do badly on the ping case (no-op on the server). Do those results hold when we run our actual tests?
Results for tests run on server with different CPU options enabled

Yes, they do. The right-most blue bar is consistently around 5%, which is a great result! Perhaps unsurprisingly, this is the configuration where we used all of the tuning options: idle=poll, hyperthreading disabled, and only a single socket in use. We continued to compare c4 and c3 instances against each other for these tests. We expected that the c4, being a newer architecture and having more tuning options, would achieve better results. But this was not the case; rather, the c3.8xlarge continued to have the smallest range of noise. Another assumption that was wrong! We expected that write-heavy tests, such as batched inserts, would mostly benefit from the more stable IOPS on our new EBS disks, and that the CPU tuning would mostly affect cpu-bound benchmarks such as map-reduce or index builds. It turns out this was wrong too: for our write-heavy tests, noise did not in fact predominantly come from disk. The tuning available for CPUs has a huge effect on threads that are waiting or sleeping. The performance of threads that are actually running at full speed is less affected - in those cases the CPU runs at full speed as well. Therefore, IO-heavy tests are affected a lot by CPU tuning!

Disabling CPU options in production

Deploying these configurations into production made insert tests even more stable from day to day:

Improvements in daily performance measurements through changing to EBS and disabling CPU options

Note that the absolute performance of some tests actually dropped, because the number of available physical CPUs dropped by half due to only using a single socket, and disabling hyperthreading caused a further drop, though not quite a full half, of course.

Conclusion

Drawing upon prior knowledge, we decided to fine-tune CPU options. We had previously assumed that IO-heavy tests would have a lot of noise coming from disk and that CPU tuning would mostly affect CPU-bound tests.
As it turns out, the tuning available for CPUs actually has a huge effect on threads that are waiting or sleeping, and therefore has a huge effect on IO-heavy tests. Through CPU tuning, we achieved very repeatable results. The overall measured performance in the tests decreases, but this is less important to us: we care about stable, repeatable results more than maximum performance. This is the third and last of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EC2 instances are neither good nor bad and that EBS instances are the stable option. If you found this interesting, be sure to tweet it. Also, don't forget to follow us for regular updates.

April 30, 2019
Engineering Blog

Reducing Variability in Performance Tests on EC2: Setup and Key Results

On the MongoDB Performance team, we use EC2 to run daily system performance tests. After building a continuous integration system for performance testing, we realized that there were sources of random variation in our platform and system configuration which made a lot of our results non-reproducible. The run-to-run variation from the platform was bigger than the changes in MongoDB performance that we wanted to capture. To reduce such variation - environmental noise - from our test configuration, we set out on a project to measure and control for the EC2 environments on which we run our tests.

At the outset of the project there was a lot of doubt and uncertainty. Maybe using a public cloud for performance tests is a bad idea and we should just give up and buy more hardware to run them ourselves? We were open to that possibility; however, we wanted to do our due diligence before taking on the cost and complexity of owning and managing our own test cluster.

Performance benchmarks in continuous integration

MongoDB uses a CI platform called Evergreen to run tests on incoming commits. We also use Evergreen for running multiple classes of daily performance tests. In this project we are focused on our highest-level tests, meant to represent actual end-user performance. We call these tests System Performance tests.

For System Performance tests, we use EC2 to deploy real and relatively beefy clusters of c3.8xlarge nodes for various MongoDB clusters: standalone servers, 3-node replica sets, and sharded clusters. These are intended to be representative of how customers run MongoDB. Using EC2 allows us to flexibly and efficiently deploy such large clusters as needed. Each MongoDB node in the cluster is run on its own EC2 node, and the workload is driven from another EC2 node.

Repeatability

There's an aspect of performance testing that is not obvious and often ignored.
Most benchmarking blogs and reports focus on the maximum performance of a system, or on whether it is faster than some competitor system. For our CI testing purposes, we primarily care about repeatability of the benchmarks: the same set of tests for the same version of MongoDB on the same hardware should produce the same results whether run today or in a few months. We want to be able to detect small changes in performance due to our ongoing development of MongoDB. A customer might not get very upset about a 5% change in performance, but they will get upset about multiple 5% regressions adding up to a 20% regression. The easiest way to avoid large regressions is to identify and address the small regressions promptly as they happen, and stop them from reaching releases or release candidates. We do want to stress MongoDB with a heavy load, but achieving some kind of maximum performance is completely secondary to this test suite's goal of detecting changes. For some of our tests, repeatability wasn't looking so good. In the below graph, each dot represents a daily build (spoiler -- you'll see this graph again):

Variability in daily performance tests

Eyeballing the range from highest to lowest result, the difference is over 100,000 documents/second from day to day - or, as a percentage, a 20-30% range.

Investigation

To reduce such variation from our test configuration, we set out on a project to reduce any environmental noise. Instead of focusing on the difference between daily MongoDB builds, we ran tests to focus on EC2 itself.

Process: Test and Analyze

Benchmarking is really an exercise in the basic scientific process:

- Try to understand a real-world phenomenon, such as an application that uses MongoDB
- Create a model (aka benchmark) of that phenomenon (this may include setting a goal, like "more updates/sec")
- Measure
- Analyze and learn from the results
- Repeat: do you get the same result when running the benchmark / measuring again?
- Change one variable (based on analysis) and repeat from above

We applied this benchmarking process to evaluate the noise in our system. Our tests produce metrics measuring the average operations per second (ops/sec). Occasionally we also record other values, but generally we use ops/sec as our result. To limit other variables, we locked the mongod binary to a stable release (3.4.1) and repeated each test 5 times on 5 different EC2 clusters, thus producing 25 data points.

We used this system to run repeated experiments. We started with the existing system and considered our assumptions to create a list of potential tests that could help us determine what to do to decrease the variability in the system. Whenever we weren't happy with the results, we returned to this list and picked the most promising feature to test. We created focused tests to isolate the specific feature, ran the tests, and analyzed our findings. Any workable solutions we found were then put into production. For each test, we analyzed the 25 data points, with the goal of finding a configuration that minimizes this single metric:

range = (max - min) / median

Being able to state your goal as a single variable such as the above is very powerful. Our project now becomes a straightforward optimization process of trying different configurations in order to arrive at the minimum value for this variable. It's also useful that the metric is a percentage, rather than an absolute value. In practice, we wanted to be able to run all our tests so that the range would always stay below 10%. Note that the metric we chose to focus on is more ambitious than, for example, focusing on reducing variance. Variance would help minimize the spread of most test results while being fairly forgiving about one or two outliers. For our use case, an outlier represents a false regression alert, so we wanted to find a solution without any outliers at all, if possible.
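As a sketch, the range metric can be computed in a few lines of Python; the ops/sec numbers below are invented for illustration, not real measurements:

```python
def noise_range(results):
    """(max - min) / median, expressed as a fraction of the median."""
    s = sorted(results)
    n = len(s)
    median = s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2
    return (s[-1] - s[0]) / median

# Five hypothetical ops/sec results for one test on one cluster:
ops_per_sec = [98_000, 101_500, 99_800, 102_300, 100_100]
print(f"range = {noise_range(ops_per_sec):.1%}")  # comfortably under the 10% target
```

In production the list would hold all 25 data points (5 trials on each of 5 clusters), and a configuration "wins" if this number stays small across every test.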
Any experiment of this form has a tension between the accuracy of the statistics and the expense (time and money) of running the trials. We would have loved to collect many more trials per cluster, and more distinct clusters per experiment, giving us higher confidence in our results and enabling more advanced statistics. However, we also work for a company that needed the business impact of this project (lower noise) as soon as possible. We felt that 5 trials per cluster times 5 clusters per experiment gave us sufficient data fidelity at a reasonable cost.

Assume nothing. Measure everything.

The experimental framework described above can be summarized in the credo: assume nothing, measure everything.

In the spirit of intellectual honesty, we admit that we have not always followed this credo, usually to our detriment. We definitely did not follow it when we initially built the System Performance test suite. We needed the test suite up as soon as possible (preferably yesterday). Instead of testing everything, we made a best effort to stitch together a useful system based on intuition and previous experience, and put it into production. It's not unreasonable to throw things together quickly in a time of need (or as a prototype). However, when you (or we) do so, you should check whether the end results meet your needs, and take the results with a large grain of salt until thoroughly verified. Our system gave us results. Sometimes those results pointed us at useful things, and other times they sent us off on wild goose chases.

Existing Assumptions

We made a lot of assumptions when getting the first version of the System Performance test suite up and running.
We will look into each of these in more detail later, but here is the list of assumptions that were built into the first version of our System Performance environment:

- A dedicated instance means more stable performance
- Placement groups minimize network latency & variance
- Different availability zones have different hardware
- For write-heavy tests, noise predominantly comes from disk
- Ephemeral (SSD) disks have the least variance
- Remote EBS disks have unreliable performance
- There are good and bad EC2 instances

In addition, the following suggestions were proposed as solutions to reducing noise in the system:

- Just use i2 instances (better SSD) and be done with it
- Migrate everything to Google Cloud
- Run on-prem -- you'll never get acceptable results in the cloud

Results

After weeks of diligently executing the scientific process of hypothesize - measure - analyze - repeat, we found a configuration where the range of variation when repeating the same test was less than 5%. Most of the configuration changes were normal Linux and hardware configurations that would be needed on on-premise hardware just the same as on EC2. We thus proved one of the biggest hypotheses wrong: "You can't use the cloud for performance testing."

With our first experiment, we found that there was no correlation between test runs and the EC2 instances they were run on. Please note that these results could be specific to our usage of the instance type; you should measure your own systems to figure out the best configuration for your own system. You can read more about the specific experiment and its analysis in our blog post EC2 instances are neither good nor bad. (This disproved the assumption that there are good and bad EC2 instances.)

After running the first baseline tests, we decided to investigate IO performance. Using EC2, we found that by using Provisioned IOPS we get a very stable rate of disk I/O per second. To us, it was surprising that ephemeral (SSD) disks were essentially the worst choice.
After switching our production configuration from ephemeral SSD to EBS disks, the variation of our test results decreased dramatically. You can read more about our specific findings and how different instance types performed in our dedicated blog post EBS instances are the stable option. This disproved or qualified several more of the original assumptions:

- Ephemeral (SSD) disks have the least variance: disproved
- Remote EBS disks have unreliable performance: not with PIOPS
- Just use i2 instances (better SSD) and be done with it: true in theory

Next, we turned our attention to CPU tuning. We learned that disabling CPU options does not only stabilize CPU-bound performance results; in fact, noise in IO-heavy tests also seems to go down significantly with CPU tuning. (This disproved the assumption that, for write-heavy tests, noise predominantly comes from disk.) After we disabled CPU options, the variance in performance decreased again. In the below graph you can see how changing from SSD to EBS and disabling CPU options reduced the performance variability of our test suite. You can read more about the CPU options we tuned in our blog post Disable CPU options.

Improvements in daily performance measurements through changing to EBS and disabling CPU options

At the end of the project we hadn't tested all of our original assumptions, but we had tested many of them. We still plan to test the remaining ones when time and priority allow:

- A dedicated instance means more stable performance
- Placement groups minimize network latency & variance
- Different availability zones have different hardware

Through this process we also found that the previously suggested solutions would not have solved our pains either:

- Just use i2 instances (better SSD) and be done with it: true in theory
- Migrate everything to Google Cloud: not tested!

Conclusion of the tests

In the end, there was still noise in the system, but we had reduced it sufficiently that our System Performance tests were now delivering real business value to the company.
Every bit of noise bothers us, but at the end of the day we got to a level of repeatability at which test noise was no longer our most important performance-related problem. As such, we stopped the all-out effort on reducing system noise at this point.

Adding in safeguards

Before we fully moved on to other projects, we wanted to make sure to put up some safeguards for the future. We invested a lot of effort into reducing the noise, and we didn't want to discover some day in the future that things had changed and our system was noisy again. Just like we want to detect changes in the performance of MongoDB software, we also want to detect changes in the reliability of our test platform. As part of our experiments, we built several canary benchmarks which give us insights into EC2 performance itself, based on non-MongoDB performance tests. We decided to keep these tests and run them as part of every Evergreen task, together with the actual MongoDB benchmark that the task is running. If a MongoDB benchmark shows a regression, we can check whether a similar regression can be seen in any of the canary benchmarks. If yes, then we can just rerun the task and check again. If not, it's probably an actual MongoDB regression.

If the canary benchmarks do show a performance drop, it is possible that the vendor has deployed upgrades or configuration changes. Of course, in the public cloud this can happen at arbitrary times, and possibly without the customers ever knowing. In our experience such changes are infrequently the cause of performance changes, but running a suite of "canary tests" gives us visibility into the day-to-day performance of the EC2 system components themselves, and thus increases confidence in our benchmark results. The canary tests give us an indication of whether we can trust a given set of test results, and enable us to clean up our data.
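The triage logic described above can be sketched in a few lines. The function name, the -5% threshold, and the deltas are all illustrative assumptions, not our production values:

```python
def classify_regression(mongo_delta_pct, canary_deltas_pct, threshold=-5.0):
    """Deltas are percent change vs. baseline; negative means slower.

    If the MongoDB benchmark regressed but a canary regressed too,
    suspect the platform and rerun; otherwise treat it as a real regression.
    """
    if mongo_delta_pct > threshold:
        return "no regression"
    if any(d <= threshold for d in canary_deltas_pct):
        return "platform noise: rerun the task"
    return "likely MongoDB regression"

print(classify_regression(-12.0, [-1.3, 0.4, -2.1]))    # canaries healthy
print(classify_regression(-12.0, [-30.0, -28.5, -1.0])) # canaries dropped too
```

The first call flags a likely MongoDB regression; the second suggests rerunning, since the canaries show the platform itself slowed down.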
Most importantly, we no longer need to debate whether it is possible to run performance benchmarks in a public cloud, because we measure EC2 itself!

Looking forward

This work was completed over 1.5 years ago. Since that time it has provided the foundation that all our subsequent and future work has been built upon. It has led to three major trends:

1. We use the results. Because we lowered the noise enough, we are able to regularly detect performance changes, diagnose them, and address them promptly. Additionally, developers are also "patch testing" their changes against System Performance now. That is, they are using System Performance to test the performance of their changes before they commit them, and address any performance changes before committing their code. Not only have we avoided regressions entering into our stable releases; in these cases we've avoided performance regressions ever making it into the code base (master branch).
2. We've added more tests. Since we find our performance tests more useful, we naturally want more such tests, and we have been adding more to our system. In addition to our core performance team, the core database developers also have been steadily adding more tests. As our system became more reliable and therefore more useful, the motivation to create tests across the entire organization has increased. We now have the entire organization contributing to the performance coverage.
3. We've been able to extend the system. Given the value the company gets from the system, we've invested in extending it. This includes adding more automation, new workload tools, and more logic for detecting performance changes. None of that would have been feasible or worthwhile without lowering the noise of the System Performance tests to a reasonable level. We look forward to sharing more about these extensions in the future.
Coda: Spectre/Meltdown

As we came back from the 2018 New Year's holidays, just like everyone else we got to read the news about the Meltdown and Spectre security vulnerabilities. Then, on January 4, all of our tests went red! Did someone make a bad commit into MongoDB, or is it possible that Amazon had deployed a security update with a performance impact? It turned out that one of our canary tests - the one sensitive to CPU and networking overhead - had caught the 30% drop too! Later, on January 13, performance recovered. Did Amazon undo the fixes? We believe so, but have not heard it confirmed.

Performance drops on January 4th and bounces back on January 13th

The single spike just before January 13 is a rerun of an old commit. This confirms the conclusion that the change in performance comes from the system: running a January 11 build of MongoDB after January 13 results in higher performance. The results therefore depend on the date the test was run, rather than on which commit was tested. As the world was scrambling to assess the performance implications of the necessary fixes, we could just sit back and watch them in our graphs. Getting on top of EC2 performance variations has truly paid off.

Update: @msw pointed us to this security bulletin, confirming that one of the Intel microcode updates was indeed reverted on January 13.

April 30, 2019
Engineering Blog

Repeatable Performance Tests: EC2 Instances are Neither Good Nor Bad

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup.

Investigate the status quo

Our first experiment created a lot of data that we sifted through with many graphs. We started with graphs of simple statistics from repeated experiments: the minimum, median, and maximum result for each of our existing tests. The graphs allowed us to concisely see the range of variation per test, and which tests were noisy. Here is a representative graph for batched insert tests:

High variance in the performance of batched insert tests

In this graph there are two tests, each run three times at different thread levels (the integer at the end of the test name). The whiskers around the median denote the minimum and maximum results (from the 25-point sample). Looking at this graph, we observe that thread levels for these two tests aren't optimally configured: when running these two tests with 16 and 32 parallel threads, the threads have already saturated MongoDB, and the additional concurrency merely adds noise to the results. We noticed other configuration problems in other tests. We didn't touch the test configurations during this project, but later, after we had found a good EC2 configuration, we did revisit this issue and reviewed all test configurations to further minimize noise.

Lesson learned: when you don't follow the disciplined approach of "Measure everything. Assume nothing." from the beginning, probably more than one thing has gone wrong.
EC2 instances are neither good nor bad

We looked at the first results in a number of different ways. One way showed us the results from all 25 trials in one view:

Performance results do not correlate with the clusters they are run on

As we examined the results, one very surprising conclusion immediately stood out from the above graph: there are neither good nor bad EC2 instances. When we originally built the system, someone had read somewhere on the internet that on EC2 you can get good and bad instances, noisy neighbours, and so on. There are even tools and scripts you can use to deploy a couple of instances, run some smoke tests, and, if performance results are poor, shut them down and try again. Our system was in fact doing exactly that, and on a bad day would shut down and restart instances for 30 minutes before eventually giving up. (For a cluster with 15 expensive nodes, that can be an expensive 30 minutes!) Until now, this assumption had never been tested. If the assumption had been true, then in the above graph you would expect to see points 1-5 have roughly the same value, followed by a jump to point 6, after which points 6-10 again would have roughly the same value, and so on. However, this is not what we observe. There's a lot of variation in the test results, but ANOVA tests confirm that this variation does not correlate with the clusters the tests are run on. From this data, it appears that there is no evidence to support the claim that there are good and bad EC2 instances.

Note that it is completely possible that this result is specific to our system. For example, we use (and pay extra for) dedicated instances to reduce sources of noise in our benchmarks. It's quite possible that the issue with noisy neighbours is real, and simply doesn't happen to us because we don't have neighbours. The point is: measure your own systems; don't blindly believe stuff you read on the internet.
Conclusion

By measuring our system and analyzing the data in different ways, we were able to disprove one of the assumptions we had made when building our system, namely that there are good and bad EC2 instances. As it turns out, the variance in performance between different test runs does not correlate with the clusters the tests are run on. This is only one of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EBS instances are the stable option and CPU options are best disabled.

April 30, 2019
Engineering Blog

Repeatable Performance Tests: EBS Instances are the Stable Option

In an effort to improve repeatability, the MongoDB Performance team set out to reduce noise on several performance test suites run on EC2 instances. At the beginning of the project, it was unclear whether our goal of running repeatable performance tests in a public cloud was achievable. Instead of debating the issue based on assumptions and beliefs, we decided to measure noise itself and see if we could make configuration changes to minimize it. After thinking about our assumptions and the experiment setup, we began by recording data about our current setup and found no evidence of particularly good or bad EC2 instances. However, we found that the results of repeated tests had a high variance. Given our test data and our knowledge of the production system, we had observed that many of the noisiest tests did the most IO (being sensitive to either IO latency or bandwidth). After performing the first baseline tests, we therefore decided to focus on IO performance by testing both AWS instance types and the IO configuration on those instances.

Investigate IO

As we are explicitly focusing on IO in this step, we added an IO-specific test (fio) to our system. This allows us to isolate the impact of IO noise on our existing benchmarks. The IO-specific tests focus on:

- Operation latency
- Streaming bandwidth
- Random IOPS (IO operations per second)

We look first at the IO-specific results, and then at our general MongoDB benchmarks. In the below graph, we are plotting the "noise" metric as a percentage computed from (max-min)/median; lower is better. c3.8xlarge with ephemeral storage is our baseline configuration, which we were using in our production environment.

i2.8xlarge shows best results with low noise on throughput and latency

The IO tests show some very interesting results. The c3.8xlarge with EBS PIOPS shows less noise than the c3.8xlarge with its ephemeral disks. This was quite unexpected.
In fact, the c3.8xlarge with ephemeral storage (our existing configuration) is just about the worst choice. The i2.8xlarge looks best all around, with low noise on throughput and latency. The c4.8xlarge shows higher latency noise than the c3.8xlarge. We would have expected any difference to favor the c4.8xlarge instances, as they are EBS-optimized. After these promising results, we examined the results of our MongoDB benchmarks next. At the time that we did this work, MongoDB had two storage engines (WiredTiger and MMAPv1), with MMAPv1 being the default, but now deprecated, option. There were differences in the results between the two storage engines, but they shared a common trend.

c3.8xlarge with PIOPS performs best, with all results below 10% noise for the WiredTiger storage engine

c3.8xlarge with PIOPS performs best, with most results below 10% noise for the MMAPv1 storage engine

No configuration was best across the board. That said, there was a configuration with below 10% noise for all but one test: c3.8xlarge with EBS PIOPS. Interestingly, while i2 was the best for our focused IO tests, it was not for our actual tests. Valuable lessons learned:

- As far as repeatable results are concerned, the "local" SSDs we had been using performed worse than any other alternative we could have possibly chosen!
- Contrary to popular belief, when using Provisioned IOPS with EBS, the performance is both good in absolute terms and very stable. This is true for our IO tests and our general tests. The latency of disk requests does have more variability than the SSD alternatives, but the IOPS performance was super stable. For most of our tests, the latter is the important characteristic.
- The i2 instance family has a much higher-performance SSD, and in fio tests showed almost zero variability. It also happens to be a very expensive instance type.
However, while this instance type was indeed a great choice in theory, it turns out that our MongoDB test results on it were quite noisy. Upon further investigation, we learned that the noisy results were due to unstable performance of MongoDB itself: as the i2.8xlarge has more RAM than the c3.8xlarge, MongoDB on the i2.8xlarge is able to hold much more dirty data in RAM, and flushing that much dirty data to disk was causing issues.

Switching from ephemeral to EBS disks in production

Based on the above results, we changed our production configuration to run on EBS disks instead of ephemeral SSD. (We were already running on c3.8xlarge instance types, which turned out to have the lowest noise in the above comparison, so we decided to keep using those.)

Performance becomes more stable when using EBS

After running with the changes for a couple of weeks, you could clearly see how the day-to-day variation of test results decreased dramatically. This instantly made the entire System Performance project more useful to the development team and MongoDB as a whole.

Conclusion

Focusing on IO performance proved useful. As it turns out, using ephemeral (SSD) disks was just about the worst choice for our performance tests; instead, using Provisioned IOPS gave the most stable results. While i2 instances were the best in our non-MongoDB benchmark tests, they proved less than ideal in practice. This highlights quite clearly that you need to measure your actual system and assume nothing to get the best results. This is the second of three bigger experiments we performed in our quest to reduce variability in performance tests on EC2 instances. You can read more about the top-level setup and results, as well as how we found out that EC2 instances are neither good nor bad and that CPU options are best disabled.

April 30, 2019
Engineering Blog

Causal Guarantees Are Anything but Casual

Traditional databases, because they service reads and writes from a single node, naturally provide sequential ordering guarantees for read and write operations known as "causal consistency". A distributed system can provide these guarantees, but in order to do so, it must coordinate and order related events across all of its nodes, and limit how fast certain operations can complete. While causal consistency is easiest to understand when all data ordering guarantees are preserved – mimicking a vertically scaled database, even when the system encounters failures like node crashes or network partitions – there exist many legitimate consistency and durability tradeoffs that all systems need to make.

MongoDB has been continuously running — and passing — Jepsen tests for years. Recently, we have been working with the Jepsen team to test for causal consistency. With their help, we learned how complex the failure modes become if you trade consistency guarantees for data throughput and recency.

Causal consistency defined

To maintain causal consistency, the following guarantees must be satisfied: read your writes, monotonic reads, monotonic writes, and writes follow reads.

To show how causal guarantees provide value to applications, let's review an example where no causal ordering is enforced. The distributed system depicted in Diagram 1 is a replica set. This replica set has a primary (or leader) that accepts all incoming client writes, and two secondaries (or followers) that replicate those writes. Either the primary or secondaries may service client reads.
Diagram 1: Flow of Operations in a Replica Set without Enforced Causal Consistency

1. The client application writes order 234 to the primary
2. The primary responds that it has successfully applied the write
3. Order 234 is replicated from the primary to one of the secondaries
4. The client application reads the orders collection on a secondary
5. The targeted secondary hasn't seen order 234, so it responds with no results
6. Order 234 is replicated from the primary to the other secondary

The client makes an order through the application. The application writes the order to the primary and reads from a secondary. If the read operation targets a secondary that has yet to receive the replicated write, the application fails to read its own write. To ensure the application can read its own writes, we must extend the sequential ordering of operations on a single node to a global partial ordering for all nodes in the system.

Implementation

So far, this post has only discussed replica sets. However, to establish a global partial ordering of events across distributed systems, MongoDB has to account for not only replica sets but also sharded clusters, where each shard is a replica set that contains a partition of data. To establish a global partial ordering of events for replica sets and sharded clusters, MongoDB implemented a hybrid logical clock based on the Lamport logical clock. Every write or event that changes state in the system is assigned a time when it is applied to the primary. This time can be compared across all members of the deployment. Every participant in a sharded cluster, from drivers to query routers to data-bearing nodes, must track and send its value of the latest time in every message, allowing each node across shards to converge in its notion of the latest time. The primaries use the latest logical time to assign new times to subsequent writes. This creates a causal ordering for any series of related operations.
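To make the clock mechanics concrete, here is a minimal Lamport logical clock in Python. This is a toy sketch of the underlying idea only: MongoDB's hybrid logical clock also incorporates a physical-time component, which is omitted here.

```python
class LamportClock:
    """Toy Lamport clock: time only moves forward, across all nodes."""

    def __init__(self):
        self.time = 0

    def tick(self):
        # A local event, e.g. a write applied on the primary, advances the clock.
        self.time += 1
        return self.time

    def observe(self, received):
        # Merge the time carried by an incoming message (e.g. replication).
        self.time = max(self.time, received)
        return self.time

primary, secondary = LamportClock(), LamportClock()
t1 = primary.tick()      # write W1 is assigned time T1
secondary.observe(t1)    # the replication message carries T1
# A causal read "after T1" on the secondary is safe once its clock has
# caught up to T1; until then, the secondary would have to wait.
assert secondary.time >= t1
```

The `observe` step is what lets every participant converge on the latest time, so a node knows when it is safe to answer a read that must happen after a given write.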
A node can use the causal ordering to wait before performing a needed read or write, ensuring it happens after another operation. For a deeper dive into implementing cluster-wide causal consistency, review Misha Tyulenev's talk. Let's revisit our example from Diagram 1, now enforcing causal consistency:

Diagram 2: Flow of Operations in a Replica Set with Enforced Causal Consistency

1. The client application writes order 234 to the primary
2. The primary responds that it has successfully recorded the write at time T1
3. Order 234 is replicated from the primary to one of the secondaries
4. The client application reads after time T1 on a secondary
5. The targeted secondary hasn't seen time T1, so it must wait to respond
6. Order 234 is replicated from the primary to the other secondary
7. The secondary is able to respond with the contents of order 234

Write and read concerns

Write concern and read concern are settings that can be applied to each operation, even those within a causally consistent set of operations. Write concern offers a choice between latency and durability. Read concern is a bit more subtle; it trades stricter isolation levels for recency. These settings affect the guarantees preserved during system failures.

Write concerns

Write concern, or write acknowledgement, specifies the durability requirements of writes that must be met before returning a success message to the client. The write concern options discussed here are w: 1 (acknowledged by the primary alone) and w: "majority" (acknowledged by a majority of nodes).

Only a successful write with write concern majority is guaranteed to be durable for any system failure and never roll back. During a network partition, two nodes can temporarily believe they are the primary for the replica set, but only the true primary can see and commit to a majority of nodes. A write with write concern 1 can be successfully applied to either primary, whereas a write with write concern majority can succeed only on the true primary. However, this durability has a performance cost.
Every write that uses write concern majority must wait for a majority of nodes to commit before the client receives a response from the primary. Only then is that thread freed up to do other application work. In MongoDB, you can choose to pay this cost as needed at an operation level.

Read concern

Read concern specifies the isolation level of reads. Read concern local returns locally committed data, whereas read concern majority returns data that has been reflected in the majority committed snapshot that each node maintains. The majority committed snapshot contains data that has been committed to a majority of nodes and will never roll back in the face of a primary election. However, these reads can return stale data more often than read concern local. The majority snapshot may lack the most recent writes that have not yet been majority committed. This tradeoff could leave an application acting off old data. Just as with write concern, the appropriate read concern can be chosen at an operation level.

Effect of write and read concerns

With the rollout of causal consistency, we engaged the Jepsen team to help us explore how causal consistency interacts with read and write concerns. While we were all satisfied with the feature’s behavior under read/write concern majority, the Jepsen team did find some anomalies under other permutations. While less strict permutations may be more appropriate for some applications, it is important to understand the exact tradeoffs that apply to any database, distributed or not.

Failure scenario examples

Consider the behavior of different combinations of read and write concerns during a network partition where P1 has been partitioned from a majority of nodes and P2 has been elected as the new primary. Because P1 does not yet know it is no longer the primary, it can continue to accept writes. Once P1 is reconnected to a majority of nodes, all of its writes since the timeline diverged are rolled back.
Diagram 3: Network Partition Timeline

During this time, a client issues a causal sequence of operations as follows:

1. At time T1, perform a write W1
2. At time T2, perform a read R1

The following four scenarios discuss the different read and write concern permutations and their tradeoffs.

Read Concern majority with Write Concern majority

Diagram 4: Read Concern majority with Write Concern majority

The write W1 with write concern majority can only succeed when applied to a majority of nodes. This means that W1 must have executed on the true primary’s timeline and cannot be rolled back. The causal read R1 with read concern majority waits to see T1 majority committed before returning success. Because P1, partitioned from a majority of nodes, cannot progress its majority commit point, R1 can only succeed on the true primary’s timeline. R1 sees the definitive result of W1. All the causal guarantees are maintained when any failure occurs.

All writes with write concern majority prevent unexpected behavior in failure scenarios at the cost of slower writes. For their most critical data, like orders and trades in a financial application, developers can trade performance for durability and consistency.

Read Concern majority with Write Concern 1

Diagram 5: Read Concern majority with Write Concern 1

The write W1 using write concern 1 may succeed on either the P1 or P2 timeline even though a successful W1 on P1 will ultimately roll back. The causal read R1 with read concern majority waits to see T1 majority committed before returning success. Because P1, partitioned from a majority of nodes, cannot progress its majority commit point, R1 can only succeed on the true primary’s timeline. R1 sees the definitive result of W1. In the case where W1 executed on P1, the definitive result of W1 may be that the write did not commit. If R1 sees that W1 did not commit, then W1 will never commit. If R1 sees the successful W1, then W1 successfully committed on P2 and will never roll back.
This combination of read and write concerns gives causal ordering without guaranteeing durability if failures occur. Consider a large-scale platform that needs to quickly service its user base. Applications at scale need to manage high-throughput traffic and benefit from low-latency requests. When trying to keep up with load, longer response times on every request are not an option.

The Twitter posting UI is a good analogy for this combination of read and write concern: the pending tweet, shown in grey, can be thought of as a write with write concern 1. When we do a hard refresh, this workflow could leverage read concern majority to tell the user definitively whether the post persisted or not. Read concern majority helps the user safely recover. When we hard refresh and the post disappears, we can try again without the risk of double posting. If we see the post after a hard refresh at read concern majority, we know there is no risk of that post ever disappearing.

Read Concern local with Write Concern majority

Diagram 6: Read Concern local with Write Concern majority

The write W1 with write concern majority can only succeed when applied to a majority of nodes. This means that W1 must have executed on the true primary’s timeline and cannot be rolled back. With read concern local, the causal read R1 may occur on either the P1 or P2 timeline. The anomalies occur when R1 executes on P1, where the majority committed write is not seen, breaking the "read your own writes" guarantee. The monotonic reads guarantee is also not satisfied if multiple reads are sequentially executed across the P1 and P2 timelines. Causal guarantees are not maintained if failures occur.

Consider a site with reviews for various products or services where all writes are performed with write concern majority and all reads are performed with read concern local. Reviews require a lot of user investment, and the application will likely want to confirm they are durable before continuing.
Imagine writing a thoughtful two-paragraph review, only to have it disappear. With write concern majority, writes are never lost if they are successfully acknowledged. For a site with a read-heavy workload, the greater latency of rarer majority writes may not affect performance. With read concern local, the client reads the most up-to-date reviews for the targeted node. However, the targeted node may be P1 and is not guaranteed to include the client's own writes that have been successfully made durable on the true timeline. In addition, the node’s most up-to-date reviews may include other reviewers' writes that have not yet been acknowledged and may be rolled back.

Read Concern local with Write Concern 1

Diagram 7: Read Concern local with Write Concern 1

The combination of read concern local and write concern 1 has the same issues as the previous scenario, but now the writes lack durability. The write W1 using write concern 1 may succeed on either the P1 or P2 timeline even though a successful W1 on P1 will ultimately roll back. With read concern local, the causal read R1 may occur on either the P1 or P2 timeline. The anomalies occur when W1 executes on P2 and R1 executes on P1, where the result of the write is not seen, breaking the "read your own writes" guarantee. The monotonic reads guarantee is also not satisfied if multiple reads are sequentially executed across the P1 and P2 timelines. Causal guarantees are not maintained if failures occur.

Consider a sensor network of smart devices that does not handle failures encountered when reporting event data. These applications can have granular sensor data that drives high write throughput. The ordering of the sensor event data matters to track and analyze data trends over time. The micro view over a small period of time is not critical to the overall trend analysis, as packets can drop. Writing with write concern 1 may be appropriate to keep up with system throughput without strict durability requirements.
For high-throughput workloads and readers who prefer recency, the combination of read concern local and write concern 1 delivers the same behavior as primary-only operations across all nodes in the system, with the aforementioned tradeoffs.

Conclusion

Each operation in any system, distributed or not, makes a series of tradeoffs that affect application behavior. Working with the Jepsen team pushed us to consider the tradeoffs of read and write concerns when combined with causal consistency. MongoDB now recommends using both read concern majority and write concern majority to preserve causal guarantees and durability across all failure scenarios. However, other combinations, particularly read concern majority and write concern 1, may be appropriate for some applications. Offering developers a range of read and write concerns enables them to precisely tune consistency, durability, and performance for their workloads. Our work with Jepsen has helped better characterize system behavior under different failure scenarios, enabling developers to make more informed choices on the guarantees and tradeoffs available to them.

October 23, 2018
Engineering Blog

Pruning Dynamic Rebuilds With libabigail

Complex C++ projects frequently struggle with lengthy build times. Splitting a project into multiple dynamically-linked components can give developers faster incremental rebuilds and shorter edit-compile-test cycles than relying on static linking, especially when there are a large number of test binaries. However, build systems usually do not realize all of the possible gains in dynamic incremental rebuilds due to how they handle transitive library dependencies. Red Hat's ABI introspection library libabigail offers one possible path to eliminating unnecessary transitive re-linking for some classes of source modifications.

The problem

Consider the following toy project containing two libraries: libserver and libclient. The server library libserver depends on the client library libclient for wire protocol code, and both the client and server support library implementations depend on a library of common utilities, libcommon. The client and server executables each use the associated support libraries.

We can see a more complete picture of our dependency graph by considering the header files and source files that are used to build these libraries, as well as the intermediate targets such as object files. We will assume that each library has one header and one source file. The dependency graph now looks like the following:

Finally, we assume a build system that can use content signatures to skip rebuilds when a dependency is regenerated with identical results. A build system that only uses timestamps cannot capitalize on the technique outlined below, because regenerated dependencies always have newer timestamps.

In this environment, what will be rebuilt if we make a meaningful change to libcommon.hpp or libcommon.cpp, and ask for the client and server binaries to be built? Well, changing libcommon.hpp is a disaster! We need to recompile libcommon.cpp, generating a new libcommon.o, and therefore a new libcommon.[a|so].
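The content-signature behavior we assume can be modeled in a few lines (a toy sketch, not SCons itself): a target is considered out of date only when a dependency's content hash differs from what the target last saw, so regenerating a dependency with byte-identical results triggers no rebuild.

```python
import hashlib

def content_signature(data: bytes) -> str:
    """Hash of a dependency's contents; identical bytes give an identical signature."""
    return hashlib.sha256(data).hexdigest()

class Builder:
    """Toy content-signature build check: remembers the dependency
    signatures each target saw at its last build."""

    def __init__(self):
        self.seen = {}  # target -> {dependency name: signature}

    def needs_rebuild(self, target: str, deps: dict) -> bool:
        sigs = {name: content_signature(data) for name, data in deps.items()}
        stale = self.seen.get(target) != sigs
        self.seen[target] = sigs  # record what this build saw
        return stale

b = Builder()
assert b.needs_rebuild("libcommon.so", {"libcommon.o": b"v1"})      # first build
assert not b.needs_rebuild("libcommon.so", {"libcommon.o": b"v1"})  # identical: skip
assert b.needs_rebuild("libcommon.so", {"libcommon.o": b"v2"})      # changed: rebuild
```

A timestamp-based builder would report `True` in the middle case too, which is exactly why the technique below requires content signatures.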
Similarly, since both libclient.cpp and libserver.cpp depend on libcommon.hpp, they need to be recompiled and the associated libraries rebuilt. Since the libserver and libclient support libraries were relinked, the executables are now out of date, so they also get relinked. The only work we avoided doing was recompiling client.cpp and server.cpp, since they don’t directly depend on libcommon.hpp. Ouch. Well, that’s C++ for you. Perhaps C++ modules will improve this situation, but we don’t live in that world yet.

The following diagrams demonstrate this graphically, where:

- The darkest red box is the entity which was directly changed.
- The intermediate red indicates an entity that is rebuilt because one of its direct dependencies is seen by the build system as changed.
- The lightest red is an unaltered entity that is seen as out of date by the build system due to a change in an implicit dependency like a header inclusion.

Changing just libcommon.cpp isn’t much better. We avoid needing to recompile lib{client,server}.cpp, but we still do a bunch of relinking. Here is how that looks for a static build:

Note that we have cheated a bit in our diagram: in a static build, lib{client,server}.a don’t really depend on libcommon.a. Instead, server and client depend on it directly. So the fact that libcommon.a changed doesn’t require us to re-run the archiver for libclient.a and libserver.a. But drawing it that way makes the diagram a lot messier.

The dynamic build is actually worse here, because in that case we do need to relink libclient.so and libserver.so, since their link-time dependency libcommon.so is newer/changed. Potentially, you might get away with not relinking client and server in the dynamic case, since the relink of libclient.so and libserver.so may very well produce identical results in this situation, and a content-signature-based build system would notice that.
But in practice, libcommon.so is likely also going to show up on the link line for client and server, since otherwise static builds won’t work. So unless you have written or generated varying library dependency lists for static and dynamic builds, libcommon.so is very likely to be on the link line for client and server too, making them additional link-time casualties.

An insight

Let us imagine that the change to libcommon.cpp was something small and innocuous, maybe fixing a typo in an internal string constant that gets logged. In a small example like this, it isn’t too painful that we needed to relink so many things. But in a larger project it can definitely hurt. It feels wrong to do so much linking for such a little change. Especially in the dynamic build, a small change deep in the library dependency graph can lead to a long chain of transitive relinking, even when many of the libraries are completely unaltered. Can we do better?

With static linking, no, not really. That updated string constant needs to exist in both executables, so we really need to relink them so that the new string constant is extracted from libcommon.a. With dynamic linking, it turns out that we can do better. The key observation is to consider what would happen if we rebuilt only libcommon.so, intentionally didn’t relink the other dynamic libraries or executables (even though the build system thinks we should), and then tried to run the executables. Would they work, and work as expected?

For the case of our proposed private string constant modification in libcommon.cpp, the answer is a definite yes. Changing that internal string constant didn’t alter the Application Binary Interface (ABI) of libcommon.so in any way, and when the executables are run, the updated value of the string constant will be reflected in the output, because the string constant wasn’t copied into the executables: it lives in the now-replaced libcommon.so.
We got away with this because our change didn’t alter the ABI of libcommon.so. If we had made a change that altered its ABI, rebuilt libcommon.so, and then tried running the executables without relinking them, we would probably be looking at a very subtle runtime crash. No fun. In theory, then, you could minimize relinking by individually naming targets to build when you knew that you had ABI-affecting changes. But in practice that is clearly error prone and just a terrible idea. But if we had a tool that could tell us when the ABI of a library had changed, then we could teach our build system how to invoke this tool as it did its dependency walk, and automatically skip any unnecessary relinks in the case of ABI-preserving modifications.

A solution

Fortunately, Red Hat has provided just such a tool as part of their new library libabigail. As they describe it: "the project aims at providing a library to manipulate ABI corpora, compare them, provide detailed information about their differences and help build tools to infer interesting conclusions about these differences."

The abidw tool that comes with libabigail reads a shared library, consults the associated ELF and DWARF information which together encode all information relevant to the ABI, and emits an XML document that describes the library ABI. Taking advantage of the flexibility of SCons, we can augment it to invoke abidw on a library immediately after we build it, compute a hash of the resulting ABI XML, and then store that hash in a file alongside the library. When another target declares that it links to the library, we tell SCons to record a dependency on the ABI hash file instead of a dependency on the library itself.

As a result, if the library is relinked but its ABI doesn’t change, then the ABI hash file will have the same contents. Since SCons uses content signatures to detect whether targets are out of date, the ABI hash file is seen as up to date, even if it was regenerated.
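Concretely, the post-build step can be sketched as follows (a simplified illustration, not the actual SCons tool; `abi_xml_for` assumes `abidw` is on the PATH, and the function names are ours):

```python
import hashlib
import subprocess

def abi_xml_for(library_path: str) -> bytes:
    """Extract the library's ABI description as XML (requires abidw on PATH)."""
    return subprocess.check_output(["abidw", library_path])

def update_abi_hash_file(hash_path: str, abi_xml: bytes) -> bool:
    """Store a hash of the ABI XML alongside the library. Returns True only
    when the ABI signature actually changed; an unchanged file means a
    content-signature build system will not relink dependents."""
    new_sig = hashlib.md5(abi_xml).hexdigest().encode()
    try:
        with open(hash_path, "rb") as f:
            if f.read() == new_sig:
                return False  # ABI-preserving relink: dependents stay up to date
    except FileNotFoundError:
        pass
    with open(hash_path, "wb") as f:
        f.write(new_sig)
    return True
```

Targets that link against the library then declare a dependency on the `.abihash` file produced this way rather than on the shared library itself.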
Since that dependency is seen as up to date, the depending target is also considered up to date. ABI-preserving modifications to a library no longer cause dependents to relink! The following diagram updates our original to include the associated ABI hash files and strips out some now-unhelpful sources and headers. We now also distinguish between a dependency relationship (solid line) and a links-to/requires relationship (dashed line):

Now, if we make an ABI-affecting change to libcommon.so, we see that libclient.so and libserver.so are relinked. But libclient.so and libserver.so only use libcommon.so internally, so their ABI has not changed. The client and server executables do not need to be relinked:

On the other hand, if we make a change to libcommon.so that does not affect the ABI, then nothing else gets relinked:

Correctness

Is it safe to do this? We believe it to be. We have not yet thought of any cases where it is not. Additionally, a false positive (e.g. a claim that the ABI changed when it didn’t) only costs us a missed optimization. A false negative would be harmful, but would represent a serious bug in libabigail. Additionally, we are currently only offering this facility as an opt-in for developer builds; the builds that we ship to customers do not use it.

There are a few important correctness issues to be aware of, however:

- There is a poor interaction with the -gsplit-dwarf flag for debug fission. libabigail uses the elfutils library for its DWARF processing, and elfutils as yet doesn’t know how to reach out to the .dw{o,p} files that -gsplit-dwarf and the associated tooling creates. Since libabigail relies on the DWARF info to identify the ABI, running abidw on a library built from objects built with -gsplit-dwarf gives incorrect results. So you can’t use both ABI-driven linking and debug fission at the same time. Presumably, this limitation will be lifted as support for the new DWARF 5 standard is incorporated into elfutils.
- The libabigail library is new, and we have had several instances where it crashed when working with our libraries. Dodji Seketeli, the author of libabigail, has been very helpful and responsive in addressing those crashes, but you will need to have a fairly bleeding-edge version of abidw available if you want this technique to work well in practice.

- Taking full advantage of the technique requires that symbol visibility annotations be correctly applied to type and function definitions, and typically that all code can be built with -fvisibility=hidden. Otherwise, entities which do not actually form part of the interface to the library are still exported and therefore are counted as part of the ABI by libabigail, leading to spurious relinking.

Performance

Is this solution performant? Unfortunately, the answer right now is a resounding “it depends”. The mongo::Status class is compiled into a library on which almost all other libraries and executables in the MongoDB server project depend. After making a non-ABI-altering edit to its implementation file, a rebuild of the all target on my machine is 40% faster when using abidw to skip relinks than when not. That is a fairly compelling win, but it is also the best-case scenario.

The worst-case scenario is pretty bad. There is a large cost to running abidw on each library. For some complex libraries it can be quite slow: running abidw on the SpiderMonkey JS engine takes upwards of 30 seconds. In terms of total compute time, a full relink with abidw takes about twice as much total CPU time as a full relink without. Another way to look at it is that running abidw is about as expensive as linking a second time. So using abidw may be worth it if you are working deep in the link graph and your work admits a high degree of link avoidance, but it may not be worth it if you are doing work that will cause lots of ABI changes. Unfortunately, it is hard to know up front which you are likely to do.
On the other hand, if you have subsets of the tree that change very infrequently, the cost for those subsets is amortized over many builds. Overall, further work on performance is likely required.

Or, maybe it isn’t. If proper header discipline is fully honored in a codebase, where every ABI-relevant function or object has a unique declaration in a header, it should be impossible to make a source code change that leads to an ABI variation in a library but does not cause all dependent libraries or programs to be rebuilt. In such a world, it should then be possible to weaken the build system rules for linking shared libraries to induce an order-only relationship, rather than a strict dependency. That would entirely obviate the need for using libabigail to detect ABI variation. Is it reasonable to expect such discipline? Are there ways to mechanically enforce it? Are there ways to subvert ABI compatibility despite such discipline? Would C++ modules offer that capability? Depending on the answers to those and similar questions, it might be better to invest time implementing that approach and associated tooling, rather than relying on ABI metadata.

Future directions

If using ABI metadata does prove to be the correct approach, there are some areas where the current implementation could be improved:

- Per the discussion above, general improvements to the speed of abidw are necessary. Other potential avenues to improve performance might include writing a compiler or linker plugin that could emit an ABI description analogous to that produced by abidw concurrently with executing the link step, obviating the need for a second pass by abidw.

- Our current practice of using pipes and file redirection in the Command body of the SCons tool is somewhat dangerous. We could change it to actually just emit the full XML into the .abidw file via the abidw --out-file option, and allow the SCons internal signature generation mechanisms to compute the hash.
However, this would end up writing hundreds of megabytes of information we don’t actually care about to disk, and clutter up the SCons cache. Potentially, adding a compression option to abidw would be an effective remediation. The total size of abidw data generated by a full build of MongoDB is 203 MB, but a simple gzip of each file brings it down to 14 MB.

- Another option to improve performance would be to eliminate XML generation entirely. We are currently doing a lot more work than needed, because we are generating XML that is then just fed right into MD5 to compute a signature. If the libabigail library had an abihash program that just emitted a signature directly, we could probably somewhat improve the performance of the tool.

- We could probably eliminate even more rebuilds by not just detecting whether the ABI changed, but by using other libabigail utilities like abidiff to do ABI compatibility detection and only relink when there was a non-ABI-compatible change in a dependency. This would allow new functions to be added to the library without requiring libraries that do not depend on those new functions to relink. We will investigate this in the future, but it probably would require a significantly more complex build system integration.

- The libabigail library currently only works on ELF platforms. It might be possible to make it work on macOS, because its debug info is also DWARF, but it would require significant effort to make it work with the Mach-O parts of the binary. There is definitely no support for Windows, though I’m curious as to whether the import library generated as part of DLL construction may contain enough information to identify the ABI. If you know, please reach out and let me know, or reply on the StackOverflow question on that topic.

Conclusion

Overall, has this approach to relink avoidance been successful?
Our current view is that the tool is not enough of a consistent win to deploy as the default for developer builds, but that the potential gains are compelling enough that we will continue to pursue the performance improvements and future directions outlined above. Should libabigail prove to be the right approach, we intend to invest time into the SCons integration to address some of the deficiencies and limitations identified above. If those issues can be resolved satisfactorily, we ultimately hope to see the tool merged into the SCons mainline. We also hope to work with the libabigail maintainers to further improve its feature set and performance. And, finally, even if the specific approach of using abidw to skip relinks proves non-viable, we are pleased with the insights that were incidentally developed regarding header discipline and linking, which may ultimately provide a zero-cost way to achieve the same goal.

If you are interested in experimenting with the tool, it is available as an Apache 2 licensed drop-in tool for the SCons build system. Thoughts, feedback, and bug reports are most welcome.

I’d like to thank Dodji Seketeli of Red Hat for writing libabigail, for his prompt responses to all of the issues that I have opened over the past year, and for his help reviewing this blog post. I’d also like to thank Mathias Stearn for his review of this post and thoughts about using header discipline to entirely eliminate the need for ABI metadata.

April 3, 2018
Engineering Blog

Considering the Community Effects of Introducing an Official MongoDB Go Driver

What do you do when an open-source project you rely on no longer meets your needs? When your choice affects not just you, but a larger community, what principles guide your decision? Submitting patches is often the first option, but you're at the mercy of the maintainer to accept them. If the changes you need are sweeping, substantial alterations, the odds of acceptance are low. Eventually, only a few realistic options remain: find an alternative, fork the project, or write your own replacement. Everyone who depends on open source faces this conundrum at one time or another.

After relying for years on the community-developed mgo Go driver for MongoDB, MongoDB has begun work on a brand-new, internally-developed, open-source Go driver. We know that releasing a company-sponsored alternative to a successful, community-developed project creates tension and uncertainty for users, so we did not make this decision lightly. We carefully considered how our choice would affect current and future Go users of MongoDB.

First, some history: Gustavo Niemeyer first announced the mgo community driver in March 2011 – around the same time that MongoDB released version 1.8.0 of the database. It currently has over 1,800 stars on GitHub and 32 contributors – including several current and former MongoDB employees. The incredible success of MongoDB in the Go community owes a great deal to Gustavo and mgo.

MongoDB itself is part of this community. As the Go language matured and gained in popularity, MongoDB found many uses for it internally. Some of the projects using it include:

- Our remote agents for automated deployment, for backup, and for monitoring.
- Our command-line operations tools, like mongodump (re-written in Go for the 3.0 server release).
- Our home-grown continuous integration system, Evergreen.
- Our cloud products, like MongoDB Atlas and Stitch, which have major components written in Go.
From this experience, our engineers contributed back to mgo: over half a dozen employees have commits in mgo, accounting for over 2,000 lines of changes. But the more we used mgo, the more we discovered limitations.

With our in-house drivers – covering popular languages with deep commercial adoption – we often start driver feature development in parallel with server feature development so that we can test them as soon as the server merges a feature. But as a community project, mgo's feature support generally lags MongoDB server development. More critically, our products that use mgo can't easily test against or take advantage of new server features. Even if we thought that Go didn't yet have critical mass in our user base to justify an in-house driver, our own company's products can't wait for new features. Sometimes, we patched a private copy of mgo to implement new features we critically needed. This isn't always easy.

In 2015, we announced our next generation drivers, built upon a published set of specifications for driver behavior. Because mgo predates this work, its conventions and internals don't match our specifications. When the server implements new features and the driver development team writes specs to match, these new specs assume implementation of prior specs. Developing comparable features in mgo can mean starting from a completely different base.

Not only does mgo have different internal conventions and behaviors than our in-house drivers, it encapsulates these behaviors in ways we found constraining. Usually, encapsulation is a good thing – a sign of good design – but many of our products benefit from low-level access to sockets, wire protocol models, and encoding. End-users don't need this access, but we have the knowledge to work with our own communication protocols and message formats safely and to great effect.
For example, our mongoreplay tool lets users replay a tcpdump of MongoDB server requests against a different server or cluster. When replaying the workload, we need server connection and authentication features – part of mgo's public API – but to replicate per-connection traffic we also need direct control over the number of socket connections and the socket message traffic, all of which is private. To enqueue requests and to read responses, we need access to the types representing the wire protocol messages – also private types that are never visible to end users.

Over time, we found ourselves copying-and-pasting parts of mgo source into project-specific libraries, or re-implementing parts of the wire protocol or driver behaviors directly. There is a real cost in the time it takes engineers to patch mgo or to write, fix, and extend a plethora of internal libraries, plus the opportunity costs of having our own products not being able to use our own server's latest features. We decided to consolidate and standardize on one implementation to address all these needs. We considered two alternatives:

- Fork mgo completely – developing at our pace, modifying internals as needed, and extending the APIs to suit our needs.
- Develop a new driver – building from the ground up to our specifications, putting it on par with our other officially-maintained drivers.

Forking mgo would have a handful of benefits but many challenges. In the benefits column, forking would minimize the impact on our existing products that use mgo, as well as for any user who chose to use our fork over the original. In the challenges column, we identified both technical and social considerations that gave us pause.
On the technical side, a fork wouldn't close the large gap to our common specifications, making new feature development much harder than for our internally-developed drivers. It also raises a tough question: what if we implement a new feature in our fork, only to find that mgo implements it a different way? The more we might take the internal architecture and the API in a different direction from mgo, the harder it would be to keep our fork a "drop-in" replacement, and the harder it would be to send patches upstream or to merge in upstream development. We felt a fork would quickly become an independent, backwards-incompatible product, despite a common lineage – undercutting the main benefit of forking. Forking would also mean taking on mgo's technical debt, which we wanted to avoid. On the social side, we knew that anything we released – whether a fork or a new driver – could have a disruptive effect on the existing mgo community. We didn't want to discourage anyone happy using mgo with MongoDB from continuing to use it. We wanted to invite people who wanted something more to try something new, rather than – via forking – implicitly asking people to pick sides in a project they already use. In light of these challenges, we decided instead to write a new, independently-developed Go driver to join the eleven other drivers in our officially-maintained driver ecosystem. A fresh start allows us to focus our efforts on four main benefits:

- Velocity: once complete, the new Go driver will evolve as fast as the server does. We'll be able to dog-food new features internally before each server GA release.
- Consistency: the new Go driver will follow our common specifications from the outset, so the new driver API will feel like other MongoDB drivers, shortening the learning curve for users. We'll also stay idiomatic to Go, such as supporting context objects for cancellable requests.
- Performance: a new driver gives us an opportunity to provide a new, higher-performance BSON library and to design the driver API in a way that gives users more control over memory allocations.
- Low-level API: for our own in-house products and other power users, we will provide low-level components for reuse, reducing code duplication across the company. Unlike the rest of the driver, this API will have no stability guarantee and no end-user support, but it will let us develop better products faster, and our users will benefit in turn.

Fortunately, we were able to start from a prototype driver custom-developed for our BI Connector – written by a former driver engineer – and build from that base towards the common driver specification. We're now finalizing the details of the new BSON library and the core CRUD API. What's next for the driver? In the coming months, we'll ship an "alpha" release of the Go driver and make the code repository public. At that point we'll ask members of the Go-using MongoDB community to try it out and help us improve it with their feedback. Update, 2/19/2018: The new driver is now in alpha; please read the announcement for more info about trying it out.

January 11, 2018
Engineering Blog

Investing in CS4All: One Year Later

When a couple of New York City high school teachers partnered with MongoDB to teach computer science, did they succeed? Their curriculum was untested, and they were teaching in difficult districts where most students are from poor and minority families. I talked with these two teachers, Jeremy Mellema and Timothy Chen, back in September, when they had completed a summer fellowship at MongoDB and had just started teaching their curriculum; at the end of the academic year this spring, I visited Jeremy and Tim again to find out the result. Their successes were sparse and partial. They discovered that their students' poor reading skills were a barrier to learning to code, and that teaching new coders how to solve problems is, itself, an unsolved problem. With a coarse unit of iteration – a school semester – it is painfully slow to experiment and find teaching methods that work. But even partial wins make a difference for individual kids, and the support of professional engineers at companies like MongoDB can be a powerful accelerant.

What engages students

Jeremy's main struggle was to get his students excited about code. He was assigned to teach a computer science class at Bronx Compass High School in the fall, using the curriculum he wrote during his fellowship at MongoDB last summer. In the beginning he spent too much time lecturing. “It felt weird,” he said. “It should be more like, ‘Let’s get down and dirty,’ and not, ‘Let’s have me talk to you.’” Even when his students did get their hands on computers, the first exercises were simply retyping Python scripts from a textbook. The payoff, watching a script run without throwing an exception, was hardly satisfying to them. Things started to click when he introduced Python turtle graphics, which gave the class more obvious evidence of their accomplishments. It also allowed Jeremy better opportunities to motivate and engage his students directly.
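Part of turtle's appeal is that a visible result takes only a little arithmetic. As an illustrative sketch (not Jeremy's actual classroom code, and the function name here is my own), the geometry behind the classic five-pointed turtle star comes down to one exterior angle:

```python
# A five-pointed star closes after five edges because the pen turns through
# 144 degrees at each point, and 5 * 144 = 720 -- exactly two revolutions.

def star_commands(points=5, size=100):
    """Return (forward-distance, right-turn) pairs tracing a star outline."""
    turn = 180 - 180 / points  # 144.0 for a five-pointed star
    return [(size, turn) for _ in range(points)]

# Feeding the pairs to the standard library's turtle module draws the star:
#
#   import turtle
#   for dist, angle in star_commands():
#       turtle.forward(dist)
#       turtle.right(angle)

print(star_commands())
```

The same function with `points=7` traces a seven-pointed star, which is one way a student "beats" the teacher's drawing.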
“Some days I would challenge them to see who could make the craziest drawing,” Jeremy says. He would tell his students, “That’s so cool. I only made a star. You definitely beat me today.” Jeremy teaches both history and computer science, and he finds that some of his lowest-performing history students are his best CS students. “It’s satisfying to see them in their element,” he says. In Jeremy's view, a computer science class can touch a student's intellect just as deeply as history. “People are multifaceted. You’re not only who you are when you’re in my history class.” Jeremy's computer science class was cancelled this spring; the students at Bronx Compass High School are behind on history credits and there are only three history teachers on staff. For now, computer science is merely an elective, so Jeremy is back teaching history full-time. “I really miss teaching CS,” he says. If he resumes the course, Jeremy thinks it must be livelier. He is reconsidering his use of the videos from How To Think Like a Computer Scientist, which he studied last summer on the recommendation of his mentor Shannon Bradshaw, MongoDB’s vice president of education. The content helped Jeremy train to teach CS, but when he showed the videos to his kids they were bored. Jeremy hopes to make new videos that will draw them in. His students from the fall semester say he should get their advice. Otherwise, they warned, “you might do something that you think is cool but it’s actually super corny.”

A head start

Although there is no computer science elective this semester, some students are pursuing the topic in other ways. A young woman from his class in the fall, Tatyana Camacho, now interns for the high school’s IT department. I had quoted her in my previous article, and Jeremy tells me she loved it. She commanded him to show it to her father at the next parent-teacher conference: “You need to show my dad that I’m one of the advanced students.” Jeremy still runs the afternoon Computer Club.
I visited the club to meet a student, Daniel Rodriguez, who was tinkering with an Arduino and a circuit board that the school provided. “I don't have the ability to get this equipment otherwise, in my predicament,” says Daniel. He starts his Arduino projects by copying examples. The wiring is easier than the coding for him, he told me, "especially because I'm not the best speller in the world." Once he has an example working, he modifies it to his own taste. Most recently, he wanted to show a message, but with only LEDs he can’t display much. He researched Morse Code and made a light flash the code for “HELLO”, like any programmer demonstrating a system for the first time. “Most people think that once you plug something in, that’s it, it works,” says Daniel. “But I’m the person that makes the circuit run. I tell people, ‘I made it do that.’ And seeing them fascinated by what I did, it makes me, in turn, fascinated by what I'm doing.” Daniel has to return the Arduino at the end of the year. Next year he’ll go to a trade school for electricians. Working with the Arduino will give him an advantage, he hopes, and it seems plausible to me. As he finishes school and starts work as an electrician, the world will be changing around him: smart appliances and programmable components will be everywhere. An electrician who loves to code will have a big head start.

Making anything they want

Timothy Chen teaches in Hell’s Kitchen, at Urban Assembly Gateway School for Technology. I visited his class in May to see how his students had progressed since I last saw them in September. They were involved in a multi-week project called the AP Create Task, part of a national Advanced Placement exam. “They are allowed to make literally anything they want, in any language,” says Tim. Students submit their code and a one-minute video of the program in action, and they may describe their project either in writing or in audio narration.
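Projects this small really can be made in any language. Daniel's Morse-code hello, described above, reduces to a lookup table plus timing rules; here is an illustrative Python sketch (his actual program would have been an Arduino sketch in its C-like language, and the table and timing unit here are my assumptions):

```python
# Morse code for just the letters in "HELLO"; a full table has 26 entries.
MORSE = {"H": "....", "E": ".", "L": ".-..", "O": "---"}

DOT = 1  # one time unit of LED-on for a dot; a dash is three units

def encode(word):
    """Translate a word into Morse symbols, letters separated by spaces."""
    return " ".join(MORSE[letter] for letter in word)

def flash_durations(code):
    """Turn Morse symbols into LED on-times, measured in dot units."""
    return [DOT if symbol == "." else 3 * DOT
            for symbol in code if symbol != " "]

print(encode("HELLO"))                      # .... . .-.. .-.. ---
print(flash_durations(encode("HELLO")))     # sixteen flashes in all
```

On real hardware each duration would drive a digital-write-high, delay, digital-write-low cycle; here the list of on-times stands in for the blinking LED.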
I was surprised by how 21st Century the test is, and how accommodating it could be to students with a deficit in reading and writing. It must be remarkably difficult, however, to score fairly. The Create Task is many students’ first time scoping and integrating a sizable project, and there were flameouts. One young man tried to make a maze game drawn with ASCII characters; it proved too ambitious and he ran out of time. Tim isn’t supposed to help students define the scope of their projects, but if they announce they’re going to tackle something difficult he will push them to list all the components. In the best case, they realize they don’t know how to do most of the project and choose something simpler. One of Tim’s students, Jahseem Maxwell, was building a Go Fish card game in Python, and she was having trouble integrating the pieces. “It has to be a certain order and it's hard to make that order when you don't know, really, what you’re doing. I’m struggling, putting it all together.” Another student, Cecilia Gonzalez, was writing a Choose Your Own Adventure game. She says the AP Create Task encourages students to work in pairs. “We work sort of together but not exactly.” Each must create at least one significant part of the program independently. Cecilia’s game is based on a monster of urban legend called The Rake, which comes closer when you think about it. The game begins by asking questions such as the player’s name and height. She told me the player’s answers will determine “some things that are going to happen,” but she didn’t give away any spoilers. When Tim began teaching the class in September, he hadn’t written the ending yet, either. His greatest fear was his students would learn the curriculum faster than he could write it. By May it was clear that wasn’t a problem. “Some of the students can’t read very well, and that was a big barrier because all the things I made were text,” he says. 
“Everything just took longer than expected.”

Problem solving

How do you teach problem solving? This is Tim’s great unanswered question from the year. Perhaps if high school computer science were taught like math, as a series of small problems with only one right answer each, then how to solve those problems wouldn't be such a mystery. But high school CS is taught like art class. Tim’s students invent new projects and somehow solve the unpredictable problems that arise in them. Tim speculates that he would learn problem solving himself by watching an experienced programmer solve a new problem, hit roadblocks, and overcome them. Indeed, that is how I have taught problem solving to MongoDB interns. Together, we attack problems without knowing the answers beforehand. It requires an entire summer of one-on-one collaboration. “I don't think that model works very well with the kids,” says Tim, “especially if they are not very good with sitting still for an extended period. I'm not sure how to reach them.”

Falling in love

Tim, like Jeremy, wants to make more multimedia to reach students despite their poor reading skills. “I want to rethink how it should be done before I start this time. I kind of jumped into it too quickly.” Tim’s main goal is to give kids the chance to fall in love with programming and continue on their own. Many other goals are still out of reach: students at his school score low on the AP test, and few of them are likely to get a college degree in CS or become professional coders. Still, Tim hopes that a more varied course, with audio and video, could bring students farther. “The big hurdle for everyone is teaching problem solving. If I can get that, everything else is easy. I'm still trying to figure out how to do that.”

October 3, 2017
Engineering Blog

Farewell, Solaris

Solaris was the first “real operating system” I ever used. The Brown University Computer Science Department was a Sun Microsystems shop when I was an undergraduate there in the late 90s. When I took the operating systems lab class, CS-169, we implemented a toy version of Sun’s research operating system, Spring OS. Several of my contemporaries in the CS Department went on to work at Sun, and developed or advanced many of the technologies that made Solaris great, like ZFS, dtrace, libumem, mdb, doors, and zones. The Solaris Linkers and Libraries Guide remains one of the best ways to develop an understanding of shared library internals. The first startup I worked for developed on Solaris x86, because the team knew Solaris well. Today, many of my co-workers on the server engineering team here at MongoDB share that formative experience with Solaris. We have a great deal of collective nostalgia and appreciation for Solaris and the amazing engineering effort that went into its development. So it is, for many of us at MongoDB, bittersweet to announce that MongoDB is terminating support for Solaris. Effective immediately, we plan to cease production of new builds of MongoDB for Solaris, across all supported versions of MongoDB. Existing release artifacts for Solaris will continue to be made available, but no new releases will be issued, barring a critical issue raised under an existing support contract covering MongoDB versions 3.0 through 3.4 running on Solaris. We will continue to fix critical flaws for the community, regardless of where they are found or how they are reported. Anyone can report a security vulnerability by using our Security project to create an account, then a ticket describing the vulnerability. This was not an easy decision for us to make, and we feel it is important to provide some background on why we have made what may at first seem to be a capricious decision.
The principal reason for us to drop Solaris support is simply a lack of adoption among our user base. Of our commercial users, we knew of only a handful who had ever been running on Solaris, and all confirmed that they had migrated away, or were in the process of doing so. Our download numbers for our Solaris builds confirmed this lack of interest, as did stats gathered from our managed operations tools — we find about 0.06% (and decreasing) of MongoDB users are running on Solaris. Additionally, we found that the cost to continue to support Solaris was very high, and that it was increasingly becoming an obstacle to developer productivity. Among the difficulties we experienced:

- The ecosystem is fragmented: You say you want to run on “Solaris”. OK, fine, but which one? Illumos? OpenSolaris? OpenIndiana? SmartOS? Oracle Solaris? OmniOS? Do we release for one of those? If so, which one? All? How do versions relate across them? What is the level of binary compatibility? Which versions do we need to support, on which variants? How do we certify that we support all of them? Linux suffers from (or even benefits from) a similar profusion of flavors, but we have a significant population of users on each of the major flavors, so it makes sense to explicitly support them all. Not so for the numerous Solaris variants.
- Our development tools work poorly on Solaris: Clang doesn’t support Solaris at all, as far as we can tell. Golang doesn’t consider it a first-class platform. GDB seems unable to handle the simplest of tasks when confronted with threads on Solaris. GCC appears not to fully support important C++11 features like the thread_local keyword, due to missing support in the Solaris C library, at least on the versions we used. Obviously, all of these could be fixed if there were a pressing commercial upside to doing so, but we don’t see that to be the case.
- Lack of developer familiarity: While several of our senior developers know their way around Solaris well, our junior devs have never touched it. Investing in teaching them is of questionable value. Sometimes, though we try hard to scrub them out, we find that some tests are flaky. Sometimes a failure exhibits on Solaris, but the issue isn’t specific to that platform. Should we be investing the time of our most seasoned engineers to track these down to prove that they aren’t Solaris-specific?
- Operational difficulties: Most of our CI testing is done in AWS. On multiple occasions, our Solaris images have simply stopped working, and repairing them took significant engineering work. The most recent outage was particularly troublesome and time-consuming. The engineering effort required to sustain the platform does not seem warranted.
- The future of Oracle Solaris, perhaps the one true Solaris if you had to pick one, is murky at best.

While no single one of these issues seems sufficient on its own to argue for terminating support for Solaris, when combined with the observed lack of interest or use, it makes for a compelling case. We would rather invest our time and effort developing for the platforms that our users actually use. So, with some real sadness and fond memories, we have decided to say goodbye to Solaris. We will miss you, Solaris, but it is time we parted ways.

August 29, 2017
Engineering Blog
