Integrating Support for Non-Volatile Memory Into WiredTiger
Intel Optane DC Persistent Memory is a non-volatile memory (NVRAM) product that resembles both storage and memory and can be used as either. Like storage, Optane NVRAM retains data after a crash or power outage. Like memory, it sits on the memory bus and can be accessed by the CPU using load/store instructions. In certain scenarios, its access latency even approaches that of dynamic random access memory (DRAM). At MongoDB, we have been thinking about how to use NVRAM in the storage engine. It can be seen as an extension of volatile DRAM, but a denser and cheaper one. In pursuit of this goal, we extended our storage engine, WiredTiger, with a volatile NVRAM cache that retains frequently used file blocks. In this article, we share our experience, describe the lessons learned, and evaluate the costs and benefits of this approach.
How to use NVRAM in the storage stack
Optane NVRAM can act as both storage and memory. The persistent memory fabric itself can be packaged as a solid-state drive (SSD), as in Optane SSDs, or as a dual-inline memory module (DIMM) that looks almost like its DRAM counterpart and lives in the same type of slot on the motherboard. Even when NVRAM is packaged as a non-volatile DIMM (NVDIMM), we can ask the operating system to present it as a block device, put a file system on top, and use it just like regular storage. Broadly speaking, there are three ways to use NVRAM: as regular storage, as persistent memory, and as an extension to volatile memory.
NVRAM as storage
Using NVRAM as regular storage can deliver superior throughput (compared to SSD) for read-dominant workloads, but this approach hinders write-dominant workloads because of Optane NVRAM’s limited write throughput (see the section “Performance properties of Optane NVRAM”). In any case, both the price and density of NVRAM are closer to those of DRAM than to those of SSD, so using it as storage is not recommended.
NVRAM as persistent memory Imagine that all your data structures live in memory and that you never have to worry about saving them to files. They are just there, even after you quit your application or if it suffers a crash. Although this setup sounds simple, in practice, it is still challenging to program for this model. If your system crashes and you would like to be able to find your data after restart, you need to name it. A variable name is not sufficient, because it is not unique; thus, you have to restructure your code to make sure your data has persistent identifiers. Persistent Memory Development Kit (PMDK) provides APIs for that. A more difficult problem is surviving a crash. Your program may crash in the middle of a logical operation on a data structure. For example, suppose you are inserting an item into a linked list, and you have set the source pointer, but the crash occurs before setting the destination pointer. Upon restart, you’ll end up with corrupted data. To make matters worse, even if the logical operation had completed before the crash, the data might have been written only to CPU caches but not persisted to the memory itself . One solution is to wrap memory operations in transactions; however, programming transactional memory is notoriously difficult. Another solution is to use prepackaged data structures and APIs, but if you are looking to create your own highly optimized data structures, you must implement your own logging and recovery or other mechanisms that protect your data similarly to transactions. NVRAM as an extension of volatile memory Somewhat counterintuitively, this option involves disregarding the persistence of NVRAM and using it as a volatile extension of DRAM. Why would you want to do that? Suppose you have a fixed budget to buy extra memory for your system. 
You can either afford N GB of DRAM or about M*N GB of NVRAM, because NVRAM is denser and cheaper per byte than DRAM (about three times cheaper, at the time of writing). Depending on your application, you might be better off in terms of performance/$ if you buy additional NVRAM, as opposed to DRAM. In support of this use case, Intel provides a hardware mechanism, called Memory Mode, which treats NVRAM as “regular” system memory and uses DRAM as its cache. In other words, the hardware will do its best to place frequently used data structures in DRAM, and the rest will reside in NVRAM. The beauty of this mechanism is that it requires absolutely no changes to applications. The downside is that it may perform worse than a custom solution for certain workloads (see section “How NVCache affects performance”). Our solution is a custom-built volatile cache that resides in NVRAM.
Our architecture
Our NVRAM cache (or NVCache) is a component of the MongoDB storage engine WiredTiger. For persistent storage, WiredTiger organizes data into blocks, where keys and values are efficiently encoded and (optionally) compressed and encrypted. For fast query of its B+-tree data structure, WiredTiger transforms blocks into pages, where keys/values are decoded and indexed. It keeps pages in its DRAM page cache.
Figure 1. The architecture of NVCache.
Figure 1 shows the architecture of NVCache. NVCache is the new component, and the rest are part of WiredTiger. NVCache sits next to the block manager, which is the code responsible for reading/writing the data from/to persistent storage. Let’s look at each path in turn.
Read path: If the page cache cannot locate the searched-for data, it issues a read to the block manager (1). The block manager checks whether the block is present in NVCache (2), accesses it from NVCache if it is (3), and reads it from disk if it is not (4).
The block manager then transforms the block into a page, decrypting and decompressing it if needed, and then hands it over to the page cache (5). It also notifies NVCache that it has read a new block, and NVCache then has the discretion to accept it (6). NVCache stores the blocks in the same format as they are stored on disk (e.g., compressed or encrypted if those configuration options were chosen).
Write path: The write path differs from the read path in that WiredTiger does not modify disk blocks in place. It writes updates into in-memory data structures and then converts them into new pages, which are sent to disk either during eviction from the page cache or during a checkpoint (7). When the block manager receives a new page, it converts it into a new block, writes the block to storage (8), and informs NVCache (9). NVCache then has the discretion to accept it. Obsolete blocks are eventually freed, at which time the block manager instructs NVCache to invalidate cached copies (10). To avoid running out of space, NVCache periodically evicts less-used blocks. The eviction thread runs once a second. Overall, this design is straightforward, but making it performant was a challenge. As expected with brand new storage or memory devices, the software must cater to their unique performance properties. In the next section, we focus on these performance features and explain how we adapted our cache to play along.
Performance properties of Optane NVRAM
In low-bandwidth scenarios, the access latency of Optane NVRAM approaches that of DRAM. A small read takes about 160 to 300 nanoseconds, depending on whether it is part of a sequential or a random access pattern [1]; a read from DRAM takes about 90 nanoseconds [3]. Small writes are as fast as in DRAM [3] because the data only has to reach the memory controller, where it will be automatically persisted in case of a power loss. In high-bandwidth scenarios, we usually look at throughput.
Sequential read throughput is about 6 GB/s for a single NVDIMM [1, 2] and scales linearly as you add more memory modules. (A single 2nd Generation Intel Xeon Scalable processor can support up to six NVDIMMs.) The write throughput is more limited: We observed up to 0.6 GB/s on a single NVDIMM [2], and others observed up to 2.3 GB/s [3]. Again, if your workload writes to different NVDIMMs, the throughput will scale with the number of modules in your system. A somewhat troublesome observation about write throughput is that it scales negatively as you add more threads. Write throughput peaks at one or two concurrent threads and then drops as more threads are added [2, 3]. More importantly, we were surprised to find that, on Optane NVRAM, the presence of writers disproportionately affects the throughput of readers.
Figure 2. Read throughput in presence of concurrent writer threads.
Figure 2 shows how the throughput of eight reader threads drops as more concurrent writers are added. Although this effect is present on both DRAM and NVRAM (and certainly on other storage devices), on Optane NVRAM, the effect is much more pronounced. Performance of reads will suffer in the presence of writes. This important observation drove the design of our NVCache.
Throttling writes in caches for Optane NVRAM
For a cache to be useful, it must contain popular data. The duties of admitting fresh data and expunging the old fall on cache admission and eviction policies, respectively. Both admission and eviction generate writes, and, because writes hurt the performance of reads on Optane, admission and eviction will interfere with performance of cache retrievals (which involve reads). Thus, we have a trade-off: On one hand, admission and eviction are crucial to making the cache useful. On the other hand, the write operations that they generate will hamper the performance of data retrievals, thereby making the cache less performant.
To resolve this tension, we introduced the Overhead Bypass (OBP) metric, which is the ratio of writes to reads applied to the cache. Keeping this ratio under a threshold allowed us to limit the overhead of writes:
OBP = (blocks_inserted + blocks_deleted) / blocks_looked_up
Intuitively, blocks_looked_up correlates with the benefit of using the cache, whereas the sum of blocks_inserted and blocks_deleted correlates with the cost. NVCache throttles admission and eviction to keep this ratio under 10%. (Our source code is available in the WiredTiger public GitHub repository.) Without OBP, the sheer overhead of data admission and eviction was quite substantial. To measure this overhead in its purest form, we experimented with workloads that do not stand to benefit from any extra caching, such as those with small datasets that fit into the OS buffer cache (in DRAM) or those that perform so many writes that they quickly invalidate any cached data. We found that using NVCache without the OBP feature caused these workloads to run up to two times slower than without the cache. Introducing the OBP completely eliminated the overhead and enabled the workloads that stand to benefit from extra caching to enjoy better performance.
How NVCache affects performance
In this section, we’ll look in detail at the performance of workloads with large datasets that stand to benefit from an additional cache.
Experimental system: The following experiments were performed on a Lenovo ThinkSystem SR360 with two Intel Xeon Gold 5218 CPUs. Each CPU has 16 hyper-threaded cores. The system has two Intel Optane persistent memory modules of 126 GB each. For storage, we used an Intel Optane P4800X SSD. We configured our system with only 32 GB of DRAM to make sure that extra memory in the form of NVRAM would be called for. We present the data with widely used YCSB benchmarks [4, 5] (Table 1), although we also performed analysis with our in-house benchmarks and reached similar conclusions.
Table 1.
Characteristics of YCSB benchmarks
The following charts compare the throughput of YCSB with NVCache, with Intel Memory Mode (MM), and with OpenCAS [6], a kernel implementation of an NVRAM-resident cache from Intel. OpenCAS was configured in the write-around mode, which was the best option for limiting the harmful effect of writes [7]. Figures 3a-c show the data in configurations using 63 GB, 126 GB, and 252 GB of NVRAM, respectively.
Figure 3. Throughput of YCSB under Memory Mode (MM), OpenCAS, and NVCache relative to running with DRAM only.
We make the following three observations:
OpenCAS cache delivers no performance benefit from extra NVRAM. It achieves a read hit rate similar to or better than that of NVCache but also makes two orders of magnitude more writes to NVRAM, probably because it does not throttle the rate of admission. Writes interfere with performance of reads, which is probably why this cache delivers no performance benefits.
When the dataset size exceeds NVRAM capacity, NVCache provides substantially better performance than Memory Mode. As shown in Figure 3a, NVCache outperforms Memory Mode by between 30% (for YCSB-B) and 169% (for YCSB-C). Furthermore, Memory Mode hurts YCSB-A’s update throughput by about 18% relative to the DRAM-only baseline, while NVCache does not.
Memory Mode performs comparably to NVCache when NVRAM is ample. With 252 GB of NVRAM, all datasets comfortably fit into the NVRAM. Two factors explain why NVCache loses its edge over MM with ample NVRAM: (1) For NVCache, the marginal utility of additional NVRAM is small after 126 GB; NVCache hit rate grows by about 20% when we increase NVRAM size from 63 GB to 126 GB, but only by another 5% if we increase it from 126 GB to 252 GB. (2) While MM allows the kernel buffer cache to expand into NVRAM, NVCache confines it to DRAM, which is also used by WiredTiger’s page cache. Contention for DRAM limits performance.
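The gap in write volume between OpenCAS and NVCache comes down to admission throttling. As a rough illustration of the OBP rule described earlier, the sketch below allows an insertion or eviction only if the ratio would stay at or below 10% afterwards. The struct, the function names, and the integer-arithmetic check are assumptions made for this example; they are not taken from the WiredTiger source.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical counters, named after the terms in the OBP formula. */
struct nvcache_stats {
    uint64_t blocks_inserted;
    uint64_t blocks_deleted;
    uint64_t blocks_looked_up;
};

/* Assumed threshold: keep (inserted + deleted) / looked_up at or below 10%. */
#define OBP_THRESHOLD_PCT 10

/* True if one more cache write (an insert or a delete) keeps OBP within the threshold. */
static bool obp_allows_write(const struct nvcache_stats *s)
{
    uint64_t writes_after = s->blocks_inserted + s->blocks_deleted + 1;
    /* writes_after / looked_up <= 10%, rewritten without division. */
    return writes_after * 100 <= s->blocks_looked_up * OBP_THRESHOLD_PCT;
}

/* Every lookup counts toward the benefit side of the ratio, hit or miss. */
static void nvcache_note_lookup(struct nvcache_stats *s)
{
    s->blocks_looked_up++;
}

/* Admission is optional: when the budget is spent, the block is simply not cached. */
static bool nvcache_try_admit(struct nvcache_stats *s)
{
    if (!obp_allows_write(s))
        return false;
    s->blocks_inserted++;
    return true;
}

/* Eviction is throttled the same way, since it also writes to NVRAM. */
static bool nvcache_try_evict(struct nvcache_stats *s)
{
    if (!obp_allows_write(s))
        return false;
    s->blocks_deleted++;
    return true;
}
```

With this rule, a cache that is rarely consulted admits almost nothing, which matches the intent that workloads with no use for extra caching should not pay for it.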
Overall, the benefit of a custom NVRAM cache solution is that it provides better performance than the Memory Mode for large workloads. The disadvantage is that it requires new software, whereas MM can be used without any changes to applications.
Performance and cost
In this section, we explore the trade-offs of using Optane NVRAM as a volatile extension of DRAM versus just using more DRAM. To that end, we take a fixed memory budget of 96 GB and vary the fraction satisfied by DRAM and NVRAM as shown in Table 2.
Table 2. Budget of memory configurations containing both DRAM and NVRAM relative to DRAM-only. We use the NVRAM-to-DRAM price ratio of 0.38 [8].
Figure 4. Performance per dollar as the amount of NVRAM increases and the amount of DRAM decreases (in YCSB workloads).
Figure 4 shows the performance of YCSB under these configurations normalized to using 96 GB DRAM and divided by the cost ratio in column 3. In other words, these are performance/$ numbers relative to the DRAM-only configuration. In these experiments, we used only NVCache to manage NVRAM, as it performed comparably to or better than other options. Positive numbers mean that the performance decreased less than the memory cost. Read-only or read-mostly workloads that benefit from the NVCache experience a positive gain, as expected. Although in most cases performance predictably drops as the amount of DRAM decreases, YCSB-C in the configuration with 64 GB NVRAM and 32 GB DRAM performs better than it does with 96 GB DRAM, so we decrease the system cost and improve performance in absolute terms. This occurs because beyond 32 GB of DRAM, the utility of additional memory (and a larger page cache) is considerably smaller than the loss in performance due to a smaller NVCache. YCSB-A, whose write intensity prevents it from deriving benefits of any additional caching, suffers an overall loss in terms of performance/$. Its performance drops at a steeper rate than the memory cost as we decrease the amount of DRAM.
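To make the normalization concrete, here is the arithmetic for the configuration with 32 GB DRAM and 64 GB NVRAM. It assumes that memory cost scales linearly with capacity and uses the 0.38 price ratio quoted above; the resulting cost ratio is computed here for illustration and is not copied from Table 2. The symbols T_mixed and T_DRAM (throughput of the mixed and the DRAM-only configuration) are introduced only for this example.

```latex
\[
\text{cost ratio}
  = \frac{32 + 0.38 \times 64}{96}
  = \frac{56.32}{96}
  \approx 0.59
\]
\[
\text{relative performance}/\$
  = \frac{T_{\text{mixed}} / T_{\text{DRAM}}}{0.59}
\]
```

Under these assumptions, the mixed configuration comes out ahead in performance/$ whenever it retains more than about 59% of the DRAM-only throughput.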
We conclude that NVRAM is a cost-effective method of reducing memory cost while balancing the impact on performance for read-dominant workloads. At the same time, even a modest presence of writes can render NVRAM unprofitable relative to DRAM.
References
[1] J. Izraelevitz, et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv:1903.05714.
[2] Sasha Fedorova. We Replaced an SSD with Storage Class Memory. Here is What We Learned. The MongoDB Engineering Journal.
[3] Jian Yang, et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. USENIX Conference on File and Storage Technologies (FAST 2020).
[4] Yahoo! Cloud Serving Benchmark, Git Repo.
[5] B.F. Cooper, et al. Benchmarking Cloud Serving Systems with YCSB. SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing.
[6] Open Cache Acceleration Software.
[7] Open CAS Linux - Admin Guide.
[8] H.T. Kassa, et al. Improving Performance of Flash Based Key-value Stores Using Storage Class Memory as a Volatile Memory Extension. USENIX Annual Technical Conference (USENIX ATC 21).
We Replaced an SSD with Storage Class Memory; Here is What We Learned
On April 2, 2019, Intel Optane Persistent Memory became the first commercially available storage class memory (SCM) product. Like SSD, this memory is persistent, and like DRAM it sits on the memory bus. Long before a commercial release, system architects pondered how exactly SCM fits in the storage hierarchy, and now there was an opportunity to perform concrete measurements. One question we wanted to answer is whether a storage device sitting on a memory bus can deliver better throughput than an equivalent storage device sitting on a PCIe bus. There are two ways to access SCM: byte interface and block interface. Byte interface allows using load and store instructions, just like with DRAM. Block interface exposes SCM as a block device, optionally with a file system on top: this way it can be accessed just like a conventional SSD. The load/store API is streamlined, because nothing stands between the application and the hardware, but also tricky to use, because it does not come with features like crash consistency, the way a file system API usually does. Accessing SCM via the block or file system API comes with OS overhead, but there is no need to rewrite applications. WiredTiger, MongoDB’s storage engine that we evaluated in this article, reads and writes data to/from storage using sizeable blocks (typically 4KB or larger). Besides being a necessity on conventional storage hardware today, using the block API has other practical advantages. For example, compression and encryption, features that customers covet, are optimized to work with blocks. Similarly, checksums that safeguard against data corruption are computed on blocks of data. Furthermore, WiredTiger caches blocks of data in its DRAM-resident cache, which together with the OS buffer cache is a boon to performance. Block-orientedness and reliance on caching positioned WiredTiger, like many other storage engines, to effectively hide the latency of slow storage devices.
As a result, our experiments revealed that the modest latency advantage that SCM provides over a technology-equivalent SSD does not translate into performance advantages for realistic workloads. The storage engine effectively masks these latency differences. SCM will shine when it is used for latency-sensitive operations that cannot be hidden with batching and caching, such as logging. In the rest of the article we detail the results of the experiments that led us to this conclusion.
Experimental Platform
We experimented with two storage devices: Intel Optane DC Persistent Memory (PM) and Intel Optane P4800X SSD. Both are built with the Intel Optane 3D XPoint non-volatile memory, but the former is an SCM that sits on the memory bus while the latter is a PCIe-attached SSD.
Microbenchmarks
To begin with, we gauged raw device bandwidth with microbenchmarks that read or write a 32GB file using 8KB blocks. We vary the number of threads simultaneously accessing the file, each accessing its own chunk. A file can be accessed either via system calls (read/write) or mmap; the latter method usually has less overhead.
SSD drive's raw performance meets the spec
According to the spec, our P4800X drive is capable of up to 2.5GB/s sequential read bandwidth and up to 2.2GB/s sequential write bandwidth. Here are the numbers we observed via the Ubuntu Linux (5.3 kernel) raw device API, meaning that the data bypasses the OS buffer cache.
Raw SSD performance, sequential reads and writes
The read bandwidth behaves according to the specification as long as we use at least two threads. The write bandwidth, unexpectedly, exceeds its specified upper bound when using multiple threads. We suspect that this could be due to buffering writes either at the OS or at the device. The Optane P4800X SSD is faster than a typical SSD at the time of this writing, but not to the point of being incomparable.
While the Optane SSD offers up to a 2.5GB/s of sequential read bandwidth, a typical NAND SSD (e.g., Intel SSD Pro 6000p series) offers up to 1.8GB/s. The difference is more noticeable with writes. The Optane drive can deliver up to 2.2 GB/s, while the NAND drive can do no more than 0.56 GB/s. Random reads and writes on the Optane SSD are not that much worse than sequential ones. We are able to achieve (using mmap) close to peak sequential throughput with reads and only 10% short of peak sequential throughput for writes. Near-identical performance of sequential and random access is a known feature of these drives . SCM offers high-bandwidth reads, but low-bandwidth writes Now let us look at the raw performance of SCM. Intel systems supporting Optane PM can fit up to six DIMMs; our experimental system had only two. We measured the throughput on a single DIMM and two DIMMs used together, to extrapolate scaling as the number of DIMMs increases. We also relied on data from other researchers to confirm our extrapolation. There are two ways to obtain direct access to PM: (1) devdax — a PM module is exposed as a character device and (2) fsdax — in this case we have a file system on top of a PM module masquerading as a block device, but file accesses bypass the buffer cache via the Direct Access (DAX) mode. In our experiments we used the ext4 file system. The following chart shows the throughput of sequential reads and writes obtained via these access methods. In all cases we use mmap, because that is the only method supported by devdax. Sequential read bandwidth of a single PM module reaches about 6.4 GB/s; that matches observations of other researchers . Random access behaves almost identically to sequential access, so we omit the charts. Storage class memory, sequential reads, single PMEM device. Storage class memory, sequential writes, single PMEM device. Write experiments tell a different story. The single-module write throughput achieves a mere 0.6 GB/s. 
This measurement does not agree with the data from the UCSD researchers who observed around 2.3GB/s write bandwidth on a single device. Further investigation led us to believe that this was due to differences in hardware versions. That said, our observations reveal that a single PM module achieves write throughput comparable only to a NAND SSD. Next, let’s look at scaling across two devices. The following figure shows the measurements for sequential reads and writes, using mmap over fsdax. We used the Linux striped device mapper to spread the load across two DIMMs. For reads, with two DIMMs we can almost double the peak read bandwidth, from 6.4 GB/s with one DIMM to 12.4 GB/s with two. Similarly, researchers at UCSD observed nearly linear scaling across six DIMMs.
Storage class memory, sequential reads, comparison between one and two PMEM devices.
Storage class memory, sequential writes, comparison between one and two PMEM devices.
For writes, we achieve nearly 1 GB/s of write throughput with two DIMMs relative to 0.6 GB/s with one, but the scaling is less than linear if we can extrapolate from a single data point. The UCSD researchers observed that bandwidth with six DIMMs improved by 5.6x relative to using a single DIMM, which is in line with our observation. Extrapolating from these data points, if our system had six DIMMs, we’d observe around 3.4 GB/s of peak write bandwidth, which is about 50% better than the Optane SSD. In summary, with bare device access we see about 2.5 GB/s of peak read bandwidth on the Optane SSD and about 6.4 GB/s on a single Optane PM module. With six modules, the throughput would be ~38GB/s. Write throughput was only 0.6 GB/s on a single PM module, projected to reach 3.4 GB/s with six, while the Optane SSD reached 2.2 GB/s write bandwidth.
Optane SCM has a significant edge over the SSD with respect to reads, and a small advantage in writes, provided you can afford six PM modules; otherwise, an SSD will deliver a higher write throughput.
Software caching attenuates SCM performance advantage
While SCM is closer to the speed of DRAM than traditional storage media, DRAM is still faster, so advantages of DRAM caching are difficult to overlook. The following charts show that with the buffer cache on (here I am using ext4 without the DAX option), all devices perform roughly the same, regardless of whether we are doing reads or writes, random or sequential access. These experiments were done with a warm buffer cache, i.e., the file was already in the buffer cache before I began the experiment, so here we are measuring pure DRAM performance. With access to data taking less time, software overheads become more evident, which is why mmap is much faster than system calls if we use eight or more threads.
Sequential reads on SSD and SCM with a warm buffer cache.
Random reads on SSD and SCM with a warm buffer cache.
Sequential writes on SSD and SCM with a warm buffer cache.
Random writes on SSD and SCM with a warm buffer cache.
If we begin each experiment with a cold buffer cache, the difference between the devices is still visible, but less apparent than if we bypass the buffer cache altogether. With a cold buffer cache, on the read path the OS has to copy the data from the storage device into the buffer cache before making it available to the application (hence extra software overhead). Furthermore, with the buffer cache on, the OS is not using huge pages. These factors dampen the raw read advantage of SCM. For writes, whereas SCM used to deliver lower bandwidth than SSD with raw access, now SCM outpaces SSD, likely because the buffer cache absorbs and batches some of the writes instead of flushing each one to the device immediately.
Sequential reads on SSD and SCM with a cold buffer cache.
Random reads on SSD and SCM with a cold buffer cache.
Sequential writes on SSD and SCM with a cold buffer cache.
Random writes on SSD and SCM with a cold buffer cache.
Experiments with the storage engine
Like most storage engines, WiredTiger was designed and tuned to leverage a DRAM cache. Both the WiredTiger internal cache and the OS buffer cache are crucial for performance in all workloads we measured. Running WiredTiger without the OS buffer cache (fsdax mode) reduced its performance by up to 30x in our experiments; hence, we did not use the direct-access mode. We ran WiredTiger’s wtperf benchmark suite, which was designed to stress various parts of the system and emulate typical workloads observed in practice. The WiredTiger internal cache size varies from a few to a hundred gigabytes across the benchmarks, and most benchmarks use at least a dozen threads. (Detailed configuration parameters for all benchmarks can be viewed here.) There is no locking in WiredTiger on the common path, so thread-level concurrency usually translates into high CPU utilization and concurrent I/O. As described in our previous blog post, we added a feature to WiredTiger to use mmap for I/O instead of system calls. The following chart shows the performance of the wtperf suite on Intel Optane SCM (one and two modules) and on the Optane SSD. We see no significant performance difference between SCM and SSD. Apart from one write-intensive benchmark, evict-btree-1, which is faster on SSD, there are no statistically significant differences between the two. Using a dual-module SCM over a single-module SCM gives no performance advantage either. While SCM has a higher bandwidth than a technology-equivalent SSD, the advantage is within an order of magnitude, and it turns out that effective DRAM caching hides that difference. Latency, and not bandwidth, is where SCM can shine.
In contrast to bandwidth, the latency of reading a block of data from an Optane PM is two orders of magnitude shorter than reading it from an Optane SSD: 1 microsecond vs 100-200 microseconds. The most obvious place in a storage engine where latency could be the bottleneck is logging, and academic literature is rife with successful examples of using Optane PM for logging. In the meantime, stay tuned for our next report on exploring persistent memory.
WiredTiger benchmarks on SSD and SCM. Group 1.
WiredTiger benchmarks on SSD and SCM. Group 2.
WiredTiger benchmarks on SSD and SCM. Group 3.
Getting Storage Engines Ready for Fast Storage Devices
Over the past two decades, performance of storage hardware increased by two orders of magnitude: first with the introduction of solid state drives (SSD), then with the transition from SATA to PCIe, and finally with the innovation in non-volatile memory technology and the manufacturing process [1, 7]. More recently, in April 2019, Intel released the first commercial Storage Class Memory (SCM). Its Optane DC Persistent Memory, built with 3D XPoint technology, sits on a memory bus and further reduces I/O latency [2]. While device access used to dominate I/O latency, the cost of navigating the software stack of a storage system is becoming more prominent as devices’ access time shrinks. This is resulting in a flurry of academic research and in changes to commercially used operating systems (OS) and file systems. Despite these efforts, mainstream system software is failing to keep up with rapidly evolving hardware. Studies [4, 5, 6] have shown that file system and other OS overhead still dominates the cost of I/O in very fast storage devices, such as SCMs. In response to these challenges, academics proposed a new user-level file system, SplitFS [6], that substantially reduces these overheads. Unfortunately, adopting a user-level file system is not a viable option for many commercial products. Apart from concerns about correctness, stability, and maintenance, adoption of SplitFS would restrict portability, as it only runs on Linux and only on top of the ext4-DAX file system. Fortunately, there IS something that can be done in software storage engines that care about I/O performance. Within MongoDB’s storage engine, WiredTiger, we were able to essentially remove the brakes that the file system applied to our performance without sacrificing the convenience it provides or losing portability. Our changes rely on using memory-mapped files for I/O and batching expensive file system operations.
These changes resulted in up to 63% performance improvements for 19 out of 65 benchmarks on mainstream SSDs.
Streamlining I/O in WiredTiger
Our changes to WiredTiger were inspired by a study from UCSD [4], where the authors demonstrated that by using memory-mapped files for I/O and by pre-allocating some extra space in the file whenever it needed to grow, they could achieve almost the same performance as if the file system was completely absent.
Memory-mapped files
Memory-mapped files work as follows. The application makes an mmap system call, whereby it requests the operating system to “map” a chunk of its virtual address space to a same-sized chunk in the file of its choice (Step 1 in Fig. 1). When it accesses memory in that portion of the virtual address space for the first time (e.g., virtual page 0xABCD in Fig. 1), the following events take place. Since this is a virtual address that has not been accessed before, the hardware will generate a trap and transfer control to the operating system. The operating system will determine that this is a valid virtual address, ask the file system to read the corresponding page-sized part of the file into its buffer cache, and create a page table entry mapping the user virtual page to the physical page in the buffer cache (e.g., physical page 0xFEDC in Fig. 1), where that part of the file resides (Step 2 in Fig. 1). Finally, the virtual-to-physical translation will be inserted into the Translation Lookaside Buffer (TLB, a hardware cache for these translations), and the application will proceed with the data access.
Figure 1: Memory-mapped files work as follows: (1) they establish a virtual memory area for the mapped file, (2) they place the virtual-to-physical address translation into the page table, (3) they cache the translation in the Translation Lookaside Buffer (TLB).

Subsequent accesses to the same virtual page may or may not require operating system involvement, depending on the following:

1. If the physical page containing the file data is still in the buffer cache and the page table entry is in the TLB, operating system involvement is NOT necessary, and the data will be accessed using regular load or store instructions.
2. If the page containing the file data is still in the buffer cache, but the TLB entry was evicted, the hardware will transition into kernel mode, walk the page table to find the entry (assuming x86 architecture), install it into the TLB and then let the software access the data using regular load or store instructions.
3. If the page containing the file data is not in the buffer cache, the hardware will trap into the OS, which will ask the file system to fetch the page, set up the page table entry, and proceed as in scenario 2.

In contrast, system calls cross the user/kernel boundary every time we need to access a file. Even though memory-mapped I/O also crosses the user/kernel boundary in the second and third scenarios described above, the path it takes through the system stack is more efficient than that taken by system calls. Dispatching and returning from a system call adds CPU overhead that memory-mapped I/O does not have [8]. Furthermore, if the data is copied from the memory-mapped file area to another application buffer, the copy would typically use a highly optimized AVX-based implementation of memcpy. When the data is copied from the kernel space into the user space via a system call, the kernel has to use a less efficient implementation, because the kernel does not use AVX registers [8].
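To make the contrast concrete, the sketch below reads the same byte range in two ways: through pread, which enters the kernel on every call, and through memcpy from an existing mapping, which stays in user space as long as the pages are resident and mapped. The function names read_with_syscall, read_with_mmap, and both_paths_agree are hypothetical and exist only for this illustration; they are not WiredTiger's wrappers, and the 4096-byte limit in the comparison helper is an arbitrary choice for the example.

```c
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* System-call path: each pread crosses the user/kernel boundary. */
int
read_with_syscall(int fd, void *dst, size_t len, off_t off)
{
    unsigned char *p = dst;
    ssize_t n;

    while (len > 0) {
        n = pread(fd, p, len, off);
        if (n <= 0) /* error, or end of file before `len` bytes were read */
            return (-1);
        p += n;
        len -= (size_t)n;
        off += n;
    }
    return (0);
}

/*
 * Mapped path: a bounds check followed by a user-space memcpy. `map` and
 * `map_len` describe a region the caller obtained earlier (e.g. from mmap).
 */
int
read_with_mmap(const void *map, size_t map_len, void *dst, size_t len, size_t off)
{
    if (off > map_len || len > map_len - off)
        return (-1); /* the request falls outside the mapped region */
    memcpy(dst, (const unsigned char *)map + off, len);
    return (0);
}

/*
 * Read `len` bytes at `off` both ways and compare the results.
 * Returns 1 if they match, 0 if they differ, -1 on error.
 * `file_len` is the size of the file; `len` must not exceed 4096.
 */
int
both_paths_agree(const char *path, size_t file_len, size_t off, size_t len)
{
    unsigned char a[4096], b[4096];
    void *map;
    int fd, ret;

    if (len > sizeof(a))
        return (-1);
    if ((fd = open(path, O_RDONLY)) == -1)
        return (-1);
    map = mmap(NULL, file_len, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        (void)close(fd);
        return (-1);
    }
    ret = -1;
    if (read_with_syscall(fd, a, len, (off_t)off) == 0 &&
        read_with_mmap(map, file_len, b, len, off) == 0)
        ret = memcmp(a, b, len) == 0;
    (void)munmap(map, file_len);
    (void)close(fd);
    return (ret);
}
```

Both functions return the same bytes; the difference the text describes lies in how they get there, which is why the gain shows up in throughput and CPU cost rather than in results.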
Pre-allocating file space

Memory-mapped files allow us to substantially reduce the involvement of the OS and the file system when accessing a fixed-sized file. If the file grows, however, we do need to involve the file system. The file system will update the file metadata to indicate its new size and ensure that these updates survive crashes. Ensuring crash consistency is especially expensive, because each journal record must be persisted to storage to make sure it is not lost in the event of a crash. If we grow a file piecemeal, we incur that overhead quite often. That is why the authors of SplitFS [6] and the authors of the UCSD study [4] both pre-allocate a large chunk of the file when an application extends it. In essence, this strategy batches file system operations to reduce their overhead.

Our Implementation

The team applied these ideas to WiredTiger in two phases. First, we implemented the design where the size of the mapped file area never changes. Then, after making sure that this simple design works and yields performance improvements, we added the feature of remapping files as they grow or shrink. That feature required efficient inter-thread synchronization and was the trickiest part of the whole design; we highlight it later in this section. Our changes have been in testing in the develop branch of WiredTiger as of January 2020. At the time of writing, these changes are only for POSIX systems; a Windows port is planned for the future.

Assuming a fixed-size mapped file area

Implementing this part required few code changes. WiredTiger provides wrappers for all file-related operations, so we only needed to modify those wrappers. Upon opening the file, we issue the mmap system call to also map it into the virtual address space. Subsequent calls to wrappers that read or write the file will copy the desired part of the file between the mapped area and the supplied buffer (from the mapped area for reads, into it for writes). WiredTiger allows three ways to grow or shrink the size of the file.
The file can grow explicitly via a fallocate system call (or its equivalent), it can grow implicitly if the engine writes to the file beyond its boundary, or the file can shrink via the truncate system call. In our preliminary design we disallowed explicitly growing or shrinking the file, which did not affect the correctness of the engine. If the engine writes to the file beyond the mapped area, our wrapper functions simply default to using system calls. If the engine then reads the part of the file that had not been mapped, we also resort to using a system call. While this implementation was decent as an early prototype, it was too limiting for a production system.

Resizing the mapped file area

The trickiest part of this feature is synchronization. Imagine the following scenario involving two threads, one of which is reading the file and another one truncating it. Prior to reading, the first thread would do the checks on the mapped buffer to ensure that the offset from which it reads is within the mapped buffer’s boundaries. Assuming that it is, it would proceed to copy the data from the mapped buffer. However, if the second thread intervenes just before the copy and truncates the file so that its new size is smaller than the offset from which the first thread reads, the first thread’s attempt to copy the data would result in a crash. This is because the mapped buffer is larger than the file after truncation and attempting to copy data from the part of the buffer that extends beyond the end of the file would generate a segmentation fault.

An obvious way to prevent this problem is to acquire a lock every time we need to access the file or change its size. Unfortunately, this would serialize I/O and could severely limit performance. Instead, we use a lock-free synchronization protocol inspired by read-copy-update (RCU) [9]. We will refer to all threads that might change the size of the file as writers.
A writer, therefore, is any thread that writes beyond the end of the file, extends it via a fallocate system call, or truncates it. A reader is any thread that reads the file.

Our solution works as follows: A writer first performs the operation that changes the size of the file and then remaps the file into the virtual address space. During this time we want nobody else accessing the mapped buffer, neither readers nor writers. However, it is not necessary to prevent all I/O from occurring at this time; we can simply route I/O to system calls while the writer is manipulating the mapped buffer, since system calls are properly synchronized in the kernel with other file operations. To achieve these goals without locking, we rely on two variables:

mmap_resizing: when a writer wants to indicate to others that it is about to exclusively manipulate the mapped buffer, it atomically sets this flag.
mmap_use_count: a reader increments this counter prior to using the mapped buffer, and decrements it when it is done. So this counter tells us if anyone is currently using the buffer. The writer waits until this counter goes to zero before proceeding.

Before resizing the file and the mapped buffer, writers execute the function prepare_remap_resize_file; its pseudocode is shown below. Essentially, the writer efficiently waits until no one else is resizing the buffer, then sets the resizing flag to claim exclusive rights to the operation. Then, it waits until all the readers are done using the buffer.

prepare_remap_resize_file:
wait:
    /* Wait until no one else is resizing the file. */
    while (mmap_resizing != 0)
        spin_backoff(...);
    /* Atomically set the resizing flag; if this fails, retry. */
    failed = cas(mmap_resizing, 1, …);
    if (failed)
        goto wait;
    /* Now that we set the resizing flag, wait for all readers to finish using the buffer. */
    while (mmap_use_count > 0)
        spin_backoff(...);

After executing prepare_remap_resize_file, the writer performs the file-resizing operation, unmaps the buffer, remaps it with the new size and resets the resizing flag. The synchronization performed by the readers is shown in the pseudocode of the function read_mmap:

read_mmap:
    /* Atomically increment the reference counter, so no one unmaps the buffer while we use it. */
    atomic_add(mmap_use_count, 1);
    /* If the buffer is being resized, use the system call instead of the mapped buffer. */
    if (mmap_resizing) {
        atomic_decr(mmap_use_count, 1);
        read_syscall(...);
    } else {
        memcpy(dst_buffer, mapped_buffer, …);
        atomic_decr(mmap_use_count, 1);
    }

As a side note, threads writing the file must perform both the reader synchronization, as in read_mmap, to see if they can use the memory-mapped buffer for I/O, and the writer synchronization in the case they are writing past the end of the file (hence extending its size). Please refer to the WiredTiger develop branch for the complete source code.

Batching file system operations

As we mentioned earlier, a crucial finding of the UCSD study that inspired our design [4] was the need to batch expensive file system operations by pre-allocating file space in large chunks. Our experiments with WiredTiger showed that it already uses this strategy to some extent. We ran experiments comparing two configurations: (1) In the default configuration WiredTiger uses the fallocate system call to grow files. (2) In the restricted configuration WiredTiger is not allowed to use fallocate and thus resorts to implicitly growing files by writing past their end. We measured the number of file system invocations in both cases and found that it was at least an order of magnitude smaller in the default configuration than in the restricted.
This tells us that WiredTiger already batches file system operations. Investigating whether batching can be optimized for further performance gains is planned for the future.

Performance

To measure the impact of our changes, we compared the performance of the mmap branch and the develop branch on the WiredTiger benchmark suite WTPERF. WTPERF is a configurable benchmarking tool that can emulate various data layouts, schemas, and access patterns while supporting all kinds of database configurations. Out of 65 workloads, the mmap branch improved performance for 19. Performance of the remaining workloads either remained unchanged or showed insignificant changes (within two standard deviations of the average). Variance in performance of two workloads (those that update a log-structured merge tree) increased by a few percent, but apart from these, we did not observe any downsides to using mmap.

The figures below show the performance improvement, in percent, of the mmap branch relative to develop for the 19 benchmarks where mmap made a difference. The experiments were run on a system with an Intel Xeon processor E5-2620 v4 (eight cores), 64GB of RAM and an Intel Pro 6000p series 512GB SSD drive. We used default settings for all the benchmarks and ran each at least three times to ensure the results are statistically significant.

Figure caption: All but 2 of the benchmarks where mmap made a difference show significant improvements.

Overall, there are substantial performance improvements for these workloads, but there are a couple of interesting exceptions. For 500m-btree-50r50u and for update-btree some operations (e.g., updates or inserts) are a bit slower with mmap, but others (typically reads) are substantially faster. It appears that some operations benefit from mmap at the expense of others; we are still investigating why this is happening.

One of the variables that explains improved performance with mmap is increased rate of I/O.
For example, for the 500m-btree-50r50u workload (this workload simulates a typical MongoDB load) the read I/O rate is about 30% higher with mmap than with system calls. This statistic does not explain everything: after all, read throughput for this workload is 63% better with mmap than with system calls. Most likely, the rest of the difference is due to more efficient code paths of memory-mapped I/O (as opposed to going through system calls), as observed in earlier work. Indeed, we typically observe a higher CPU utilization when using mmap.

Conclusion

Throughput and latency of storage devices improve at a higher rate than CPU speed thanks to radical innovations in storage technology and the placement of devices in the system. Faster storage devices reveal inefficiencies in the software stack. In our work we focused on overhead related to system calls and file system access and showed how it can be reduced by employing memory-mapped I/O. Our changes in the WiredTiger storage engine yielded up to 63% improvement in read throughput. For more information on our implementation, we encourage you to take a look at the files os_fs.c and os_fallocate.c in the os_posix directory of the WiredTiger develop branch.

References

[1] List of Intel SSDs. https://en.wikipedia.org/wiki/List_of_Intel_SSDs
[2] Optane DC Persistent Memory. https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-dc-persistent-memory.html
[3] Linux® Storage System Analysis for e.MMC with Command Queuing. https://www.micron.com/-/media/client/global/documents/products/white-paper/linux_storage_system_analysis_emmc_command_queuing.pdf?la=en
[4] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS ’19). http://cseweb.ucsd.edu/~juk146/papers/ASPLOS2019-APP.pdf
[5] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. 14th USENIX Conference on File and Storage Technologies (FAST ’16). https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf
[6] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP ’19). https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf
[7] SSD vs HDD. https://www.enterprisestorageforum.com/storage-hardware/ssd-vs-hdd.html
[8] Why mmap is faster than system calls. https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37
[9] Paul McKenney. What is RCU, fundamentally? https://lwn.net/Articles/262464/