March 17, 2020
Today I am announcing that my last day at MongoDB will be July 10th.
It’s been an incredible 12.5 years at MongoDB that, in a way, all goes back to my days as an undergraduate CS student in the early 2000s. While many of the topics in Brown University’s database systems class were applicable to any database, like b-trees and -- of course -- pointer swizzling, everything was ultimately presented as a building block for relational databases. Any other database style was relegated to the final, semester-ending lecture.
It made sense at the time to teach things that way. But in every software endeavor I worked on, from trivial weekend projects to multi-year enterprises, databases just kept getting in the way. Often maddeningly so.
After many years of partnering on multiple applications, Dwight Merriman and I set out to make a database we actually wanted to use, so we would never again have to deal with the tyranny of relational databases. We wanted something that was, at its core, born of the same assumptions that we, as developers, have about programming. Data should be stored the way you think about working with it. Databases should be able to scale horizontally and be spread globally like any other parts of your applications. Change is inevitable and iteration is more important than knowing answers ahead of time.
I’m proud to say we’ve absolutely accomplished what Dwight and I set out to do. Not only is MongoDB the database I’d always wished I had, but the database landscape itself is entirely transformed. No self-respecting database would be caught without a document solution, and distributed systems are becoming the new norm. Moreover, by pursuing new ideas, we helped spur more innovation in the database space in the past ten years than had been seen in decades. I predict that in another ten years, the document model and distributed databases will be the standard. Now I can look forward to tackling many new challenges, knowing that I won’t have to worry about the database getting in my way.
An outcome I’m equally proud of is that in the pursuit of that technical goal, we built a phenomenal engineering team with a strong commitment to collaboration, honesty, ambition, and ownership. The early days of me and Dwight moving fast and experimenting were great fun, but without this team’s incredible efforts, MongoDB would have remained a pipe dream, and I doubt we’d have seen the explosion of use cases in the market that we now take for granted.
I will miss building MongoDB, but I leave it confidently in the team’s brilliant, capable, and talented hands. I look forward to seeing what they will accomplish.
Getting Storage Engines Ready for Fast Storage Devices
Over the past two decades, the performance of storage hardware has increased by two orders of magnitude: first with the introduction of solid state drives (SSDs), then with the transition from SATA to PCIe, and more recently with innovation in non-volatile memory technology and manufacturing processes [1, 7]. In April 2019, Intel released the first commercial Storage Class Memory (SCM). Its Optane DC Persistent Memory, built with 3D XPoint technology, sits on the memory bus and further reduces I/O latency [2].

While device access used to dominate I/O latency, the cost of navigating the software stack of a storage system becomes more prominent as device access times shrink. This has resulted in a flurry of academic research and in changes to commercially used operating systems (OS) and file systems. Despite these efforts, mainstream system software is failing to keep up with rapidly evolving hardware. Studies [4, 5, 6] have shown that file system and other OS overheads still dominate the cost of I/O on very fast storage devices, such as SCMs. In response to these challenges, academics proposed a new user-level file system, SplitFS [6], that substantially reduces these overheads. Unfortunately, adopting a user-level file system is not a viable option for many commercial products. Apart from concerns about correctness, stability, and maintenance, adopting SplitFS would restrict portability: it runs only on Linux and only on top of the ext4-DAX file system.

Fortunately, there is something that storage engines that care about I/O performance can do in software. Within MongoDB's storage engine, WiredTiger, we were able to essentially remove the brakes that the file system applied to our performance, without sacrificing the convenience it provides or losing portability. Our changes rely on using memory-mapped files for I/O and on batching expensive file system operations.
These changes resulted in performance improvements of up to 63% for 19 out of 65 benchmarks on mainstream SSDs.

Streamlining I/O in WiredTiger

Our changes to WiredTiger were inspired by a study from UCSD [4], in which the authors demonstrated that by using memory-mapped files for I/O and by pre-allocating some extra space in a file whenever it needed to grow, they could achieve almost the same performance as if the file system were completely absent.

Memory-mapped files

Memory-mapped files work as follows. The application makes an mmap system call, asking the operating system to "map" a chunk of its virtual address space to a same-sized chunk of the file of its choice (Step 1 in Fig. 1). When the application accesses memory in that portion of the virtual address space for the first time (e.g., virtual page 0xABCD in Fig. 1), the following events take place:

1. Since this is a virtual address that has not been accessed before, the hardware generates a trap and transfers control to the operating system.
2. The operating system determines that this is a valid virtual address and asks the file system to read the corresponding page-sized part of the file into its buffer cache.
3. The operating system creates a page table entry mapping the user virtual page to the physical page in the buffer cache (e.g., physical page 0xFEDC in Fig. 1) where that part of the file resides (Step 2 in Fig. 1).
4. Finally, the virtual-to-physical translation is inserted into the Translation Lookaside Buffer (TLB, a hardware cache for these translations), and the application proceeds with the data access.
Fig. 1: Memory-mapped files work as follows: (1) establish a virtual memory area for the mapped file; (2) place the virtual-to-physical address translation into the page table; (3) cache the translation in the Translation Lookaside Buffer (TLB).

Subsequent accesses to the same virtual page may or may not require operating system involvement, depending on the following:

1. If the physical page containing the file data is still in the buffer cache and the page table entry is in the TLB, no operating system involvement is necessary, and the data is accessed using regular load or store instructions.
2. If the page containing the file data is still in the buffer cache but the TLB entry has been evicted, the hardware transitions into kernel mode, walks the page table to find the entry (assuming the x86 architecture), installs it into the TLB, and then lets the software access the data using regular load or store instructions.
3. If the page containing the file data is not in the buffer cache, the hardware traps into the OS, which asks the file system to fetch the page, sets up the page table entry, and proceeds as in scenario 2.

In contrast, system calls cross the user/kernel boundary every time we access a file. Even though memory-mapped I/O also crosses the user/kernel boundary in the second and third scenarios described above, the path it takes through the system stack is more efficient than the one taken by system calls. Dispatching and returning from a system call adds CPU overhead that memory-mapped I/O does not have [8]. Furthermore, if data is copied from the memory-mapped file area to another application buffer, the copy typically uses a highly optimized AVX-based implementation of memcpy. When data is copied from kernel space into user space via a system call, the kernel has to use a less efficient implementation, because the kernel does not use AVX registers [8].
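As a concrete illustration of the read path described above, here is a minimal sketch of reading a file through a memory mapping on a POSIX system. This is not WiredTiger's actual wrapper code; the function names (map_file, read_mapped) and the error handling are our own simplifications.

```c
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map an entire file read-only into the virtual address space.
 * Returns the mapping address, or MAP_FAILED on error. */
static void *map_file(const char *path, size_t *len_out)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return MAP_FAILED;

    struct stat st;
    if (fstat(fd, &st) != 0) {
        close(fd);
        return MAP_FAILED;
    }
    *len_out = (size_t)st.st_size;

    void *addr = mmap(NULL, *len_out, PROT_READ, MAP_SHARED, fd, 0);
    close(fd); /* the mapping remains valid after the fd is closed */
    return addr;
}

/* "Read" from the file: a plain memcpy from the mapped region.
 * First-touch accesses take the fault path described above; accesses
 * to already-faulted pages with warm TLB entries never enter the kernel. */
static void read_mapped(const void *map, size_t offset, void *dst, size_t n)
{
    memcpy(dst, (const char *)map + offset, n);
}
```

Once map_file has returned, each read_mapped call is an ordinary memory copy; only first-touch accesses and buffer-cache misses involve the operating system, which is exactly the saving the article describes.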
Pre-allocating file space

Memory-mapped files allow us to substantially reduce the involvement of the OS and the file system when accessing a fixed-size file. If the file grows, however, we do need to involve the file system: it must update the file's metadata to reflect the new size and ensure that these updates survive crashes. Ensuring crash consistency is especially expensive, because each journal record must be persisted to storage to make sure it is not lost in the event of a crash. If we grow a file piecemeal, we incur that overhead quite often. That is why the authors of SplitFS [6] and the authors of the UCSD study [4] both pre-allocate a large chunk of the file when an application extends it. In essence, this strategy batches file system operations to reduce their overhead.

Our Implementation

The team applied these ideas to WiredTiger in two phases. First, we implemented a design in which the size of the mapped file area never changes. Then, after making sure that this simple design worked and yielded performance improvements, we added the ability to remap files as they grow or shrink. That feature required efficient inter-thread synchronization and was the trickiest part of the whole design; we highlight it later in this section. Our changes have been in testing on the develop branch of WiredTiger since January 2020. As of this writing, the changes are for POSIX systems only; a Windows port is planned for the future.

Assuming a fixed-size mapped file area

Implementing this part required few code changes. WiredTiger provides wrappers for all file-related operations, so we only needed to modify those wrappers. Upon opening a file, we issue the mmap system call to also map it into the virtual address space. Subsequent calls to wrappers that read or write the file copy the desired part of the file between the mapped area and the supplied buffer. WiredTiger allows three ways to grow or shrink the size of a file.
The file can grow explicitly via a fallocate system call (or its equivalent), it can grow implicitly if the engine writes to the file beyond its end, or it can shrink via the truncate system call. In our preliminary design we disallowed explicitly growing or shrinking the file, which did not affect the correctness of the engine. If the engine writes to the file beyond the mapped area, our wrapper functions simply fall back to using system calls, and if the engine then reads a part of the file that has not been mapped, we likewise resort to a system call. While this implementation was decent as an early prototype, it was too limiting for a production system.

Resizing the mapped file area

The trickiest part of this feature is synchronization. Imagine a scenario involving two threads, one reading the file and the other truncating it. Prior to reading, the first thread checks that the offset it reads from lies within the mapped buffer's boundaries. Assuming it does, the thread proceeds to copy the data from the mapped buffer. However, if the second thread intervenes just before the copy and truncates the file so that its new size is smaller than the offset the first thread reads from, the first thread's attempt to copy the data will crash: the mapped buffer is now larger than the file, and attempting to copy data from the part of the buffer that extends beyond the end of the file generates a segmentation fault.

An obvious way to prevent this problem is to acquire a lock every time we access the file or change its size. Unfortunately, this would serialize I/O and could severely limit performance. Instead, we use a lock-free synchronization protocol inspired by read-copy-update (RCU) [9]. We will refer to all threads that might change the size of the file as writers.
A writer, therefore, is any thread that writes beyond the end of the file, extends it via a fallocate system call, or truncates it. A reader is any thread that reads the file.

Our solution works as follows. A writer first performs the operation that changes the size of the file and then remaps the file into the virtual address space. During this time we want nobody else accessing the mapped buffer, neither readers nor writers. However, it is not necessary to prevent all I/O from occurring at this time; we can simply route I/O to system calls while the writer is manipulating the mapped buffer, since system calls are properly synchronized in the kernel with other file operations. To achieve these goals without locking, we rely on two variables:

mmap_resizing: when a writer wants to indicate to others that it is about to exclusively manipulate the mapped buffer, it atomically sets this flag.

mmap_use_count: a reader increments this counter prior to using the mapped buffer and decrements it when it is done, so this counter tells us whether anyone is currently using the buffer. A writer waits until this counter drops to zero before proceeding.

Before resizing the file and the mapped buffer, writers execute the function prepare_remap_resize_file; its pseudocode is shown below. Essentially, the writer efficiently waits until no one else is resizing the buffer, then sets the resizing flag to claim exclusive rights to the operation. Then it waits until all readers are done using the buffer.

    prepare_remap_resize_file:
        wait:
        /* Wait until no one else is resizing the file. */
        while (mmap_resizing != 0)
            spin_backoff(...);

        /* Atomically set the resizing flag; if this fails, retry. */
        result = cas(mmap_resizing, 1, ...);
        if (result) goto wait;

        /* Now that we have set the resizing flag,
         * wait for all readers to finish using the buffer. */
        while (mmap_use_count > 0)
            spin_backoff(...);

After executing prepare_remap_resize_file, the writer performs the file-resizing operation, unmaps the buffer, remaps it with the new size, and resets the resizing flag. The synchronization performed by readers is shown in the pseudocode of the function read_mmap:

    read_mmap:
        /* Atomically increment the reference counter,
         * so no one unmaps the buffer while we use it. */
        atomic_add(mmap_use_count, 1);

        /* If the buffer is being resized, use the
         * system call instead of the mapped buffer. */
        if (mmap_resizing)
            atomic_decr(mmap_use_count, 1);
            read_syscall(...);
        else
            memcpy(dst_buffer, mapped_buffer, ...);
            atomic_decr(mmap_use_count, 1);

As a side note, threads writing the file must perform both the reader synchronization, as in read_mmap, to see whether they can use the memory-mapped buffer for I/O, and the writer synchronization when they are writing past the end of the file (hence extending its size). Please refer to the WiredTiger develop branch for the complete source code.

Batching file system operations

As we mentioned earlier, a crucial finding of the UCSD study that inspired our design [4] was the need to batch expensive file system operations by pre-allocating file space in large chunks. Our experiments showed that WiredTiger already uses this strategy to some extent. We ran experiments comparing two configurations: (1) the default configuration, in which WiredTiger uses the fallocate system call to grow files, and (2) a restricted configuration, in which WiredTiger is not allowed to use fallocate and thus resorts to growing files implicitly by writing past their end. We measured the number of file system invocations in both cases and found that it was at least an order of magnitude smaller in the default configuration than in the restricted one.
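As a rough sketch of this chunked pre-allocation strategy on POSIX: the helper below grows a file in large fixed-size chunks so that the file system journals the size change once per chunk instead of once per write. The 16 MiB chunk size and the ensure_capacity helper name are illustrative assumptions, not values or code taken from WiredTiger.

```c
#include <fcntl.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Illustrative chunk size: pre-allocate 16 MiB at a time. */
#define ALLOC_CHUNK ((off_t)16 * 1024 * 1024)

/* Ensure the file behind fd is at least min_size bytes long,
 * rounding the allocation up to the next ALLOC_CHUNK boundary.
 * One fallocate then covers many future writes. */
static int ensure_capacity(int fd, off_t min_size)
{
    struct stat st;
    if (fstat(fd, &st) != 0)
        return -1;
    if (st.st_size >= min_size)
        return 0; /* already large enough: no metadata update needed */

    off_t target = ((min_size + ALLOC_CHUNK - 1) / ALLOC_CHUNK) * ALLOC_CHUNK;
    return posix_fallocate(fd, 0, target); /* returns 0 on success */
}
```

Growing a file piecemeal would pay the journal-commit cost on nearly every extension; with a helper like this, writes that land within the pre-allocated chunk involve no file system metadata updates at all.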
These measurements tell us that WiredTiger already batches file system operations. Investigating whether batching can be optimized for further performance gains is planned for the future.

Performance

To measure the impact of our changes, we compared the performance of the mmap branch and the develop branch on WTPERF, the WiredTiger benchmark suite. WTPERF is a configurable benchmarking tool that can emulate various data layouts, schemas, and access patterns while supporting all kinds of database configurations. Out of 65 workloads, the mmap branch improved performance for 19. Performance of the remaining workloads either remained unchanged or showed insignificant changes (within two standard deviations of the average). The variance in performance of two workloads (those that update a log-structured merge tree) increased by a few percent, but apart from these, we did not observe any downsides to using mmap.

The figures below show the performance improvement, in percent, of the mmap branch relative to develop for the 19 benchmarks where mmap made a difference. The experiments were run on a system with an Intel Xeon E5-2620 v4 processor (eight cores), 64GB of RAM, and a 512GB Intel Pro 6000p series SSD. We used default settings for all benchmarks and ran each at least three times to ensure the results were statistically significant.

[Figure: All but 2 of the benchmarks where mmap made a difference show significant improvements.]

Overall, there are substantial performance improvements for these workloads, but there are a couple of interesting exceptions. For 500m-btree-50r50u and update-btree, some operations (e.g., updates or inserts) are a bit slower with mmap, while others (typically reads) are substantially faster. It appears that some operations benefit from mmap at the expense of others; we are still investigating why this is happening. One variable that explains the improved performance with mmap is an increased rate of I/O.
For example, for the 500m-btree-50r50u workload (which simulates a typical MongoDB load), the read I/O rate is about 30% higher with mmap than with system calls. This statistic does not explain everything: after all, read throughput for this workload is 63% better with mmap than with system calls. Most likely, the rest of the difference is due to the more efficient code paths of memory-mapped I/O (as opposed to going through system calls), as observed in earlier work [8]. Indeed, we typically observe higher CPU utilization when using mmap.

Conclusion

Throughput and latency of storage devices are improving at a higher rate than CPU speed, thanks to radical innovations in storage technology and in the placement of devices in the system. Faster storage devices reveal inefficiencies in the software stack. In our work, we focused on the overhead related to system calls and file system access, and showed how it can be avoided by employing memory-mapped I/O. Our changes to the WiredTiger storage engine yielded up to a 63% improvement in read throughput. For more information on our implementation, we encourage you to take a look at the files os_fs.c and os_fallocate.c in the os_posix directory of the WiredTiger develop branch.

References

[1] List of Intel SSDs. https://en.wikipedia.org/wiki/List_of_Intel_SSDs
[2] Optane DC Persistent Memory. https://www.intel.ca/content/www/ca/en/architecture-and-technology/optane-dc-persistent-memory.html
[3] Linux Storage System Analysis for e.MMC with Command Queuing. https://www.micron.com/-/media/client/global/documents/products/white-paper/linux_storage_system_analysis_emmc_command_queuing.pdf?la=en
[4] Jian Xu, Juno Kim, Amirsaman Memaripour, and Steven Swanson. 2019. Finding and Fixing Performance Pathologies in Persistent Memory Software Stacks. In 2019 Architectural Support for Programming Languages and Operating Systems (ASPLOS '19). http://cseweb.ucsd.edu/~juk146/papers/ASPLOS2019-APP.pdf
[5] Jian Xu and Steven Swanson. NOVA: A Log-structured File System for Hybrid Volatile/Non-volatile Main Memories. In 14th USENIX Conference on File and Storage Technologies (FAST '16). https://www.usenix.org/system/files/conference/fast16/fast16-papers-xu.pdf
[6] Rohan Kadekodi, Se Kwon Lee, Sanidhya Kashyap, Taesoo Kim, Aasheesh Kolli, and Vijay Chidambaram. 2019. SplitFS: Reducing Software Overhead in File Systems for Persistent Memory. In Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP '19). https://www.cs.utexas.edu/~vijay/papers/sosp19-splitfs.pdf
[7] SSD vs. HDD. https://www.enterprisestorageforum.com/storage-hardware/ssd-vs-hdd.html
[8] Why mmap is faster than system calls. https://medium.com/@sasha_f/why-mmap-is-faster-than-system-calls-24718e75ab37
[9] Paul McKenney. What is RCU, Fundamentally? https://lwn.net/Articles/262464/
Sales Development Series: Meet the EMEA Account Development Team
Sales Development is a crucial part of the Sales organization at MongoDB. Our Sales Development function is broken down into Sales Development Representatives (SDRs), who qualify and validate inbound opportunities from both existing and prospective customers, and Account Development Representatives (ADRs), who support outbound opportunities by planning and executing pipeline generation strategies. Both of these roles offer an excellent path to kickstarting your career in sales at MongoDB. In this blog post, you’ll learn more about our EMEA (Europe, the Middle East, and Africa) outbound ADR team, which is divided into territories covering the UK & Ireland, the Nordics & Benelux, Central Europe, and Southern Europe. Hear from Manager David Sinnott and a few Account Development Representatives about the ADR role, team culture, and how MongoDB is enabling ADRs to grow their careers. Check out the first blog in our Sales Development series here.

An overview of Account Development in EMEA

David Sinnott, Sales Development Manager for the UK & Ireland

The Account Development team works very closely with our Enterprise Sales organization, supporting some of our largest customers across all industries. ADRs partner with Enterprise Account Executives to identify and uncover some of the biggest challenges facing their customers and, through further discovery, position MongoDB as the solution to help solve those challenges. I started my own career in tech sales as a Sales Development Representative 11 years ago. In tech sales, reps have lots of successes and challenges, and personally, I have always used these experiences as a way to better myself. My advice to reps just starting out: when things are not going to plan, take a step back to analyze why, learn from it, and implement some new methods to avoid it happening again. The opportunity to learn never stops at MongoDB. My team and I learn something new every day!
Our products are always evolving, and we continue to release new features and functionality, so we continually provide training around all of this. ADRs also spend a great deal of time learning about and implementing the sales methodology frameworks that MongoDB uses across the entire Sales organization. There are promotion paths available to all of the ADRs, whether that means staying in Sales or exploring other parts of the business, such as Marketing or Customer Success. All of the knowledge and skills picked up during their time as ADRs ensure that they hit the ground running once they are promoted to their next role within the business, whatever that may be. Some of the most successful Corporate and Enterprise reps at MongoDB started their own careers here as part of the ADR program. We do our absolute best to support all team members in deciding what the best long-term career path is for them. MongoDB is disrupting an industry that has largely not changed in over 40 years. We currently have around a 1% share of the database market, which IDC predicts will be close to $119B by 2025, so the potential for MongoDB is still massive. With data at the core of every modern business, organizations are having to modernize their legacy technology stacks and are starting to move more of their business functions to the cloud. MongoDB has an opportunity to play a big part in all of these initiatives and transformations. It’s still an incredibly exciting time for any sales rep out there who may be considering MongoDB for their next move.

Hear from some team members

Johanna Sterneck, Sr. Account Development Representative for Central Europe

I joined MongoDB because I wanted to be part of a fast-growing, successful company that would help me grow professionally and personally. Over the past 10 months, every day has been a new experience, and I feel that I’ve become part of something bigger.
My onboarding experience was completely remote, but my team, manager, and everyone else at MongoDB have been very welcoming and supportive. The entire onboarding process was very well structured, which allowed me to ramp up quickly. As an ADR, persistence in getting things done and positivity are definitely key factors in my role. What’s exciting is learning from the people around me and the great feedback culture we have. My team is very supportive, caring, and fun, and we are all happy to go the extra mile to achieve our goals.

Federica Ramondino, Sr. Account Development Representative for Southern Europe

I joined MongoDB because I believed it was a company where I could develop my skills and grow professionally. I’ve stayed because it lived up to my expectations! I see a clear career path for myself here, and I am excited to progress into my next role and get closer to my final objective of becoming a manager. To excel in an ADR role, you need dedication, good time and stakeholder management skills, and a positive attitude! My team is an amazing bunch of people who are always positive and keen on helping each other, even in a constantly evolving environment. What’s exciting about this role is all the other teams you get to work with and learn from, from Sales to Customer Success and Marketing.

Ruhan Jay Bora, Sr. Account Development Representative for the UK & Ireland

I joined MongoDB because I was keen to work for a company creating experiences for the future, and I wanted to be a key player in helping companies digitally transform. I see myself staying at MongoDB for a while because of the heavy emphasis that leadership places on development. I have monthly catch-up sessions with the VP of Sales for EMEA and the VP of Cloud Partners, and regular 1:1s with my managers. Not a day goes by where I feel like I’m stagnating, and between learning about the latest in tech and sharpening my client-facing skills, there is plenty more room to grow!
If you want to be successful as an ADR, the first thing you need is a tremendous work ethic. I believe sales is ultimately a game of grit, perseverance, and resilience. It’s not easy to learn so many technical concepts in the span of a few weeks, but our Sales Enablement team has compiled a bevy of excellent, readily digestible content that makes upskilling on MongoDB much easier. I will be moving into a new organization formed by our Sales team called the Associate Account Executive program. I harbor an ambition to become an Enterprise Account Executive, and this program will help me develop the skills needed to work regularly with some of our most exciting clients! Seeing a client's satisfaction and astonishment at how MongoDB can solve some of their technical and business challenges is truly amazing. Hearing how great MongoDB is directly from clients makes you realize we really have a great product. I also find that the opportunity to accelerate your career here is extremely tangible. The company is young enough for you to shape your own path, and no goal is too ambitious. The ability to engage with senior leadership up to the C-level is great, too. Interested in joining the Sales team at MongoDB? We have several open roles on our team and would love for you to transform your career with us!