MongoDB Blog

Articles, announcements, news, updates and more

4 Critical Features for a Modern Payments System

The business systems of many traditional banks rely on solutions that are decades old. These systems, which are built on outdated, inflexible relational databases, prevent traditional banks from competing with industry disruptors and those already adopting more modern approaches. Such outdated systems are ill-equipped to handle one of the core offerings that customers expect from banks today — instantaneous, cashless, digital payments. The relational database management systems (RDBMSes) at the core of these applications require breaking data structures into a complex web of tables. Originally, this tabular approach was necessary to minimize memory and storage footprints. But as hardware has become cheaper and more powerful, those advantages have become less relevant. Instead, the complexity of this model results in data management and programmatic access issues. In this article, we’ll look at how a document database can simplify complexity and provide the scalability, performance, and other features required in modern business applications.

Document model

To stay competitive, many financial institutions will need to update their foundational data architecture and introduce a data platform that enables a flexible, real-time, and enriched customer experience. Without this, new apps and other services won’t be able to deliver significant value to the business. A document model eliminates the need for an intricate web of related tables. Adding new data to a document is relatively easy and quick, since it can be done without the lengthy reorganization that RDBMSes usually require.

What makes a document database different from a relational database?

An intuitive data model simplifies and accelerates development work.
A flexible schema allows modification of fields at any time, without disruptive migrations.
An expressive query language and rich indexing enhance query flexibility.
The universal JSON standard lets you structure data to meet application requirements.
A distributed approach improves resiliency and enables global scalability.

With a document database, there is no need for complicated multi-level joins for business objects, such as a bill or even a complex financial derivative, which often require object-relational mapping with complex stored procedures. Such stored procedures, which are written in custom languages, not only increase the cognitive load on developers but also are fiendishly hard to test. The resulting lack of automated tests is a major impediment to the adoption of agile software development methods.

Required features

Let’s look at four critical features that modern applications require for a successful overhaul of payment systems and how MongoDB can help address those needs.

1. Scalability

Modern applications must operate at scales that were unthinkable just a few years ago, both in transaction volume and in the number of development and test environments needed to support rapid development. Evolving consumer trends have also put higher demands on payment systems. Not only has the number of transactions increased, but the responsive experiences that customers expect have increased the query load, and data volumes are growing super-linearly. The fully transactional RDBMS model is ill suited to support this level of performance and scale. Consequently, most organizations have created a plethora of caching layers, data warehouses, and aggregation and consolidation layers that create complexity, consume valuable developer time, add cognitive load, and increase costs.
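To make the contrast with the tabular approach concrete, here is a minimal sketch, in Python with PyMongo, of how a payment and its lifecycle events can live in a single document and be retrieved with one query. The collection name, field names, and connection string are illustrative assumptions, not taken from the article.

```python
# Illustrative sketch only: collection name, fields, and connection string are
# hypothetical, not taken from the article.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
payments = client["bank"]["payments"]

# A payment and its related details live together in one document,
# instead of being normalized across many tables.
payments.insert_one({
    "paymentId": "PAY-2022-000123",
    "status": "SETTLED",
    "amount": {"value": 250_00, "currency": "EUR"},  # minor units avoid float rounding
    "debtor": {"name": "Alice GmbH", "iban": "DE89370400440532013000"},
    "creditor": {"name": "Bob SARL", "iban": "FR1420041010050500013M02606"},
    "scheme": "SEPA_INSTANT",
    "createdAt": datetime.now(timezone.utc),
    "events": [  # lifecycle events embedded as an array, no join table needed
        {"type": "INITIATED", "at": datetime.now(timezone.utc)},
        {"type": "CLEARED", "at": datetime.now(timezone.utc)},
    ],
})

# One query returns the whole business object, with no multi-table join.
doc = payments.find_one({"paymentId": "PAY-2022-000123"})
print(doc["status"], doc["amount"])
```

Because the whole business object comes back in one round trip, there is no join to optimize and far less pressure to bolt on a separate caching layer.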
To work efficiently, developers also need to be able to quickly create and tear down development and test environments, and this is only possible by leveraging the cloud. Traditional RDBMSes, however, are ill suited for cloud deployment. They are very sensitive to network latency, as business objects spread across multiple tables can only be retrieved through multiple sequential queries. MongoDB provides the scalability and performance that modern applications require. MongoDB’s developer data platform also ensures that the same data is available for use with other frequent consumption patterns, like time series and full-text search. Thus, there is no need for custom replication code between the operational and analytical datastores.

2. Resiliency

Many existing payment platforms were designed and architected when networking was expensive and slow. They depend on high-quality hardware with low redundancy for resilience. Not only is this approach very expensive, but hardware redundancy alone can never match the resiliency of a properly distributed system. At the core of MongoDB’s developer data platform is MongoDB Atlas, the most advanced cloud database service on the market. MongoDB Atlas can run in any cloud, or even across multiple clouds, and offers 99.995% uptime. That allowance for downtime is far less than what is typically needed just to apply necessary security updates to a monolithic legacy database system.

3. Locality and global coverage

Modern computing demands are at once ubiquitous and highly localized. Customers expect to be able to view their cash balances wherever they are, but client secrecy and data availability rules set strict guardrails on where data can be hosted and processed. The combination of geo-sharding, replication, and edge data addresses these problems, and MongoDB Atlas in combination with MongoDB for Mobile brings these powerful tools to the developer. During the global pandemic, more consumers than ever have begun using their smartphones as payment terminals. To enable these rich functions, data must be held at the edge. Developing the synchronization of that data is difficult, however, and not a differentiator for financial institutions. MongoDB for Mobile, combined with MongoDB’s geo-sharding capability on Atlas, offloads this complexity from the developer.

4. Diverse workloads and workload isolation

As more services and opportunities are developed, the demand to use the same data for multiple purposes is growing. Although legacy systems are well suited to support functions such as double-entry accounting, when the same information has to be served up to a customer portal, the central credit engine, or an AI/ML algorithm, the limits of relational databases become obvious. These limitations have resulted in developers following what is often called a “best-of-breed” practice: Data is replicated from the transactional core to a secondary, read-only datastore based on technology that is better suited to the particular workload. Typical examples are transactional data stores being copied nightly into data lakes to be available for AI/ML modelers. The additional hardware and licensing costs for this replication are not prohibitive, but the complexity of replication and synchronization, and the complicated semantics introduced by batch dumps, slow down development and increase both development and maintenance costs. Often, three or more different technologies are necessary to facilitate these usage patterns.
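In contrast to the best-of-breed replication pattern just described, here is a hedged sketch of full-text search running directly against the operational collection. It assumes an Atlas cluster with an Atlas Search index named "default" on the same hypothetical payments collection used above; the index name, fields, and URI are placeholders.

```python
# Hedged sketch: assumes an Atlas cluster with an Atlas Search index named
# "default" on bank.payments; index name, fields, and URI are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")  # placeholder
payments = client["bank"]["payments"]

# Full-text search runs on the same collection that serves transactions,
# via the $search aggregation stage, with no separate search cluster to sync.
pipeline = [
    {"$search": {
        "index": "default",
        "text": {"query": "Alice GmbH", "path": ["debtor.name", "creditor.name"]},
    }},
    {"$limit": 5},
    {"$project": {"paymentId": 1, "debtor.name": 1, "amount": 1, "_id": 0}},
]
for doc in payments.aggregate(pipeline):
    print(doc)
```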
With its developer data platform, MongoDB has integrated this replication, eliminating that complexity for developers. When a document is updated in the transactional datastore, MongoDB automatically makes it available for full-text search and time series analytics. The pace of change in the payments industry shows no signs of slowing, and to stay competitive, it’s vital that you reassess your technology architecture. MongoDB Atlas is emerging as the technology of choice for many financial services firms that want to free their data, empower developers, and embrace disruption. Replacing legacy relational databases with a modern document database is a key step toward enhancing agility, controlling costs, better addressing consumer expectations, and achieving compliance with new regulations. Learn more by downloading our white paper, “Modernize Your Payment Systems.”

August 8, 2022
Applied

Introducing the Ability to Independently Scale Analytics Node Tiers for MongoDB Atlas

We’re excited to announce analytics node tiers for MongoDB Atlas! Analytics node tiers provide greater control and flexibility by allowing you to customize the exact infrastructure you need for your analytics workloads. Analytics node tiers provide control and flexibility Until now, analytics nodes in MongoDB’s Atlas clusters have used the same cluster tier as all other nodes. However, operational and analytical workloads can vary greatly in terms of resource requirements. Analytics node tiers allow you to enhance the performance of your analytics workloads by choosing the best tier size for your needs. This means you can choose an analytics node tier larger or smaller than the operational nodes in your cluster. This added level of customization ensures you achieve the performance required for both transactional and analytical queries — without the need to over- or under-provision your entire cluster for the sake of the analytical workload. Analytics node tiers are available in both Atlas and Atlas for Government . A standard replica set contains a primary node for reads and writes and two secondary nodes that are read only. Analytics nodes provide an additional read-only node that is dedicated to analytical reads. Choose a higher or lower analytics node tier based on your analytics needs Teams with large user bases using their BI dashboards may want to increase their analytics node tiers above that of their operational nodes. Choosing a higher tier can be useful when you have many users or require more memory to serve analytics needs. Scaling up the entire cluster tier would be costly, but scaling up just your analytics node tiers helps optimize the cost. Teams with inconsistent needs may want to decrease their analytics node tier below that of their operational nodes. The ability to set a lower tier gives you flexibility and cost savings when you have fewer users or analytics are not your top priority. With analytics node tiers, you get more discretion and control over how you manage your analytics workloads by choosing the appropriately sized tier for your analytics needs. Get started today by setting up a new cluster or adding an analytics node tier to any existing cluster. Check out our documentation to learn more.
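As a brief illustration, an application can direct analytical queries at the analytics nodes by using a secondary read preference with a node-type tag. The sketch below uses PyMongo; the URI is a placeholder, and the nodeType:ANALYTICS tag follows Atlas conventions at the time of writing, so verify it against the current Atlas documentation before relying on it.

```python
# Hedged sketch: routing analytical reads to analytics nodes via replica set tags.
# The URI is a placeholder; the nodeType:ANALYTICS tag is the Atlas convention at
# the time of writing -- confirm against your cluster's configuration.
from pymongo import MongoClient
from pymongo.read_preferences import Secondary

client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")  # placeholder

analytics_orders = client["sales"].get_collection(
    "orders",
    read_preference=Secondary(tag_sets=[{"nodeType": "ANALYTICS"}]),
)

# Long-running aggregations hit the analytics node tier, leaving the
# operational nodes free to serve transactional traffic.
pipeline = [{"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}}]
for row in analytics_orders.aggregate(pipeline):
    print(row)
```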

August 3, 2022
Updates

Video: Canva's Lessons From Scaling MongoDB Atlas to 10 Billion Documents Across 100 Nodes

Running complex, global, and mission-critical infrastructure at scale is difficult, and anyone who has done it for any length of time usually has a few gnarly lessons to share. At MongoDB World in June 2022, we were lucky enough to feature someone who had done just that. Michael Pearson , software engineering team lead at Canva , gave a talk titled “10 Billion Documents: How Canva Scaled MongoDB to 100 Nodes.” I’ve had the pleasure of working alongside Pearson and his team for almost a year now, and his presentation focused on some of the massive challenges (and lessons) they’ve faced over the last two years as they have scaled into tens of terabytes of data and tens of billions of documents. I’m writing this blog to give a few highlights, but I’d recommend everyone check the original talk in full: A tricky problem For the uninitiated, Canva is a visual communication platform that empowers its users to design anything and publish anywhere. Or, as Pearson explained in his talk, “Canva is a really simple way to create beautiful designs and presentations.” Canva’s mission is to empower the world to design, and more than 85 million people in over 190 countries use the platform every month. As you can imagine, this presents a huge data challenge — and opportunity. Canva holds more than 10 billion designs and receives up to 30,000 document requests per second. The success of the platform comes down to providing a fantastic user experience every time, and to do that they need to present their customers with the right data at the right time. “This could be a really tricky problem for a database platform, particularly for a business based in Sydney with many users on the other side of the world,” said Pearson. MongoDB Atlas supports the Document Service, which enables opening, creating, updating, or deleting any design on Canva. The Document Service is critical for every single user — if the Document Service is down, then Canva’s users can’t design. But before we get too far into things, we should probably start with why Canva started using MongoDB in the first place. Flexibility to support rapidly changing requirements Michael Pearson, software engineering team lead at Canva. “Canva was launched to the world back in 2013, when MongoDB was very new to the scene,” explains Pearson. “I'm not sure if there were any other databases that would have been up for the challenge.” From those earliest days, MongoDB's flexible document model was the perfect fit for Canva's increasingly complex designs and document types. “The flexibility that MongoDB gave us in those early days was instrumental to our success. As the Canva platform evolved, we were throwing new schema and new features at it. MongoDB would just handle it.” Its continued innovation and problem-solving means MongoDB remains as valuable to us today as it was in 2012. Michael Pearson, software engineering team lead at Canva At the same time, it was essential that Canva’s engineering team was focused on building Canva, rather than time spent managing the data platform. With that in mind, Canva chose to run MongoDB as a service. After trying out multiple options, they went with MongoLabs and, in 2019, following MongoDB's acquisition of MongoLabs, Canva migrated onto MongoDB Atlas , running on AWS, where they remain to this day. Ten years of relative bliss “Before 2021, we had a very hands-off approach to how we used MongoDB,” said Pearson. “MongoDB just handled it. We didn't have to think about it at all." That's incredible, right? 
Think about it — for nearly a decade the team barely had to think about their data layer and could spend their time working on new features and making the actual product better for its millions of users around the world. It's what every developer wants. Eventually, though, Canva’s own success created certain challenges around scaling. With the stratospheric increase in growth, the load on the Document Service also continued to increase. MongoDB’s ability to scale horizontally through the use of sharding was critical to overcoming initial scale challenges, something that traditional database management systems would have struggled to achieve, said Pearson. With MongoDB, sharding is distributed or partitioned across multiple machines — useful when no single machine can handle large workloads. In due course, though, some attributes of Canva’s workload presented a new challenge. Said Pearson: “We were unique in that we have one cluster with one collection with a million chunks. Our documents are fairly large, given our product has evolved over the years and we put more and more stuff into our documents.” Or, Canva does many updates to relatively large documents, and by mid-2021 the surge in traffic was causing issues. “Our peak traffic caused three main problems: inability to run the balancer, latency issues, and a disk usage pretty much at capacity,” Pearson explained. “A really ineffective cache caused a really high write load to our cluster. This was causing downstream failures." Pearson discussed some of the tactical solutions the company took. “Disabling the balancer immediately brought us back to service, but now we knew that there was something wrong with that cluster and we couldn’t operate without the balancer,” said Pearson. “We also noticed that the number of chunks in our cluster had skyrocketed, from around 400,000 to just over a million.” Getting to the root of the problem The professional services team at MongoDB discovered that “metadata updates were causing anywhere from a one-second to five-minute stalls in the cluster.” Going from 400,000 chunks to a million chunks, at the rate of a minute of each change, was not optimal. There were three things to address with that cluster: reduce the number of chunks, reduce that disk contention, and reduce the size of documents. “With regard to reducing the number of chunks, we just took any contiguous chunks on a shard and merged them unconditionally,” said Pearson. “This was tooling built in collaboration with MongoDB.” After three months of merging chunks, Canva saw massive improvements in its cluster’s performance. A failure rate during reboot of around 4% dwindled to less than 1% during maintenance operations. Further, to address latency spikes and full-disk capacity, the team formulated a six-step plan to move from network-based storage volumes to locally attached disks. This has proved a huge success. “We were able to run the balancer. Our large spikes in latency were pretty much all gone, and our disk usage was almost at zero,” Pearson said. He continued: "The key takeaway for me is that sharding is great, but it's never a silver bullet. I don't think we would have caught these issues so quickly without such a thorough incident review process and such a close working relationship with MongoDB." What was learned? After presenting all of that information, Pearson closed out the presentation with a few key lessons. 
For anyone interested in running infrastructure at a massive scale, they are simple and worth taking note of: Take advantage of the flexible document model to accelerate your pace of development. Ensure chunks are distributed uniformly across the cluster in a consistent size. Maintain a thorough incident review process and include your trusted partners (such as MongoDB). Reliability is an essential part of Canva’s engineering practice, and prolonged service disruptions were particularly upsetting not only for engineers but for Canva’s global users. Pearson is glad to report that Canva has seen a turnaround in the number of incidents impacting its Document Service. This has freed the document team to shift focus back to shipping features and ensuring every user has a flawless experience using Canva. Interested in joining Canva as it pursues its mission to empower the world to design? Canva is looking for a software engineer to join its Core Data team. Want to take advantage of the flexible document model to accelerate your pace of development? Learn more about MongoDB Atlas .
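For readers curious about the chunk-merging tactic Pearson described, the underlying primitive in a sharded MongoDB cluster is the mergeChunks administrative command, which combines contiguous chunks that live on the same shard. The sketch below is a rough illustration only, not Canva's actual tooling; the namespace, shard key, and bounds are hypothetical, and the command must be run against a mongos with appropriate privileges.

```python
# Rough illustration only -- not Canva's tooling. Merging two contiguous chunks
# on a shard uses the mergeChunks admin command against mongos; the namespace,
# shard key, and bounds below are hypothetical and must align with real chunk edges.
from pymongo import MongoClient
from bson.min_key import MinKey

client = MongoClient("mongodb://mongos.example.net:27017")  # placeholder mongos URI

client.admin.command({
    "mergeChunks": "designs.documents",                 # hypothetical sharded namespace
    "bounds": [{"docId": MinKey()}, {"docId": 5000}],   # hypothetical chunk boundaries
})
```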

August 2, 2022
Applied

Tools for Implementing Zero Trust Security With MongoDB

The practice of protecting IT environments from unauthorized access used to be centered on perimeter security — the strategy of securing the perimeter but allowing unrestricted access inside it. As users became increasingly mobile and IT assets became increasingly dispersed, however, the notion of a network perimeter became obsolete. That strategy has now been replaced by the concept of zero trust. In a zero trust environment, the perimeter is assumed to have been breached. There are no trusted users, and no user or device gains trust simply because of its physical or network location. Every user, device, and connection must be continually verified and audited.

MongoDB offers several tools and features for integrating our products into a zero trust environment, including:

Security by default
Multiple forms of authentication
TLS and SSL encryption
X.509 security certificates
Role-based access control (RBAC)
Database authentication logs
Encryption for data at rest, in flight, and in use

For government customers, MongoDB Atlas for Government is FedRAMP-ready.

Security by default

MongoDB Atlas clusters do not allow for any connectivity to the internet when they’re first spun up. Each dedicated MongoDB Atlas cluster is deployed in a unique virtual private cloud (VPC) configured to prohibit inbound access. (Free and shared clusters do not support VPCs.) The only way to access these clusters is through the MongoDB Atlas interface. Users can configure IP access lists to allow certain addresses to attempt to authenticate to the database. Without being included on such a list, application servers are unable to access the database. Even the person who sets up the clusters needs to add their IP address to the access list. To find out more about the security measures that protect our cloud-based database, MongoDB Atlas, and the rules governing employee access, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment.

Authentication

Customers have several options to allow users to authenticate themselves to a database, including a username and password, LDAP proxy authentication, and Kerberos authentication. All forms of MongoDB support transport layer security (TLS) and SCRAM authentication. They are turned on by default and cannot be disabled. Traffic from clients to MongoDB Atlas is authenticated and encrypted in transit, and traffic between a customer’s internally managed MongoDB nodes is also authenticated and encrypted in transit using TLS.

For passwordless authentication, MongoDB offers two different options to support the use of X.509 certificates. The first option, called “easy,” auto-generates the certificates needed to authenticate database users. The “advanced” option is for organizations already using X.509 certificates and that already have a certificate management infrastructure. The advanced option can be combined with LDAPS for authorization. Access infrastructure can only be reached via bastion hosts and by users for whom senior management has approved backend access. These hosts require multifactor authentication and are configured to require SSH keys — not passwords.

Logging and auditing

MongoDB supports a wide variety of auditing strategies, making it easier to monitor your zero trust environment to ensure that it remains in force and encompasses your database. Administrators can configure MongoDB to log all actions or apply filters to capture only specific events, users, or roles.
Role-based auditing lets you log and report activities by specific role, such as userAdmin or dbAdmin, coupled with any roles inherited by each user, rather than having to extract activity for each individual administrator. This approach makes it easier for organizations to enforce end-to-end operational control and maintain the insight necessary for compliance and reporting. The audit log can be written to multiple destinations in a variety of formats, such as to the console and syslog (in JSON) and to a file (JSON or BSON). It can then be loaded to MongoDB and analyzed to identify relevant events. Encryption MongoDB also lets you encrypt data in flight, at rest, or even, with field-level encryption and queryable encryption , in use. For data in motion, all versions of MongoDB support TLS and SSL encryption. For data at rest, MongoDB supports AES-256 encryption, and it can also be configured for FIPS compliance. To encrypt data when it is in use, MongoDB offers client-side field-level encryption , which can be implemented to safeguard data even from database administrators and vendors who otherwise would have access to it. Securing data with client-side field-level encryption allows you to move to managed services in the cloud with greater confidence. The database only works with encrypted fields, and organizations control their own encryption keys, rather than having the database provider manage them. This additional layer of security enforces an even more fine-grained separation of duties between those who use the database and those who administer and manage it. MongoDB Atlas exclusively offers queryable encryption, which allows customers to run rich expressive queries on fully randomized encrypted data with efficiency, improving both the development process and user experience. Organizations are able to protect their business by confidently storing sensitive data and meeting compliance requirements. Zero trust and MongoDB MongoDB is optimally suited for use within a zero trust environment. MongoDB is secure by default and has developed industry-leading capabilities in key areas such as access, authorization, and encryption. Used together, these features help protect the database from outside attackers and internal users who otherwise could gain an unauthorized level of access. For more detailed information about security features in MongoDB, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment .
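To make the client-side field level encryption capability described above more concrete, here is a hedged sketch using PyMongo's explicit encryption API with a local master key. A production deployment would normally use a cloud KMS and automatic encryption; the database, collection, and key names here are illustrative, and the pymongo[encryption] extra must be installed.

```python
# Hedged sketch of client-side field level encryption (explicit mode) with a
# local master key. Production deployments would typically use a cloud KMS and
# automatic encryption; names and the URI here are illustrative. Requires the
# "pymongo[encryption]" extra (pymongocrypt).
import os
from pymongo import MongoClient
from pymongo.encryption import Algorithm, ClientEncryption
from bson.binary import STANDARD
from bson.codec_options import CodecOptions

client = MongoClient("mongodb://localhost:27017")  # placeholder URI

kms_providers = {"local": {"key": os.urandom(96)}}  # demo key; persist a real one securely
key_vault_namespace = "encryption.__keyVault"

client_encryption = ClientEncryption(
    kms_providers,
    key_vault_namespace,
    client,
    CodecOptions(uuid_representation=STANDARD),
)

# Create a data encryption key, then encrypt a sensitive field before insert.
key_id = client_encryption.create_data_key("local", key_alt_names=["demo-key"])
encrypted_ssn = client_encryption.encrypt(
    "123-45-6789",
    Algorithm.AEAD_AES_256_CBC_HMAC_SHA_512_Deterministic,
    key_id=key_id,
)

client["hr"]["employees"].insert_one({"name": "Alice", "ssn": encrypted_ssn})
# The server only ever sees ciphertext for the "ssn" field.
```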

August 2, 2022
Applied

Integrating Support for Non-Volatile Memory Into WiredTiger

Intel Optane DC Persistent Memory is a non-volatile memory (NVRAM) product that resembles both storage and memory and can be used as either. Like storage, Optane NVRAM retains data after a crash or power outage. Like memory, it sits on the memory bus and can be accessed by CPU using load/store instructions. In certain scenarios, its access latency even approaches that of dynamic random access memory (DRAM). At MongoDB, we have been thinking about how to use NVRAM in the storage engine. It can be seen as an extension of volatile DRAM, but a denser and a cheaper one. In pursuit of this goal, we extended our storage engine, WiredTiger , with a volatile NVRAM cache that retains frequently used file blocks. In this article, we share our experience, describe the lessons learned, and evaluate the costs and benefits of this approach. How to use NVRAM in the storage stack Optane NVRAM can act as both storage and memory. The persistent memory fabric itself can be packaged as a solid-state drive (SSD), as in Optane SSDs, or as a dual-inline memory module (DIMM) that looks almost like its DRAM counterpart and lives in the same type of slot on the motherboard. Even when NVRAM is packaged as a non-volatile DIMM (NVDIMM), we can ask the operating system to present it as a block device, put a file system on top, and use it just like regular storage. Broadly speaking, there are three ways to use NVRAM: As regular storage As persistent memory As an extension to volatile memory NVRAM as storage Using NVRAM as regular storage can deliver superior throughput (compared to SSD) for read-dominant workloads, but this approach hinders write-dominant workloads because of Optane NVRAM’s limited write throughput (see the section “Performance properties of Optane NVRAM”). In any case, both the price and density of NVRAM are closer to those of DRAM than to those of SSD, so using it as storage is not recommended. NVRAM as persistent memory Imagine that all your data structures live in memory and that you never have to worry about saving them to files. They are just there, even after you quit your application or if it suffers a crash. Although this setup sounds simple, in practice, it is still challenging to program for this model. If your system crashes and you would like to be able to find your data after restart, you need to name it. A variable name is not sufficient, because it is not unique; thus, you have to restructure your code to make sure your data has persistent identifiers. Persistent Memory Development Kit (PMDK) provides APIs for that. A more difficult problem is surviving a crash. Your program may crash in the middle of a logical operation on a data structure. For example, suppose you are inserting an item into a linked list, and you have set the source pointer, but the crash occurs before setting the destination pointer. Upon restart, you’ll end up with corrupted data. To make matters worse, even if the logical operation had completed before the crash, the data might have been written only to CPU caches but not persisted to the memory itself . One solution is to wrap memory operations in transactions; however, programming transactional memory is notoriously difficult. Another solution is to use prepackaged data structures and APIs, but if you are looking to create your own highly optimized data structures, you must implement your own logging and recovery or other mechanisms that protect your data similarly to transactions. 
NVRAM as an extension of volatile memory Somewhat counterintuitively, this option involves disregarding the persistence of NVRAM and using it as a volatile extension of DRAM. Why would you want to do that? Suppose you have a fixed budget to buy extra memory for your system. You can either afford N GB of DRAM or about M*N GB of NVRAM — that’s because NVRAM is denser and cheaper per byte than DRAM (about three times cheaper, at the time of writing). Depending on your application, you might be better off in terms of performance/$$ if you buy additional NVRAM, as opposed to DRAM. In support of this use case, Intel provides a hardware mechanism, called Memory Mode, which treats NVRAM as “regular” system memory and uses DRAM as its cache. In other words, the hardware will do its best to place frequently used data structures in DRAM, and the rest will reside in NVRAM. The beauty of this mechanism is that it requires absolutely no changes to applications. The downside is that it may perform worse than a custom solution for certain workloads (see section “How NVCache affects performance”). Our solution is a custom-built volatile cache that resides in NVRAM. Our architecture Our NVRAM cache (or NVCache) is a component of the MongoDB storage engine WiredTiger. For persistent storage, WiredTiger organizes data into blocks, where keys and values are efficiently encoded and (optionally) compressed and encrypted. For fast query of its B+-tree data structure, WiredTiger transforms blocks into pages, where keys/values are decoded and indexed. It keeps pages in its DRAM page cache. Figure 1.  The architecture of NVCache. Figure 1 shows the architecture of NVCache. NVCache is the new component, and the rest are part of WiredTiger. NVCache sits next to the block manager, which is the code responsible for reading/writing the data from/to persistent storage. Let’s look at each path in turn. Read path: If the page cache cannot locate the searched-for data, it issues a read to the block manager (1). The block manager checks whether the block is present in NVCache (2), accesses it from NVCache if it is (3), and reads it from disk if it is not (4). The block manager then transforms the block into a page, decrypting and decompressing it if needed, and then hands it over to the page cache (5). It also notifies NVCache that it has read a new block, and NVCache then has the discretion to accept it (6). NVCache stores the blocks in the same format as they are stored on disk (e.g., compressed or encrypted if those configuration options were chosen). Write path: The write path differs from the read path in that WiredTiger does not modify disk blocks in place. It writes updates into in-memory data structures and then converts them into new pages, which would be sent to disk either during eviction from the page cache or during a checkpoint (7). When the block manager receives a new page, it converts it into a new block, writes the block to storage (8), and informs NVCache (9). NVCache then has the discretion to accept it. Obsolete blocks are eventually freed, at which time the block manager instructs NVCache to invalidate cached copies (10). To avoid running out of space, NVCache periodically evicts less-used blocks. The eviction thread runs once a second. Overall, this design is straightforward, but making it performant was a challenge. As expected with brand new storage or memory devices, the software must cater to their unique performance properties. 
In the next section, we focus on these performance properties and explain how we adapted our cache to play along.

Performance properties of Optane NVRAM

In low-bandwidth scenarios, the access latency of Optane NVRAM approaches that of DRAM. A small read takes about 160 to 300 nanoseconds, depending on whether it is part of a sequential or a random access pattern [1]; a read from DRAM takes about 90 nanoseconds [3]. Small writes are as fast as in DRAM [3] because the data only has to reach the memory controller, where it will be automatically persisted in case of a power loss.

In high-bandwidth scenarios, we usually look at throughput. Sequential read throughput is about 6 GB/s for a single NVDIMM [1, 2] and scales linearly as you add more memory modules. (A single 2nd Generation Intel Xeon Scalable processor can support up to six NVDIMMs.) The write throughput is more limited: We observed up to 0.6 GB/s on a single NVDIMM [2], and others observed up to 2.3 GB/s [3]. Again, if your workload writes to different NVDIMMs, the throughput will scale with the number of modules in your system.

A somewhat troublesome observation about write throughput is that it scales negatively as you add more threads. Write throughput peaks at one or two concurrent threads and then drops as more threads are added [2, 3]. More importantly, we were surprised to find that, on Optane NVRAM, the presence of writers disproportionately affects the throughput of readers.

Figure 2. Read throughput in the presence of concurrent writer threads.

Figure 2 shows how the throughput of eight reader threads drops as more concurrent writers are added. Although this effect is present on both DRAM and NVRAM (and certainly on other storage devices), on Optane NVRAM it is much more pronounced: Performance of reads will suffer in the presence of writes. This important observation drove the design of our NVCache.

Throttling writes in caches for Optane NVRAM

For a cache to be useful, it must contain popular data. The duties of admitting fresh data and expunging the old fall on cache admission and eviction policies, respectively. Both admission and eviction generate writes, and, because writes hurt the performance of reads on Optane, admission and eviction will interfere with the performance of cache retrievals (which involve reads). Thus, we have a trade-off: On one hand, admission and eviction are crucial to making the cache useful. On the other hand, the write operations they generate will hamper the performance of data retrievals, making the cache less performant.

To resolve this tension, we introduced the Overhead Bypass (OBP) metric, which is the ratio of writes to reads applied to the cache. Keeping this ratio under a threshold allowed us to limit the overhead of writes:

OBP = (blocks_inserted + blocks_deleted) / blocks_looked_up

Intuitively, blocks_looked_up correlates with the benefit of using the cache, whereas the sum of blocks_inserted and blocks_deleted correlates with the cost. NVCache throttles admission and eviction to keep this ratio under 10%. (Our source code is available in the WiredTiger public GitHub repository.)
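To make the OBP throttle easier to picture, here is a simplified, illustrative sketch in Python. It is not the WiredTiger C implementation (which lives in the repository just mentioned); it only mirrors the counters and the 10% threshold described in this section.

```python
# Simplified, illustrative sketch of the Overhead Bypass (OBP) idea -- not the
# actual WiredTiger C implementation. Counter names and the 10% threshold follow
# the description in the text.
class NVCacheAdmission:
    OBP_THRESHOLD = 0.10  # keep (inserts + deletes) / lookups under 10%

    def __init__(self):
        self.blocks_inserted = 0
        self.blocks_deleted = 0
        self.blocks_looked_up = 0

    def record_lookup(self):
        self.blocks_looked_up += 1

    def _obp(self):
        if self.blocks_looked_up == 0:
            return float("inf")  # no lookups yet: any write is pure overhead
        return (self.blocks_inserted + self.blocks_deleted) / self.blocks_looked_up

    def may_insert(self):
        # Admit a new block only while write overhead stays under the threshold.
        if self._obp() < self.OBP_THRESHOLD:
            self.blocks_inserted += 1
            return True
        return False

    def may_evict(self):
        # Eviction also writes to NVRAM, so it is throttled by the same ratio.
        if self._obp() < self.OBP_THRESHOLD:
            self.blocks_deleted += 1
            return True
        return False
```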
Without OBP, the sheer overhead of data admission and eviction was quite substantial. To measure this overhead in its purest form, we experimented with workloads that do not stand to benefit from any extra caching, such as those with small datasets that fit into the OS buffer cache (in DRAM) or those that perform so many writes that they quickly invalidate any cached data. We found that using NVCache without the OBP feature caused these workloads to run up to two times slower than without the cache. Introducing the OBP completely eliminated the overhead and enabled the workloads that stand to benefit from extra caching to enjoy better performance.

How NVCache affects performance

In this section, we’ll look in detail at the performance of workloads with large datasets that stand to benefit from an additional cache.

Experimental system: The following experiments were performed on a Lenovo ThinkSystem SR360 with two Intel Xeon Gold 5218 CPUs. Each CPU has 16 hyper-threaded cores. The system has two Intel Optane persistent memory modules of 126 GB each. For storage, we used an Intel Optane P4800X SSD. We configured our system with only 32 GB of DRAM to make sure that extra memory in the form of NVRAM would be called for. We present the data with the widely used YCSB benchmarks [4, 5] (Table 1), although we also performed analysis with our in-house benchmarks and reached similar conclusions.

Table 1. Characteristics of YCSB benchmarks.

The following charts compare the throughput of YCSB with NVCache, with Intel Memory Mode (MM), and with OpenCAS [6], a kernel implementation of an NVRAM-resident cache from Intel. OpenCAS was configured in write-around mode, which was the best option for limiting the harmful effect of writes [7]. Figures 3a-c show the data in configurations using 63 GB, 126 GB, and 252 GB of NVRAM, respectively.

Figure 3. Throughput of YCSB under Memory Mode (MM), OpenCAS, and NVCache relative to running with DRAM only.

We make the following three observations:

OpenCAS delivers no performance benefit from extra NVRAM. It achieves a similar or better read hit rate than NVCache but also makes two orders of magnitude more writes to NVRAM, probably because it does not throttle the rate of admission. Writes interfere with the performance of reads, which is probably why this cache delivers no performance benefits.

When the dataset size exceeds NVRAM capacity, NVCache provides substantially better performance than Memory Mode. As shown in Figure 3a, NVCache outperforms Memory Mode by between 30% (for YCSB-B) and 169% (for YCSB-C). Furthermore, Memory Mode hurts YCSB-A’s update throughput by about 18% relative to the DRAM-only baseline, while NVCache does not.

Memory Mode performs comparably to NVCache when NVRAM is ample. With 252 GB of NVRAM, all datasets comfortably fit into the NVRAM. Two factors explain why NVCache loses its edge over MM with ample NVRAM: (1) For NVCache, the marginal utility of additional NVRAM is small after 126 GB; the NVCache hit rate grows by about 20% when we increase NVRAM size from 63 GB to 126 GB, but only by another 5% if we increase it from 126 GB to 252 GB. (2) While MM allows the kernel buffer cache to expand into NVRAM, NVCache confines it to DRAM, which is also used by WiredTiger’s page cache. Contention for DRAM limits performance.

Overall, the benefit of a custom NVRAM cache solution is that it provides better performance than Memory Mode for large workloads. The disadvantage is that it requires new software, whereas MM can be used without any changes to applications.

Performance and cost

In this section, we explore the trade-offs of using Optane NVRAM as a volatile extension of DRAM versus just using more DRAM. To that end, we take a fixed memory budget of 96 GB and vary the fraction satisfied by DRAM and NVRAM, as shown in Table 2.
Table 2. Budget of memory configurations containing both DRAM and NVRAM relative to DRAM-only. We use an NVRAM-to-DRAM price ratio of 0.38 [8].

Figure 4. Performance per dollar as the amount of NVRAM increases and the amount of DRAM decreases (in YCSB workloads).

Figure 4 shows the performance of YCSB under these configurations, normalized to using 96 GB of DRAM and divided by the cost ratio in column 3. In other words, these are performance/$ numbers relative to the DRAM-only configuration. In these experiments, we used only NVCache to manage NVRAM, as it performed comparably to or better than the other options. Positive numbers mean that the performance decreased less than the memory cost.

Read-only or read-mostly workloads that benefit from the NVCache experience a positive gain, as expected. Although in most cases performance predictably drops as the amount of DRAM decreases, YCSB-C in the configuration with 64 GB NVRAM and 32 GB DRAM performs better than it does with 96 GB DRAM, so we decrease the system cost and improve performance in absolute terms. This occurs because beyond 32 GB of DRAM, the utility of additional memory (and a larger page cache) is considerably smaller than the loss in performance due to a smaller NVCache. YCSB-A, whose write intensity prevents it from deriving benefit from any additional caching, suffers an overall loss in terms of performance/$. Its performance drops at a steeper rate than the memory cost as we decrease the amount of DRAM.

We conclude that NVRAM is a cost-effective method of reducing memory cost while balancing the impact on performance for read-dominant workloads. At the same time, even a modest presence of writes can render NVRAM unprofitable relative to DRAM.

References

[1] J. Izraelevitz, et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv:1903.05714.
[2] Sasha Fedorova. We Replaced an SSD with Storage Class Memory. Here is What We Learned. The MongoDB Engineering Journal.
[3] Jian Yang, et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. USENIX Conference on File and Storage Technologies (FAST 2020).
[4] Yahoo! Cloud Serving Benchmark, Git Repo.
[5] B.F. Cooper, et al. Benchmarking Cloud Serving Systems with YCSB. SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing.
[6] Open Cache Acceleration Software.
[7] Open CAS Linux — Admin Guide.
[8] H.T. Kassa, et al. Improving Performance of Flash Based Key-value Stores Using Storage Class Memory as a Volatile Memory Extension. USENIX Annual Technical Conference (USENIX ATC 21).

July 25, 2022
Engineering Blog

Connected Data: How IoT Will Save Healthcare and Why MongoDB Matters

Over the next decade, healthcare systems around the world will face a two-fold challenge: Delivering higher quality care while managing rising costs, and doing so for increasingly larger populations of patients. For decades, healthcare systems have operated predominantly with traditional fee-for-service models, in which reimbursements are given to providers based on services rendered. Value-based healthcare, in contrast, attempts to lower the cost of care by keeping patients healthier longer through more effective and efficient use of healthcare systems. This article — Part 2 of our series on connected healthcare data — looks at how IoT, with support from MongoDB, can help meet future healthcare challenges. Read Part 1 of this series on connected healthcare data Increased demand It's expected that by 2050, 22% of the world's population will be over 60 years old . This adds increased pressure to the goals of optimizing both patient outcomes and healthcare spend, because there are more people within healthcare systems than ever before. And, as these patients live longer, they experience more chronic conditions and, therefore, require more care. Constraints on the ability to graduate enough doctors and nurses to meet this surge of healthcare demand suggest that innovation will be needed to provide adequate supply. Additionally, many healthcare services are delivered in an exam or hospital room, where patient vitals and observations are captured, a chart is reviewed, and medications and treatments are ordered. According to a recent study from the Annals of Internal Medicine , providers spend more than 16 minutes per encounter on these tasks alone. Observation and data collection in healthcare is critical to identifying and subsequently adjusting treatment pathways; however, the process is heavily reliant on in-person visits. How IoT will save healthcare Global adoption of the Internet of Things (IoT) is soaring across numerous industries. In fact, healthcare is forecasted to be the second largest industry in value for IoT by 2030. IoT offers the ability to remotely monitor patients via wearables and connected devices. It provides the means to collect data beyond the patient exam or hospital room and can help providers deliver care outside of traditional, in-person methods. With this power to collect more information, more often, and do so with fewer patient encounters, IoT plays a role in solving the two-fold challenge of delivering better quality of care for increasingly larger populations of patients. A patient wearing a smartwatch, for example, may be able to stream heart rate and oxygen saturation levels during real-world activities to an electronic healthcare record, where the data can be aggregated and summarized for a physician to review, or even for a machine-learning algorithm to periodically interrogate. IoT devices can help collect more data, more often, to help providers deliver more meaningful, timely, and impactful healthcare recommendations and treatments to patients. Through this added value, IoT can further the benefits of telemedicine and promote the idea of “care anywhere,” in which healthcare is not directly tied to or dependent upon in-person encounters. Challenges of healthcare data on the move What challenges face developers when it comes to capturing and leveraging data from healthcare IoT devices? 
Four significant capabilities top the list, which we will look at in turn: Scalable and efficient storage Global coverage and data synchronization Interoperability Security and privacy Scalable and efficient storage IoT devices have the capability to produce massive volumes of continuous data. In fact, market intelligence provider International Data Corporation (IDC) predicts that IoT devices alone will produce 74.9 ZB of data by 2025, from a staggering 55.9 billion devices. A cloud-based developer data platform will be critical to support these kinds of massive data volumes, which may also exhibit unpredictable peaks in workloads. Additionally, as is the case for many IoT use cases, often only the most recent data is used for analysis. In this scenario, the ability to automatically archive raw and historical data to a more cost-effective storage, and yet still be able to query it when and if needed, would be ideal. MongoDB’s Atlas Online Archive lets developers do just that, with minimal setup and configuration required, as shown in Figure 1. Figure 1. MongoDB automates data tiering while keeping it queryable with Atlas Online Archive. Not all databases are ready to deal with the massive, continuous data generated by IoT devices. Sensor data is typically collected with high frequency, which may mean high concurrency of writes, unpredictable workload peaks, and the need for dynamic scalability. Additionally, IoT data is almost by definition time-series data, meaning it typically comes with a timestamp that allows following the evolution of a parameter through time, at regular or irregular time intervals. Storing time-series data efficiently at scale can be difficult. In fact, specialized time-series databases exist to tackle workloads such as these. Additionally, storing the data is simply one side of the challenge. Another aspect involves running analytics as the data is collected, such as discovering heart anomalies and sending alerts in real time to the patient. Using specialized time-series databases solves these challenges but also introduces new ones: Developers will need to learn the nuances of working with a niche platform, slowing development cycles. Building and maintaining ETL pipelines to move data and merge data across different platforms. Integrating, securing, and maintaining an additional database platform, thereby increasing operational overhead. MongoDB's new time series collection feature allows you to automatically optimize your schema and deployment for high storage efficiency and low-latency queries, without the need of an additional, niche database. Additionally, MongoDB integrates time-series data with operational data and analytics capabilities in one unified environment with built-in scalability, delivering the performance your IoT applications need while simplifying your architecture. Global coverage and data synchronization For many IoT scenarios, users are effectively on the move: They go to work, they go shopping, and they get on planes to see the new beautiful shiny star on top of Barcelona's Sagrada Família. With all of this mobility, they might lose connectivity for a few minutes or even hours. Tracking their health effectively in real time is not just a nice feature, it may be mandatory. Using MongoDB’s Atlas Device Sync , developers can easily deploy IoT applications that seamlessly handle drops in connectivity, without missing critical write operations of the most important data workloads. 
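Returning to the storage challenge above, here is a hedged sketch of creating a time series collection for wearable vitals with PyMongo. The database, collection, and field names are illustrative assumptions, and the feature requires MongoDB 5.0 or later.

```python
# Hedged sketch: a time series collection for wearable vitals. Database,
# collection, and field names are illustrative, not from the article.
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # placeholder URI
db = client["healthcare"]

# Requires MongoDB 5.0+; extra kwargs are passed through to the create command.
db.create_collection(
    "vitals",
    timeseries={
        "timeField": "ts",        # timestamp of each measurement
        "metaField": "patient",   # per-series metadata (patient/device)
        "granularity": "seconds",
    },
    expireAfterSeconds=60 * 60 * 24 * 90,  # age out raw readings after ~90 days
)

db["vitals"].insert_one({
    "ts": datetime.now(timezone.utc),
    "patient": {"id": "P-001", "device": "watch-42"},
    "heartRate": 72,
    "spo2": 0.98,
})
```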
Interoperability

Most IoT devices use proprietary protocols and operating systems, which seriously limit interoperability. The IoT industry advocates the use of standard communication protocols such as MQTT but, as of this writing, there is no single industry standard. Custom solutions exist that serve a single type of sensor and/or healthcare provider, but these solutions tend to suffer from interoperability challenges when interlinking data across different healthcare networks. As discussed in our first post, sharing healthcare data across different participants of the healthcare ecosystem requires standards such as JSON-based FHIR, which is key to mitigating healthcare fragmentation.

Learn how we used MongoDB and MQTT to "listen" and "talk" remotely to an IoT-powered facility. Downloadable code available.

Security and privacy

Given its sensitive and personal nature (and relatively easy monetization through theft), health data is especially appealing to bad actors. The number of security incidents impacting healthcare systems is sobering. According to a report by Crowdstrike, 82% of health systems experienced some form of IoT cyberattack in 2020. With IoT proliferation on the rise, the need for the highest level of security at both the application and the database level becomes non-negotiable. Unsurprisingly, McKinsey cites interoperability, security, and privacy as major headwinds for IoT adoption, especially in healthcare.

How MongoDB supports IoT challenges

Here's how MongoDB helps developers bring IoT applications to market faster, across each of the four challenges:

Scalability and efficient storage: High availability and scalability are built in via replication and native sharding. Online Archive automatically archives aged data to fully managed cloud object storage, so you can optimize cost and performance without sacrificing data accessibility. Time series collections automatically optimize your schema for high storage efficiency, low-latency queries, and real-time analytics.

Global coverage and data synchronization: MongoDB Atlas is a global, multi-cloud platform that lets your apps run anywhere in the world. Atlas Device Sync solves conflict resolution and keeps your data up to date across devices, users, and your backend, regardless of connectivity.

Interoperability: The document model provides a flexible schema and maps exactly to the objects that developers work with in their code. Different industry communication standards are being built over JSON, such as FHIR, which is a natural fit for MongoDB's document model.

Security and privacy: Thanks to MongoDB Client-side Field Level Encryption, data is encrypted in motion, in memory, and at rest. Queryable Encryption allows running expressive queries on fully randomized encrypted data. MongoDB provides the strongest levels of data privacy and security for regulated workloads.

MongoDB Atlas takes care of the backend, removing friction from the development process and simplifying your technology stack, so you can focus on building differentiating features for your applications. Atlas is a developer data platform that supports a broad array of use cases, spanning operational, transactional, and analytical workloads.
Atlas also offers the following features: Ability to service more loads of the data lifecycle: Enabling development teams to seamlessly analyze, transform, and move data while reducing reliance on batch processes or ETL jobs Built on a modern data model: Aligning to the way developers think and code Integrated: Delivering an elegant developer experience Figure 2. Atlas is a developer data platform built on three pillars: the document model, a unified interface for different data use cases, and a multi-cloud, enterprise-ready foundation. MongoDB for IoT-powered healthcare apps IoT and specifically wearables will play a major role in solving the two-fold challenge of delivering better quality care for increasingly larger populations of patients. The soaring adoption of wearables is accelerating the need for a developer data platform that helps software delivery teams build and manage health applications with: Scalable and efficient storage Global coverage and data synchronization Interoperability Security and privacy MongoDB Atlas is a developer data platform designed to manage the heavy lifting for you, by providing an elegant developer experience and unifying a broad range of data workloads with world-class privacy and security features. Read Part 1 of this series on connected healthcare data , and learn more about MongoDB Atlas and the healthcare industry .
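As a small companion to the MQTT reference earlier in this article, here is a hedged sketch of a bridge that listens on an MQTT topic and writes readings into a collection. It is not the downloadable facility demo itself; the broker address, topic, and payload shape are assumptions, and it requires the paho-mqtt package.

```python
# Hedged sketch of an MQTT-to-MongoDB bridge, in the spirit of the facility demo
# mentioned above (not its actual code). Broker address, topic, and payload
# format are assumptions; requires the paho-mqtt package.
import json
from datetime import datetime, timezone

import paho.mqtt.client as mqtt
from pymongo import MongoClient

readings = MongoClient("mongodb://localhost:27017")["healthcare"]["vitals"]  # placeholder URI

def on_message(client, userdata, msg):
    # Assume each message is a small JSON document, e.g. {"patientId": "...", "heartRate": 72}
    payload = json.loads(msg.payload)
    payload["ts"] = datetime.now(timezone.utc)
    payload["topic"] = msg.topic
    readings.insert_one(payload)

mqttc = mqtt.Client()  # paho-mqtt 1.x style constructor
mqttc.on_message = on_message
mqttc.connect("broker.example.net", 1883)  # placeholder broker
mqttc.subscribe("clinic/+/vitals")         # placeholder topic
mqttc.loop_forever()
```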

July 25, 2022
Applied

Mobile Edge Computing, Part 1: Delivering Data Faster with Verizon 5G Edge and MongoDB

As you’ve probably heard, 5G is changing everything, and it’s unlocking new opportunities for innovators in one sector after another. By pairing the power of 5G networks with intelligent software, customers are beginning to embrace the next generation of industry, such as powering the IoT boom, enhancing smart factory operations, and more. But how can companies that are leveraging data for daily operations start using data for innovation? In this article series, we’ll explore how the speed, throughput, reliability, and responsiveness of the Verizon network, paired with the sophistication of the next generation MongoDB developer data platform, are poised to transform industries including manufacturing, agriculture, and automotive. Mobile edge computing: The basics Companies everywhere are facing a new cloud computing paradigm that combines the best experiences of hyperscaler compute and storage with the topological proximity of 5G networks. Mobile edge computing , or MEC, introduces a new mode of cloud deployments whereby enterprises can run applications — through virtual machines, containers, or Kubernetes clusters — within the 5G network itself, across both public and private networks. Before we dive in, let’s define a few key terms: What is mobile edge computing? The ability to deploy compute and storage closer to the end user What is public mobile edge computing? Compute and storage deployed with the carrier data centers What is private mobile edge computing? On-premise provisioned compute and storage Verizon 5G Edge , Verizon’s mobile edge compute portfolio, takes these concepts from theoretical to practical. By creating a unified compute mesh across both public and private networks, Verizon 5G Edge produces a seamless exchange of data and stateful workloads — a simultaneous deployment of both public and private MEC best characterized as a hybrid MEC. In this article, we’ll primarily focus on public MEC deployment. Although MEC vastly increases the flexibility of data usage by both practitioners and end users, the technology is not without its challenges, including: Deployment: Given a dynamic fleet of devices, in an environment with 20-plus edge zones across both public and private MEC, to which edge(s) should the application be deployed? Orchestration: For Day 2 operations and beyond, what set of environmental changes, — be it on the cloud, network, or on device(s) — should trigger a change to my edge environment? Edge discovery: Throughout the application lifecycle, for a given connected device, which edge(s) is the optimal endpoint for connection? Fortunately for developers, Verizon has developed a suite of network APIs tailored to answer these questions. From edge discovery and network performance to workload orchestration and network management, Verizon has drastically simplified the level of effort required to build resilient, highly available applications at the network edge without the undifferentiated heavy lifting previously required. Edge discovery API workflow Using the Verizon edge discovery API, customers can let Verizon manage the complexity of maintaining the service registry as well as identifying the optimal endpoint for a given mobile device. In other words, with the edge discovery API workflow in place of the self-implemented latency tests, a single request-response would be needed to identify the optimal endpoint, as shown in Figure 1. Figure 1.   
A single request-response is used to identify the optimal endpoint Although this API addresses challenges of service discovery, routing, and some advanced deployment scenarios, other challenges exist outside of the scope of the underlying network APIs. In the case of stateful workloads, for example, how might you manage the underlying data generated from your device fleet? Should all of the data live at the edge, or should it be replicated to the cloud? What about replication to the other edge endpoints? Using the suite of MongoDB services coupled with Verizon 5G Edge and its network APIs, we will describe popular reference architectures for data across the hybrid edge. Delivering data with MongoDB Through Verizon 5G Edge, developers can now deploy parts of their application that require low latency at the edge of 4G and 5G networks using the same APIs, tools, and functionality they use today, while seamlessly connecting back to the rest of their application and the full range of cloud services running in a cloud region. However, for many of these use cases, a persistent storage layer is required that extends beyond the native storage and database capabilities of the hyperscalers at the edge. Given the number of different edge locations where an application can be deployed and consumers can connect, ensuring that appropriate data is available at the edge is critical. It is also important to note that where consumers are mobile (e.g., vehicles), the optimal edge location can vary. At the same time, having a complete copy of the entire dataset at every edge location to cater for this scenario is neither desirable nor practical due to the potentially large volumes of data being managed and the associated multi-edge data synchronization challenges that would be introduced. The Atlas solution The solution requires having an instantaneous and comprehensive overview of the dataset stored in the cloud while synchronizing only required data to dedicated edge data stores on demand. For many cases, such as digital twin, this synchronization needs to be bi-directional and may potentially include conflict resolution logic. For others, a simpler unidirectional data sync would suffice. These requirements mean you need a next-gen data platform, equipped with all the power to simplify data management while also delivering data in an instant. MongoDB Atlas is the ideal solution for the central, cloud-based datastore. Atlas provides organizations with a fully managed, elastically scalable application data platform upon which to build modern applications. MongoDB Atlas can be simultaneously deployed across any of the three major cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) and is a natural choice to act as the central data hub in an edge or multi-edge based architecture, because it enables diverse data to be ingested, persisted, and served in ways that support a growing variety of use cases. Central to MongoDB Atlas is the MongoDB database, which combines a flexible document-based model with advanced querying and indexing capabilities. Atlas is, however, more than just the MongoDB database and includes many other components to power advanced applications with diverse data requirements, like native search capabilities, real-time analytics, BI integration, and more. Read the next post in this blog series to explore the real-world applications and innovations being powered by mobile edge computing.
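To ground the edge discovery workflow described earlier, here is an illustrative sketch of the single request-response pattern from Figure 1. The URL, parameters, and response fields are placeholders rather than the real Verizon 5G Edge Discovery Service schema, so consult Verizon's API documentation for the actual endpoint and payload.

```python
# Illustrative sketch only: the URL, parameters, and response fields are placeholders,
# NOT the real Verizon 5G Edge Discovery Service schema -- use Verizon's API docs for those.
import requests

resp = requests.get(
    "https://edge-discovery.example.verizon.com/v1/optimal-endpoint",  # placeholder URL
    params={"serviceProfile": "payments-edge", "deviceIp": "10.0.0.12"},  # hypothetical params
    headers={"Authorization": "Bearer <access-token>"},
    timeout=5,
)
resp.raise_for_status()
edge = resp.json()  # hypothetical shape: {"edgeZone": "...", "applicationUri": "..."}
print("route traffic to:", edge.get("edgeZone"), edge.get("applicationUri"))
```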

July 21, 2022
Applied

Mobile Edge Computing, Part 2: Computing in the Real World

It would be easy to conceptualize mobile edge computing (MEC) as a telecommunications-specific technology; in fact, edge computing has far-reaching implications for real-world use cases across many different industries. Any organization that requires a solution to common data usage challenges, such as low-latency data processing, cloud-to-network traffic management, Internet of Things (IoT) application development, data sovereignty, and more, can benefit from edge-based architectures. In our previous article, we discussed what mobile edge computing is, how it helps developers increase data usage flexibility, and how Verizon 5G Edge and MongoDB work in conjunction to enable data computing at the edge, as shown in Figure 1.

Figure 1. Verizon and MongoDB work in conjunction to deliver data to consumers and producers faster than ever with mobile edge computing.

In this article, we’ll look at real-world examples of how mobile edge computing is transforming the manufacturing, agriculture, and automotive industries.

Smart manufacturing

Modern industrial manufacturing processes are making greater use of connected devices to optimize production while controlling costs. Connected IoT devices exist throughout the process, from sensors on manufacturing equipment to mobile devices used by employees on the factory floor to connected vehicles transporting goods — all generating large amounts of data. For companies to realize the benefits of all this data, it is critical that the data be processed and analyzed in real time to enable rapid action. Moving this data from the devices to the cloud for processing introduces unnecessary latency and data transmission that can be avoided by processing at the edge. As seen in Figure 2, for example, sensors, devices, and other data sources in the smart factory use the Verizon 5G Edge Discovery Service to determine the optimal edge location. After that, data is sent to the edge, where it is processed before being persisted and synchronized with MongoDB Atlas — all in an instant.

Figure 2. Data sources in smart factories use the Verizon 5G Edge Discovery Service to determine the optimal edge location.

Process optimization

Through real-time processing of telemetry data, it’s possible to make automated, near-instantaneous changes to the configuration of industrial machinery in response to data relayed from a production line. Potential benefits of such a process include improved product quality, increased yield, optimized use of raw materials, and the ability to track standard key performance indicators (KPIs), such as overall equipment effectiveness (OEE).

Preventative maintenance

Similar to process optimization, real-time processing of telemetry data can enable the identification of impending machinery malfunctions before they occur and result in production downtime. More critically, if a situation has the potential either to damage equipment or to pose a danger to those working in the vicinity, the ability to automatically perform a shutdown as soon as the condition is detected is vital.
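As a rough illustration of the preventative-maintenance pattern just described, the sketch below persists a telemetry sample at an edge-side MongoDB instance and flags readings that cross a threshold. The collection name, fields, and vibration limit are assumptions made for the example, not values from the article.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

# Hypothetical edge-side MongoDB instance; names and limits are illustrative.
client = MongoClient("mongodb://localhost:27017")
readings = client["factory"]["machine_telemetry"]

VIBRATION_LIMIT_MM_S = 7.1  # assumed alarm level for this example

def ingest_and_check(machine_id: str, vibration_mm_s: float, temp_c: float) -> bool:
    """Persist a telemetry sample at the edge and flag it if it breaches the limit."""
    alert = vibration_mm_s > VIBRATION_LIMIT_MM_S
    readings.insert_one({
        "machine_id": machine_id,
        "ts": datetime.now(timezone.utc),
        "vibration_mm_s": vibration_mm_s,
        "temp_c": temp_c,
        "alert": alert,
    })
    if alert:
        # In a real deployment, this is where an automated shutdown or
        # maintenance ticket would be triggered.
        print(f"ALERT: {machine_id} vibration {vibration_mm_s} mm/s over limit")
    return alert

ingest_and_check("press-07", vibration_mm_s=8.3, temp_c=61.0)
```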
Agriculture

One of the most powerful uses of data analytics at scale can be seen in the agriculture sector. For decades, researchers have grappled with challenges such as optimal plant breeding and seed design, which to date have been largely manual processes. Purpose-built drones and ground robotics have introduced new ways to conduct in-field inspection using computer vision, collecting information on height, biomass, and early vigor, and detecting anomalies. However, these robots often carry large data systems on-device, requiring manual labor to upload the data to the cloud for post-processing.

Using the edge, this entire workflow can be optimized. Starting with the ground robotics fleet, devices can be retrofitted with a 5G modem to disintermediate much of the persistent data collection. Instead, a device can collect data locally, extract relevant metadata, and immediately push data to the edge for real-time analytics and anomaly detection. In this way, field operators can collect insights about the entirety of their operations — across a given crop field or nationwide — without waiting for the completion of a given task.

Automotive

Modern vehicles are more connected than ever before, with almost all models produced today containing embedded SIM cards that enable even more connected experiences. Additionally, parallel advances are being made to enable roadside infrastructure connectivity. Together, these advances will power increased data sharing not just between vehicles but also between vehicles and the surrounding environment (V2V and V2X). In the shorter term, edge-based data processing has the potential to yield many benefits both to road users and to vehicle manufacturers.

Data quality and bandwidth optimization

Modern vehicles can transmit large amounts of data, not only telemetry relating to the status of the vehicle but also the observed status of the roads. If a vehicle detects that it is in a traffic jam, for example, it might relay this information so that updates can be made available to other vehicles in the area to alert drivers or replan programmed routes, as shown in Figure 3.

Figure 3. Mobile edge computing enables data generated from multiple sources within a vehicle to be shared instantly.

Although this is a useful feature, many vehicles may be reporting the same information. By default, all of this information will be relayed to the cloud for processing, which can result in large amounts of redundant data. Instead, through edge-based processing:

Data is shared more quickly between vehicles in a given area, using only local resources.
Costs relating to cloud-based data transfer are better controlled.
Network bandwidth usage is optimized.

While improved control of network usage is clearly beneficial, arguably a more compelling use of edge-based processing in the automotive industry is aggregating data received from many vehicles to improve the quality of the data sent to the cloud-based data store. In the example of a traffic jam, all of the vehicles transmitting information about road conditions do so based on their understanding gained through GPS as well as internal sensors. Some vehicles will send more complete or accurate data than others, but by aggregating the many different data feeds at the edge, the result is a more accurate, more complete representation of the situation.

The future

Read Part 1 of this blog series. Download our latest book on computing at the edge.
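As a closing illustration of the automotive example above, here is a minimal sketch of edge-side aggregation: many per-vehicle reports for the same road segment are collapsed into a single summary document before anything is forwarded to the cloud. The collection and field names are illustrative.

```python
from pymongo import MongoClient

# Illustrative names only; the edge instance holds the raw per-vehicle reports.
client = MongoClient("mongodb://localhost:27017")
reports = client["traffic"]["vehicle_reports"]

# Collapse many vehicle reports per road segment into one summary document,
# so only the aggregate (not every redundant report) is sent to the cloud.
pipeline = [
    {"$group": {
        "_id": "$road_segment",
        "avg_speed_kph": {"$avg": "$speed_kph"},
        "congestion_reports": {"$sum": {"$cond": ["$congested", 1, 0]}},
        "vehicles_reporting": {"$sum": 1},
        "last_report": {"$max": "$ts"},
    }},
    {"$match": {"vehicles_reporting": {"$gte": 3}}},  # require corroboration
]

for summary in reports.aggregate(pipeline):
    print(summary)  # in practice, forward these summaries to MongoDB Atlas
```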

July 21, 2022
Applied

Meet Gabriella Cohen: Expanding MongoDB’s Presence in Israel as the Newest Regional Vice President

Gabriella Cohen recently joined MongoDB’s sales leadership team as Regional Vice President for Israel. We sat down to discuss her career move from Amazon Web Services (AWS) to MongoDB and the opportunities for sellers in the Israeli market.

Jackie Denner: How did you come to join MongoDB, and what made you interested in the company?

Gabriella Cohen: I have been privileged to work with some of the most tech-savvy and innovative startups in Israel, and many of them use MongoDB as part of their core products. Developers love the solution. This drove me to learn more about MongoDB’s technology and the potential market in Israel. The total addressable market for databases in Israel is growing rapidly, and MongoDB is only scratching the surface of this market. The extreme potential for MongoDB in this emerging market excited me and drove me to join.

JD: Tell us about your experience prior to joining MongoDB.

GC: For the last six years I’ve been working for AWS Israel. I joined the team in 2016, when cloud was a new concept and the market was in its early stages of adoption. I started as an ISV Account Manager and grew into managerial positions as Startup Team Lead and Digital Native Businesses Team Lead. In my last role, I established and led the SMB segment at AWS Israel and was part of the AWS Israel leadership team.

JD: Why do you believe that Israel is such a strategic and important market for MongoDB?

GC: Israel is one of the most prominent innovation and technology hubs in the world, with an economy dominated by industrial high-tech and entrepreneurship. Israel ranks 15th among the 132 economies featured in the Global Innovation Index (GII) 2021. This innovation is reflected in the large number of startups active in Israel, over 7,000, which is a startup concentration per capita 14 times higher than Europe’s. Israel also attracts the highest rate of venture capital funding per capita in the world. Companies in Israel are tech-savvy and early adopters of innovative technologies — such as MongoDB. Many of them are “born in the cloud,” while the more traditional companies are trying to close the gap and are rapidly adopting different solutions that enable a true digital transformation. Most Israeli technology companies target and sell globally, resulting in a growing market opportunity for MongoDB.

I view Israel as a strategic market from three perspectives. First, there is a large business opportunity in the Israeli market that isn’t correlated to the size of the country. All the leading cloud vendors are opening data centers here, which indicates the high potential of the market. Second, given that the Israeli market has early adopters of innovative technologies, this could be a great learning opportunity for other regions on how to work with digital-native and startup businesses. Last, it provides a platform for the MongoDB Product team to gather feedback from tech-savvy users, which can improve our product and shape MongoDB’s future roadmap.

JD: What are you most excited about in expanding the Israeli market for MongoDB and continuing to build our sales organization in the country?

GC: MongoDB’s technology has a strong reputation in Israel; however, I believe it has yet to meet its business potential. The challenge of crafting a go-to-market strategy for Israel, one that suits the specific and unique market conditions, excites me. Together with the existing sales teams, I’m looking forward to expanding and growing the sales organization to expedite our ability to cover the market.
Finally, I’m really looking forward to evolving the MongoDB Israel office atmosphere, bringing together the MongoDB culture and the warm and friendly Israeli approach.

JD: How would you describe your leadership style?

GC: My leadership skills, style, and approach have evolved over the years. I believe in having a clear and inspiring vision, empowering the individual, generating value for my team, and leading by example. I tend to challenge the status quo and encourage creativity to explore new ways of doing things. The pandemic has added new challenges for leaders: on the one hand, having to lead remotely with minimal in-person interaction; on the other, having difficulty drawing the line between personal and professional as our homes and families have become the background scenery of our business meetings. As a leader, I view this change as an opportunity to bring my authentic self and to encourage my team to do the same. I believe this builds trust, creates a fun environment, and strengthens connections.

JD: What opportunities do you see at MongoDB?

GC: Careers aren’t linear, and MongoDB offers endless career opportunities in both professional and managerial positions. MongoDB’s global presence adds an exciting dimension to these opportunities. While the above might be relevant to other leading tech companies, I think what is unique about MongoDB is the hypergrowth we are experiencing. Working for exponential-growth companies requires constant creativity, a bias for action, and ownership. Each individual has the opportunity to shine, be successful, and develop new skills daily.

The MongoDB Israel office in Tel Aviv was established in 2019. We currently have over 30 employees in customer-facing roles. We also have a MongoDB customer support team based in Tel Aviv, which provides support to over 900 customers across all industries and sectors.

JD: What has your experience been as a woman in sales, and do you have advice for other women who are looking to build their sales careers and become leaders?

GC: I believe in being successful and generating value for my team and business, and the rest follows. The three tips I would give are:

Be the best version of yourself and insist on high standards. This positions you as a role model and drives success.
Avoid having two versions of yourself — your personal/home version and your work version — because we are the same individual and our power is in the total package.
It’s all about people. Leverage your interpersonal skills to drive business — to read into situations, connect with people, and understand the subtext.

Our sales team is growing in Israel and across the globe. We have several open roles and would love for you to transform your career with us!

July 21, 2022
Culture

7 Big Reasons to Upgrade to MongoDB 6.0

First announced at MongoDB World 2022, MongoDB 6.0 is now generally available and ready for download. MongoDB 6.0 includes the capabilities introduced with the previous 5.1–5.3 Rapid Releases and debuts new abilities to help you address more use cases, improve operational resilience at scale, and secure and protect your data. The common theme in MongoDB 6.0 is simplification: Rather than forcing you to turn to external software or third-party tools, these new MongoDB capabilities allow you to develop, iterate, test, and release applications more rapidly. The latest release helps developers avoid data silos, confusing architectures, wasted time integrating external tech, missed SLAs and other opportunities, and the need for custom work (such as pipelines for exporting data). Here’s what to expect in MongoDB 6.0.

1. Even more support for working with time series data

Used in everything from financial services to e-commerce, time series data is critical for modern applications. Properly collected, processed, and analyzed, time series data provides a gold mine of insights — from user growth to promising areas of revenue — helping you grow your business and improve your application. First introduced in MongoDB 5.0, time series collections provide a way to handle these workloads without resorting to adding a niche technology and the resulting complexity. In addition, it was critical to overcome obstacles unique to time series data, such as high volume, storage and cost considerations, and gaps in data continuity (caused by sensor outages).

Since their introduction, time series collections have been continuously updated and improved with a string of rapid releases. We began by introducing sharding for time series collections (5.1) to better distribute data, before rolling out columnar compression (5.2) to improve storage footprints, and finally moving on to densification and gap-filling (5.3) to allow teams to run time series analytics even when there are missing data points.

As of 6.0, time series collections include secondary and compound indexes on measurements, improving read performance and opening up new use cases like geo-indexing. By attaching geographic information to time series data, developers can enrich and broaden analysis to include scenarios involving distance and location. This could take the form of tracking temperature fluctuations in refrigerated delivery vehicles during a hot summer day or monitoring the fuel consumption of cargo vessels on specific routes. We’ve also improved query performance and sort operations. For example, MongoDB can now easily return the last data point in a series — rather than scanning the whole collection — for faster reads. You can also use clustered and secondary indexes to efficiently perform sort operations on time and metadata fields.
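For a sense of what this looks like in practice, here is a minimal sketch (using PyMongo) that creates a time series collection and adds a compound secondary index on its metadata and time fields. The collection, field, and index choices are illustrative, not prescribed by the release.

```python
from datetime import datetime, timezone
from pymongo import MongoClient, ASCENDING, DESCENDING

client = MongoClient("mongodb://localhost:27017")  # or your Atlas URI
db = client["fleet"]

# Create a time series collection (illustrative names); available since MongoDB 5.0.
if "truck_temps" not in db.list_collection_names():
    db.create_collection(
        "truck_temps",
        timeseries={"timeField": "ts", "metaField": "truck", "granularity": "minutes"},
    )
temps = db["truck_temps"]

temps.insert_one({
    "ts": datetime.now(timezone.utc),
    "truck": {"id": "truck-42", "route": "A90"},
    "temp_c": 3.8,
})

# New in 6.0: secondary and compound indexes on time series collections,
# which speed up queries like "latest reading for a given truck".
temps.create_index([("truck.id", ASCENDING), ("ts", DESCENDING)])

latest = temps.find({"truck.id": "truck-42"}).sort("ts", DESCENDING).limit(1)
print(list(latest))
```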
2. A better way to build event-driven architectures

With the advent of applications like Seamless or Uber, users have come to expect real-time, event-driven experiences, such as activity feeds, notifications, or recommendation engines. But moving at the speed of the real world is not easy, as your application must quickly identify and act on changes in your data. Introduced in MongoDB 3.6, change streams provide an API to stream any changes to a MongoDB database, cluster, or collection, without the high overhead that comes from having to poll your entire system. This way, your application can automatically react, generating an in-app message notifying you that your delivery has left the warehouse or creating a pipeline to index new logs as they are generated.

The MongoDB 6.0 release enriches change streams with abilities that take them to the next level. Now, you can get the before and after state of a document that’s changed, enabling you to send updated versions of entire documents downstream, reference deleted documents, and more. Further, change streams now support data definition language (DDL) operations, such as creating or dropping collections and indexes. To learn more, check out our blog post on change streams updates.

3. Deeper insights from enriched queries

MongoDB’s aggregation capabilities allow users to process multiple documents and return computed results. By combining individual operators into aggregation pipelines, you can build complex data processing pipelines to extract the insights you need. MongoDB 6.0 adds capabilities to two key operators, $lookup and $graphLookup, improving joins and graph traversals, respectively. Both $lookup and $graphLookup now provide full support for sharded deployments.

The performance of $lookup has also been upgraded. For instance, if there is an index on the foreign key and a small number of documents have been matched, $lookup can return results between 5 and 10 times faster than before. If a larger number of documents are matched, $lookup will be twice as fast as previous iterations. If there are no indexes available (and the join is for exploratory or ad hoc queries), then $lookup will yield a hundredfold performance improvement.

The introduction of read concern snapshot and the optional atClusterTime parameter enables your applications to execute complex analytical queries against a globally and transactionally consistent snapshot of your live, operational data. Even as data changes beneath you, MongoDB will preserve point-in-time consistency of the query results returned to your users. These point-in-time analytical queries can span multiple shards with large distributed datasets. By routing these queries to secondaries, you can isolate analytical workloads from transactional queries, with both served by the same cluster, avoiding slow, brittle, and expensive ETL to data warehouses. To learn more, visit our documentation.

4. More operators, less work

Boost your productivity with a slate of new operators, which enable you to push more work to the database while spending less time writing code or manipulating data manually. These new MongoDB operators automate key commands and long sequences of code, freeing up more developer time to focus on other tasks. For instance, you can easily discover important values in your data set with operators like $maxN, $minN, or $lastN. Additionally, you can use an operator like $sortArray to sort elements in an array directly in your aggregation pipelines.
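Here is a short, illustrative pipeline (using PyMongo) that combines the two improvements above: a $lookup join between two hypothetical collections, followed by the new $sortArray and $maxN operators to shape the joined results directly in the database rather than in application code.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["shop"]  # illustrative database with customers and orders collections

pipeline = [
    # Join each customer with their orders (improved in 6.0, including full
    # support for sharded collections).
    {"$lookup": {
        "from": "orders",
        "localField": "_id",
        "foreignField": "customer_id",
        "as": "orders",
    }},
    # New operators: sort the joined array and pick the top order totals
    # directly in the pipeline.
    {"$project": {
        "name": 1,
        "orders_by_total": {"$sortArray": {"input": "$orders", "sortBy": {"total": -1}}},
        "top_totals": {"$maxN": {"input": "$orders.total", "n": 3}},
    }},
]

for customer in db["customers"].aggregate(pipeline):
    print(customer["name"], customer["top_totals"])
```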
5. More resilient operations

From the beginning, MongoDB’s replica set design has allowed users to withstand and overcome outages. Initial sync is how a replica set member in MongoDB loads a full copy of data from an existing member — critical for catching up nodes that have fallen behind, or when adding new nodes to improve resilience, read scalability, or query latency. MongoDB 6.0 introduces initial sync via file copy, which is up to four times faster than the existing method. This feature is available with MongoDB Enterprise Server.

In addition to the work on initial sync, MongoDB 6.0 introduces major improvements to sharding, the mechanism that enables horizontal scalability. The default chunk size for sharded collections is now 128 MB, meaning fewer chunk migrations and higher efficiency from both a networking perspective and in internal overhead at the query routing layer. A new configureCollectionBalancing command also allows the defragmentation of a collection in order to reduce the impact of the sharding balancer.

6. Additional data security and operational efficiency

MongoDB 6.0 includes new features that eliminate the need to choose between secure data and efficient operations. Since its GA in 2019, client-side field-level encryption (CSFLE) has helped many organizations manage sensitive information with confidence, especially as they migrate more of their application estate into the public cloud. With MongoDB 6.0, CSFLE includes support for any KMIP-compliant key management provider. As a leading industry standard, KMIP streamlines the storage, manipulation, and handling of cryptographic objects like encryption keys, certificates, and more.

MongoDB’s support for auditing allows administrators to track system activity for deployments with multiple users, ensuring accountability for actions taken across the database. While it is important that auditors can inspect audit logs to assess activities, the content of an audit log has to be protected from unauthorized parties, as it may contain sensitive information. MongoDB 6.0 allows administrators to compress and encrypt audit events before they are written to disk, leveraging their own KMIP-compliant key management system. Encryption of the logs protects the events’ confidentiality and integrity; if the logs propagate through any central log management system or SIEM, they stay encrypted.

Additionally, Queryable Encryption is now available in preview. Announced at MongoDB World 2022, this pioneering technology enables you to run expressive queries against encrypted data — only decoding the data when it is made available to the user. This ensures that data remains encrypted throughout its lifecycle and that rich queries can be run efficiently without having to decrypt the data first. For a deep dive into the inner workings of Queryable Encryption, check out this feature story in Wired.

7. A smoother search experience and seamless data sync

Alongside the 6.0 major release, MongoDB is also making ancillary features generally available or available in preview. The first is Atlas Search facets, which enable fast filtering and counting of results so that users can easily narrow their searches and navigate to the data they need. Released in preview at MongoDB World 2022, facets now include support for sharded collections.

Another important new addition is Cluster-to-Cluster Sync, which enables you to effortlessly migrate data to the cloud, spin up dev, test, or analytics environments, and support compliance requirements and audits. Cluster-to-Cluster Sync provides continuous, unidirectional data synchronization of two MongoDB clusters across any environment, be it hybrid, Atlas, on-premises, or edge. You’ll also be able to control and monitor the synchronization process in real time, starting, stopping, resuming, or even reversing the synchronization as needed.
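To illustrate the Atlas Search facets mentioned above, here is a minimal sketch of a faceted query using the $searchMeta stage against the Atlas sample_mflix sample dataset. It assumes an Atlas cluster with a search index (named "default" here) that indexes year as a number and maps genres as a facetable string field; the index and field choices are illustrative.

```python
from pymongo import MongoClient

# Requires an Atlas cluster with an Atlas Search index that supports faceting
# on the chosen fields; the URI below is a placeholder.
client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net")
movies = client["sample_mflix"]["movies"]

pipeline = [
    {"$searchMeta": {
        "index": "default",
        "facet": {
            # Narrow results to movies released between 2000 and 2015...
            "operator": {"range": {"path": "year", "gte": 2000, "lte": 2015}},
            # ...and count how many fall into each genre bucket.
            "facets": {
                "genresFacet": {"type": "string", "path": "genres", "numBuckets": 5},
            },
        },
    }},
]

print(list(movies.aggregate(pipeline)))  # bucket counts per genre, plus a total
```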
Ultimately, MongoDB 6.0’s new abilities are intended to facilitate development and operations, remove data silos, and eliminate the complexity that accompanies the unnecessary use of separate niche technologies. That means less custom work, less troubleshooting, fewer confusing architectures — and more time for brainstorming and building. MongoDB 6.0 is not an automatic upgrade unless you are using Atlas serverless instances. If you are not an Atlas user, download MongoDB 6.0 directly from the download center. If you are already an Atlas user with a dedicated cluster, take advantage of the latest, most advanced version of MongoDB. Here’s how to upgrade your clusters to MongoDB 6.0.

July 19, 2022
News

Change Streams in MongoDB 6.0 Support Pre- and Post-Image Retrieval, DDL Operations, and More

Introduced with MongoDB 3.6, a MongoDB change stream is an API on top of the operations log (oplog) that allows users to subscribe their applications to data changes in a collection, database, or entire deployment. It makes it easy for teams to build event-driven applications or systems on MongoDB that capture and react to data changes in near real time — no middleware or database polling scripts required. For MongoDB 6.0, we have enhanced change streams with new functionality that addresses a wider range of use cases while improving performance.

Change streams now allow users to easily retrieve the before and after state of an entire document — sometimes referred to as pre- and post-images, respectively — when a document is either updated or deleted. Let’s suppose that you are storing user sessions in a collection and using a time-to-live (TTL) index to delete sessions as they expire. You can now reference data in the deleted documents to provide more information to the end user about their session after the fact. Or maybe you need to send an updated version of the entire document to a downstream system each time there is a data change. Added support for retrieving the before and after states of a document greatly expands the use cases change streams can address.

Prior to MongoDB 6.0, change streams only supported data manipulation language (DML) operations. Change streams in MongoDB 6.0 will now support data definition language (DDL) operations, such as creating and dropping indexes and collections, so you can react to database events in addition to data changes.

Change streams are built on MongoDB’s aggregation framework, which gives teams the capability to not only capture and react to data changes, but also to filter and transform the associated notifications as needed. With MongoDB 6.0, change streams that leverage filtering will have those stages automatically pushed to the optimal position within a change stream pipeline, dramatically improving performance.

We’re excited to announce these enhancements to change streams with MongoDB 6.0 and look forward to seeing and hearing about all the applications and systems you’ll build with this expanded feature set. To learn more, visit our docs.
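As a minimal sketch of the pre- and post-image workflow described above (using PyMongo 4.2 or later against a MongoDB 6.0 replica set), the example below enables pre- and post-images on a hypothetical orders collection and opens a change stream that returns both document states.

```python
from pymongo import MongoClient

# Change streams require a replica set or Atlas cluster; URI is a placeholder.
client = MongoClient("mongodb://localhost:27017")
db = client["app"]
orders = db["orders"]  # illustrative collection name

# Opt the collection in to storing pre- and post-images (MongoDB 6.0+).
db.command({"collMod": "orders", "changeStreamPreAndPostImages": {"enabled": True}})

# Open a change stream that returns the document state before and after each change.
with orders.watch(
    full_document="whenAvailable",                # post-image
    full_document_before_change="whenAvailable",  # pre-image
) as stream:
    for change in stream:
        before = change.get("fullDocumentBeforeChange")
        after = change.get("fullDocument")
        print(change["operationType"], before, after)
```

Note that pre- and post-images are only returned when they are available; use "required" instead of "whenAvailable" if your application must fail loudly when an image is missing.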

July 19, 2022
Updates
