We’re excited to announce a new feature for Monitoring in both Cloud Manager and Atlas: the Query Targeting chart. This chart tracks two metrics: “scanned/returned” and “scanned objects/returned”.
“Scanned/returned” refers to the ratio between the number of index items scanned and the number of documents returned by queries. If this value is 1.0, then your query scanned exactly as many index items as documents it returned – it’s an efficient query. This is available for MongoDB 2.4 and newer.
“Scanned objects/returned” is similar, except it’s about the number of documents scanned versus the number returned. A large number is a sign that you may need an index on the fields you are querying on. This metric is available for MongoDB 2.6 and newer.
To understand this chart a little better, consider a collection with 1000 documents in it. We issue a query without an index (so it is a collection scan). Scanned objects/returned for this query could be as bad as 1000, but the average value would be 500. Now, let’s add an index that supports that same query and return one document: we scan only one index item and one document. This means that scanned/returned is 1, and scanned objects/returned is also 1. Finally, let’s say you run a covered query. In this case scanned/returned is 1, but scanned objects/returned is 0, because the index has all the data you requested, so you didn’t need to read any documents!
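The arithmetic behind those three scenarios can be sketched in a few lines of Python. This is only an illustration: the function and argument names are hypothetical, the chart itself may aggregate its values differently, and the sketch assumes that a collection scan touches no index items.

```python
def targeting_ratios(keys_scanned, docs_scanned, docs_returned):
    """Return (scanned/returned, scanned objects/returned) for one query."""
    if docs_returned <= 0:
        raise ValueError("ratios are undefined when no documents are returned")
    return keys_scanned / docs_returned, docs_scanned / docs_returned


# The three scenarios from the 1000-document example, each returning one document.
# Assumption: the collection scan examines every document and no index items.
scenarios = {
    "collection scan (worst case)": targeting_ratios(0, 1000, 1),
    "indexed query": targeting_ratios(1, 1, 1),
    "covered query": targeting_ratios(1, 0, 1),
}

for name, (scanned, scanned_objects) in scenarios.items():
    print(f"{name}: scanned/returned={scanned}, "
          f"scanned objects/returned={scanned_objects}")
```

A value near 1 for either ratio suggests a well-targeted query, while a large scanned objects/returned value points at a missing index.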
This feature is available for all Cloud Manager and Atlas deployments. We believe this new chart will help you refine your queries and indexes to get the best performance out of your MongoDB deployment. However, if you need more help, the Visual Profiler as part of Cloud Manager Premium can help you identify slow queries and suggest indexes as well. Contact your Account Executive for more information about MongoDB subscriptions with access to Cloud Manager Premium.
Peter C. Gravelle is a Technical Account Manager at MongoDB, Inc. He can be found via Atlas’ chat option as well as in tickets. He can also be found in New York City.
Building Applications with MongoDB's Pluggable Storage Engines: Part 1
This is the first post in a two-part series about MongoDB’s pluggable storage engines. It discusses the characteristics of each storage engine.

**Introduction**

With users building increasingly complex data-driven apps, there is no longer a "one size fits all" database storage technology capable of powering every type of application built for the enterprise. Modern applications need to support a variety of workloads with different access patterns and price/performance profiles: from low latency, in-memory read and write applications, to real time analytics, to highly compressed "active" archives.

Through the use of pluggable storage engines, MongoDB can be extended with new capabilities, and configured for optimal use of specific hardware architectures. This approach significantly reduces developer and operational complexity compared to running multiple database technologies. Storage engines can be mixed in the same replica set or sharded cluster. Users can also leverage the same MongoDB query language, data model, scaling, security and operational tooling across different applications, each powered by different pluggable MongoDB storage engines.

**Figure 1:** Mix and match storage engines within a single MongoDB replica set

MongoDB 3.2 ships with four supported storage engines that can be optimized for specific workloads:

- The default WiredTiger storage engine. For most applications, WiredTiger's granular concurrency control and native compression will provide the best all-around performance and storage efficiency.
- The Encrypted storage engine, protecting highly sensitive data, without the performance or management overhead of separate file system encryption. The Encrypted storage engine is based upon WiredTiger, so throughout this post, statements regarding WiredTiger also apply to the Encrypted storage engine. This engine is part of MongoDB Enterprise Advanced.
- The In-Memory storage engine for applications that have extremely strict SLAs for consistent and predictable low latency, while not requiring disk durability for the data. This engine is part of MongoDB Enterprise Advanced.
- The MMAPv1 engine, an improved version of the storage engine used in pre-3.x MongoDB releases. MMAPv1 was the default storage engine in MongoDB 3.0.

MongoDB allows users to mix and match multiple storage engines within a single MongoDB cluster. This flexibility provides a simple and reliable approach to support diverse workloads. Traditionally, multiple database technologies would need to be managed to meet these needs, with complex, custom integration code to move data between technologies, and to ensure consistent, secure access. With MongoDB’s flexible storage architecture, the database automatically manages the movement of data between storage engine technologies using native replication. This approach significantly reduces developer and operational complexity when compared to running multiple distinct database technologies.

**Table 1:** Comparing the MongoDB WiredTiger, In-Memory, Encrypted, and MMAPv1 storage engines

**WiredTiger Storage Engine**

MongoDB acquired WiredTiger in 2014, and with it the experts behind the WiredTiger storage engine: co-founders Keith Bostic (founder of Sleepycat Software) and Dr. Michael Cahill, and their colleagues. Bostic and Cahill were the original architects of Berkeley DB, the most widely-used embedded data management software in the world, and have decades of experience writing high performance storage engines. WiredTiger leverages modern hardware architectures and innovative software algorithms to provide industry-leading performance for the most demanding applications.

WiredTiger is ideal for a wide range of operational applications and is therefore MongoDB’s default storage engine.
It should be the starting point for all new applications, with the exception of cases where you need the specific capabilities of the In-Memory or Encrypted storage engines. The key advantages of WiredTiger include:

- Maximize Available Cache: WiredTiger maximizes use of available memory as cache to reduce I/O bottlenecks. Two caches are used: the WiredTiger cache and the filesystem cache. The WiredTiger cache stores uncompressed data and provides in-memory-like performance. The operating system’s filesystem cache stores compressed data. When data is not found in the WiredTiger cache, WiredTiger will look for the data in the filesystem cache.

  **Figure 2:** WiredTiger Caches (WiredTiger Cache and FS Cache)

  Data found in the filesystem cache first goes through a decompression process before moving to the WiredTiger cache. The WiredTiger cache performs best when it holds as much of the working set as possible. However, it is also important to reserve memory for other processes that need it, such as the operating system, including the filesystem cache. This also includes MongoDB itself, which as a whole will consume more memory than what is in active use by WiredTiger. MongoDB defaults to a WiredTiger cache size of approximately 60% of RAM. At least 20% of available memory should be left for the filesystem cache; with anything lower, the operating system may be constrained for resources.
- High Throughput: WiredTiger uses “copy on write”: when a document is updated, WiredTiger makes a new copy of the document and determines the latest version to return to the reader. This approach allows multiple clients to simultaneously modify different documents in a collection, resulting in higher concurrency and throughput. Optimum write performance is achieved when an application is utilizing a host with many cores (the more the better), and multiple threads are writing to different documents.
- Reducing Storage Footprint and Improving Disk IOPS: WiredTiger uses compression algorithms to reduce the amount of data stored on disk. Not only is storage reduced, but IOPS performance is increased as fewer bits are read from or written to disk. Some types of files compress better than others. Text files are highly compressible, while binary data may not be as compressible since it may already be encoded and compressed. WiredTiger does incur additional CPU cycles when using compression, but users can configure compression schemes to optimize CPU overhead vs. compression ratio. Snappy, which is the default compression engine, provides a good balance between high compression ratio and low CPU overhead. Zlib will achieve higher compression ratios, but incur additional CPU cycles.
- Compression (Indexes and Journals): Indexes can be compressed in memory as well as on disk. WiredTiger utilizes prefix compression to compress the indexes, conserving RAM usage as well as freeing up storage IOPS. Journals are compressed by default with Snappy compression.
- Multi-Core Scalability: As CPU manufacturers shrink to smaller lithographies and power consumption becomes more and more of an issue, processor trends have shifted to multi-core architectures in order to sustain the cadence of Moore’s law. WiredTiger was designed with modern, multi-core architectures in mind, and provides scalability across multi-core systems. Programming techniques such as hazard pointers, lock-free algorithms, and fast latching minimize contention between threads. Threads can perform operations without blocking each other, resulting in less thread contention, better concurrency, and higher throughput.
- Read Concern: WiredTiger allows users to specify a level of isolation for their reads. Read operations can return a view of data that has been accepted or committed to disk by a majority of the replica set. This provides a guarantee that applications only read data that will persist in the event of failure and won’t get rolled back when a new replica set member is promoted to primary.

For more information on migrating from MMAPv1 to WiredTiger, see the MongoDB documentation.

**Encrypted Storage Engine**

Data security is top of mind for many executives due to increased attacks as well as a series of data breaches in recent years that have negatively impacted several high profile brands. For example, in 2015, a major health insurer was a victim of a massive data breach in which criminals gained access to the Social Security numbers of more than 80 million people, resulting in an estimated cost of $100M. In the end, one of the critical vulnerabilities was that the health insurer did not encrypt sensitive patient data stored at rest.

Coupled with MongoDB’s extensive access control and auditing capabilities, encryption is a vital component in building applications that are compliant with standards such as HIPAA, FERPA, PCI, SOX, GLBA, ISO 27001, etc. The Encrypted storage engine is based on WiredTiger, and thus is designed for operational efficiency and performance:

- Document level concurrency control and compression
- Support for Intel’s AES-NI equipped CPUs for acceleration of the encryption/decryption process
- As documents are modified, only updated storage blocks need to be encrypted rather than the entire database

With the Encrypted storage engine, protection of data at rest is an integral feature of the database. The raw database “plaintext” content is encrypted using an algorithm that takes a random encryption key as input and generates ciphertext that can only be decrypted with the proper key. The Encrypted storage engine supports a variety of encryption algorithms from the OpenSSL library. AES-256 in CBC mode is the default, while other options include AES-256 in GCM mode, as well as FIPS mode for FIPS 140-2 compliance.
Encryption is performed at the page level to provide optimal performance. Instead of having to encrypt/decrypt the entire file or database for each change, only the modified pages need to be encrypted or decrypted, resulting in less overhead and higher performance.

Additionally, the Encrypted storage engine provides safe and secure management of the encryption keys. Each encrypted node contains an internal database key that is used to encrypt or decrypt the data files. The internal database key is wrapped with an external master key, which must be provided to the node for it to initialize. To ensure that keys are never written or paged to disk in unencrypted form, MongoDB uses operating system protection mechanisms, such as VirtualLock and mlock, to lock the process’s virtual memory space into memory.

There are two primary ways to manage the master key: through an integration with a third-party key management appliance via the Key Management Interoperability Protocol (KMIP), or local key management via a keyfile. Most regulatory requirements mandate that the encryption keys be rotated and replaced with a new key at least once annually. MongoDB can achieve key rotation without incurring downtime by performing rolling restarts of the replica set. When using a KMIP appliance, the database files themselves do not need to be re-encrypted, thereby avoiding the significant performance overhead imposed by key rotation in other databases. Only the master key is rotated, and the internal database keystore is re-encrypted. It is recommended to use a KMIP appliance with the Encrypted storage engine.

**In-Memory Storage Engine**

In modern applications, different subsets of application data have different latency and durability requirements. The In-Memory storage engine option was created for applications that have extremely strict SLAs even at 99th percentiles. The In-Memory engine keeps all of the data in memory and does not write anything to disk. Data always has to be populated on start-up, and nothing can be assumed to be present on restart, including application data and system data (i.e., users, permissions, index definitions, oplog, etc.). All data must fit into the specified in-memory cache size.

The In-Memory storage engine combines the predictable latency benefits of an “in memory cache” with the rich query and analytical capabilities of MongoDB. It has the advantage of using the exact same APIs as any other MongoDB server, so your applications do not need special code to interact with the cache, such as handling cache invalidation as data is updated. In addition, a mongod that's configured with the In-Memory storage engine can be part of a replica set, and thus can have another node in the same replica set backed by fast persistent storage.

The In-Memory engine is currently supported on MongoDB 3.2.6+. For performance metrics on the In-Memory storage engine, see the MongoDB Pluggable Storage Engine white paper. For applications requiring predictable latencies, the In-Memory engine is the recommended storage engine, as it provides low latency while also minimizing tail latencies, resulting in high performance and a consistent user experience.

Some of the key benefits of the In-Memory engine:

- Predictable and consistent latency for applications that want to minimize latency spikes
- Applications can combine separate caching and database layers into a single layer, all accessed and managed with the same APIs, operational tools, and security controls
- Data redundancy with use of a WiredTiger secondary node in a replica set

**MMAPv1 Storage Engine**

The MMAPv1 engine is an improved version of the storage engine used in pre-3.x MongoDB releases. It utilizes collection level concurrency and memory mapped files to access the underlying data storage. Memory management is delegated to the operating system. This prevents compression of collection data, though journals are compressed with Snappy.
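Returning to the WiredTiger cache guidance earlier in this post (a cache of approximately 60% of RAM, with at least 20% of memory left for the filesystem cache), the arithmetic can be sketched as follows. The function name and the "other" bucket are hypothetical, and the fractions are simply the approximate figures quoted above; actual defaults depend on the MongoDB version and configuration.

```python
def memory_budget_gb(total_ram_gb, wt_cache_fraction=0.60, fs_cache_min_fraction=0.20):
    """Split host RAM using the approximate fractions quoted in this post."""
    wt_cache = total_ram_gb * wt_cache_fraction
    fs_cache_min = total_ram_gb * fs_cache_min_fraction
    other = total_ram_gb - wt_cache - fs_cache_min
    if other < 0:
        raise ValueError("cache fractions exceed available memory")
    return {
        "wiredtiger_cache": wt_cache,          # uncompressed working set
        "filesystem_cache_min": fs_cache_min,  # compressed data cached by the OS
        "other": other,                        # rest of mongod, OS, other processes
    }


# Example: a host with 64 GB of RAM.
for name, size in memory_budget_gb(64).items():
    print(f"{name}: {size:.1f} GB")
```

If the working set is much larger than the WiredTiger cache share, more reads fall through to the filesystem cache or disk, which is the situation the cache guidance is meant to help avoid.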
In the second part of this blog series, we will discuss how to select which storage engine to use.

Learn more about MongoDB’s pluggable storage engines in the Pluggable Storage Engine Architecture whitepaper.

**About the author - Jason Ma**

Jason Ma is a Principal Product Marketing Manager based in Palo Alto, and has extensive experience in technology hardware and software. He previously worked for SanDisk in Corporate Strategy doing M&A and investments, and as a Product Manager on the Infiniflash All-Flash JBOF. Before SanDisk, he worked as a hardware engineer at Intel and Boeing. Jason has a BSEE from UC San Diego, MSEE from the University of Southern California, and an MBA from UC Berkeley.
4 Steps to Success: From Surviving with Legacy Systems to Thriving with MongoDB
Legacy data migrations imply a change in the status quo. More often than not, when an organization finally undertakes a thorough analysis of its technology landscape, it arrives at the same decision: to do nothing. It is an understandably daunting task to upgrade or replace 20+ year-old applications and their database counterparts. But there are good reasons, beyond the tri-annual hardware upgrade, to propel those legacy monoliths of the 1990s into the 21st century.

Companies that prevailed, and even triumphed, in the volatile spring of 2020 were those that transitioned to a more flexible usage model and were therefore able to adjust their business models more rapidly and reliably. MongoDB’s client, Sanoma, was one of the winners. Sanoma was able to scale from 3,000 to 150,000 users within 24 hours, without any service interruption.

Innovation and modernization go hand in hand. However, while modernization can sadly occur without innovation, the opposite is simply not possible.

**A bit of history**

The concept of bringing data together through online data layers (ODL) or operational data stores (ODS) isn't new or specific to MongoDB. Accessing legacy systems, bringing data together, and making it all more easily accessible was a common goal even 20 years ago, and led to the search for the golden source of truth (i.e., the definitive master source for any given entity). This search proved elusive early on due to the hurdles involved with bringing data from diverse, over-structured relational constructs to a sole target called Operational Data Store (ODS) or Online Data Layer (ODL).

The industry’s first attempts began with object-oriented databases, then with the dead end of XML data stores. (In my personal opinion, XQuery and XPath were never meant for real developers.) After both endeavors failed came the wave of Apache efforts I like to call “Hadoop Solves the Planet,” in which companies dumped all their structured data onto a big-data treasure trove.
Unfortunately, this resulted in a data desert rather than the data lake everybody was hoping for, since organizations then had to scramble to build a concept for secondary indexing, data dictionaries, and more, on top of having to rebuild the sensible structures they lost.

In the 2010s, the document model, in conjunction with JSON notation, emerged as the new de facto standard. MongoDB release 3.x combined ACID (atomicity, consistency, isolation, durability) compliance with support for a broad range of data types (in BSON, for those in the know). Soon, the MongoDB team started implementing additional features of relational heritage: secondary indexing, ACID transactions, aggregations and manipulations of data in place, materialized views, joins, unions... the list goes on.

**Where we are now**

MongoDB documents can be enriched through different means and channels without touching the content; the consistency of all data and data lineage is implicitly guaranteed. A typical example is the extraction of a delivery address through a supply chain application and a billing address through an enterprise resource planning system. In many cases, those two systems have different requirements. MongoDB documents simply keep both instantiations intact and can even hold multiples of each attached to one single client profile without the need to complete loads and transformations, foreign keys, and all the other ingredients of the relational past. MongoDB simply adds and leverages other sources without destroying their context.

MongoDB delivers an ODS and ODL experience while streamlining the time-consuming journey of replacing legacy application code. The data platform of true modernization and innovation has arrived!

**How your company can get here**

The entire journey can be summarized in four simple steps:

1. Analysis: Where do I start my data journey to drive the fastest value?
2. Scaffolding: How do I get my data out of the existing platform and bridge it to the new platform?
3. Coding: How do I enter the world of adjusting and adapting my applications landscape?
4. Innovation: Which are the easiest targets for my company to start achieving true innovation?

The following sections answer these four questions and provide you with a starting point for your journey toward a new and improved solution landscape.

**Step 1: Analysis of your existing solution landscape**

**Data provisioning**

Data provisioning (the act of bringing data from source systems to a target system) is actually the easy part of this step. Opinions may vary as to the very best approach, but most existing models for streaming data in real time make the process elegant and allow for a business-driven decision anywhere between real-time replication on one end and batches of CSV files on the other.

**Application onboarding**

More exciting is the application onboarding phase, inclusive of the selection and design of initial data domains. Here, simple mechanisms derived from the classic priority concepts can assist (and yes, they existed long before computers).

Data domains already exist in the business logic, represented as objects in the various programming languages. But even the most talented application developer deals with constant changes, which lead to compromises in those objects and can obfuscate the original clarity of their design, so the objects may hide in plain sight. Unearthing those gems and aligning them to the ODS is the most important step towards true legacy modernization.

The simplest solution is actually the most practical one: load an object with the existing software and persist it into a MongoDB collection. The effort of persisting the object results in two lines of code that can be easily added.
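A minimal sketch of that idea in Python, under stated assumptions: the `ClientProfile` and `Address` classes, the `persist` helper, and the database and collection names are all hypothetical, and a small in-memory stand-in replaces the real collection so the example runs without a server. With the PyMongo driver, the two lines would look like the ones shown in the comments.

```python
from dataclasses import dataclass, asdict


@dataclass
class Address:
    street: str
    city: str


@dataclass
class ClientProfile:
    client_id: int
    name: str
    delivery_address: Address
    billing_address: Address


class InMemoryCollection:
    """Stand-in exposing only insert_one, so the sketch runs without MongoDB."""

    def __init__(self):
        self.documents = []

    def insert_one(self, document):
        self.documents.append(document)
        return document


def persist(collection, obj):
    """Store a dataclass instance as one document; nested objects become sub-documents."""
    return collection.insert_one(asdict(obj))


# The object as the existing application already builds it.
profile = ClientProfile(
    client_id=42,
    name="Example Client",
    delivery_address=Address("1 Harbor Road", "Hamburg"),
    billing_address=Address("2 Market Street", "Berlin"),
)

# With PyMongo, the two added lines would be (URI and names are placeholders):
#   collection = MongoClient(uri)["legacy"]["client_profiles"]
#   collection.insert_one(asdict(profile))
collection = InMemoryCollection()
persist(collection, profile)
print(collection.documents[0])
```

Note that no mapping layer is involved: the nested addresses are stored inside the profile document exactly as the object holds them.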
The location of the two lines of code (the first opens a connection to the database; the second persists the object) does not matter, as long as they come after the object is built. This is the first time you will see the beauty of MongoDB and MQL at work. You do not have to do anything to the object itself (e.g., no decomposition or abstraction layer). MongoDB takes care of it for you. When looking at the object in the MongoDB database, e.g. using MongoDB Compass, you will realize that it already looks a lot like the domain object you wanted. The actual task of mapping objects to domains, or subsets of domains, is now mostly driven by the application use case.

**Tip: How to leverage application mapping to accelerate onboarding**

In the model below, which was taken from the financial industry but can easily be adopted across industries, we identify the data domains in various applications and map their behavior to the effort it takes to locate them as well as their importance to the app.

First, each domain gets a rating for its object complexity, where “complexity” is defined by the implementation team. This is similar to the concept of “poker” in a development sprint. Second, each data domain must be located in the application content. Then, it’s tally time.

As we can see in the example above, the concept of schedules looks quite easy but is superseded by the client profiles, which have a touch more application context (spoiler: those always come out on top). Based on the combination of complexity and the number of data domains affecting an application, we can now easily achieve the model below.

Agile is your friend and, assuming a certain “point capacity,” the applications fall into place for their conversion schedule in a quite neutral fashion. The development team will then start with the low-hanging fruit. As soon as applications 1, 6 and 7 are ported, we’re in business in a new modern landscape.
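The tally described in this tip can be sketched in a few lines of Python. Everything here is hypothetical: the domain names, the complexity ratings, the application names, the point capacity, and the greedy scheduling rule are one possible reading of the approach, not the model from the original figures.

```python
# Hypothetical complexity ratings agreed on by the implementation team.
domain_complexity = {"client_profile": 3, "account": 2, "schedule": 1}

# Hypothetical mapping of which data domains each application touches.
app_domains = {
    "app1": ["schedule"],
    "app2": ["client_profile", "account", "schedule"],
    "app3": ["client_profile"],
}


def app_points(app_domains, domain_complexity):
    """Tally each application's points as the sum of its domains' complexity."""
    return {
        app: sum(domain_complexity[d] for d in domains)
        for app, domains in app_domains.items()
    }


def plan_sprints(points, capacity):
    """Greedy schedule: lowest-point applications first, a new sprint once capacity is hit."""
    sprints, current, used = [], [], 0
    for app, pts in sorted(points.items(), key=lambda item: (item[1], item[0])):
        if current and used + pts > capacity:
            sprints.append(current)
            current, used = [], 0
        current.append(app)
        used += pts
    if current:
        sprints.append(current)
    return sprints


points = app_points(app_domains, domain_complexity)
print(points)
print(plan_sprints(points, capacity=4))
```

Sorting by points puts the low-hanging fruit first; a team could just as well weight by business priority, which this sketch deliberately leaves out.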
Along the journey, the domains will get cleaned up naturally, as we do not have the static corset of the RDBMS table designs.

**Step 2: Scaffolding**

Scaffolding is the art of building a bridge that can hold people as they cross it, then immediately dissipate once they step off. But for that critical time, it needs to hold. The same is true for the connectivity between a legacy system and a new data platform.

Starting with the first sprint, we have data residing in the MongoDB data platform. If the data is limited to new applications and resides exclusively in MongoDB, nothing needs to be done. However, as shown in the client profiles example above, there may be dependencies to consider.

The synchronization between the legacy database and the new MongoDB platform can be easily arranged using microservices and the same concepts used for the initial loading of data. Synchronization can also be achieved through “the gate” if only READ data is needed during the first sprint, or if you’re already dealing with WRITE and the requirement to synchronize those writes back to a legacy system.

- Streaming: A streaming-based solution is a great option for unidirectional, read-only operations, and is the simplest approach.
- Service: Selecting a simple, tiny microservice is a good option for the use case where data needs to be selectively written. It works using the document model on the MongoDB side, but can still push necessary updates back to the legacy system, and vice versa. The great news is that this service potentially exists already, as it requires nothing more than using the old database interface from the legacy application on one side and the new, easy-to-digest JSON document format on the MongoDB side. If both databases are ACID-compliant, any transaction is automatically treated as a normal application interaction on both sides.
- “Y-Loader”: Another option is a true “Y-loader,” where all transactions are written in sync to both databases in parallel, and the actual transaction is only considered committed when both systems report their commit and completion. Simple two-phase protocols (write to both, wait five seconds, read both to validate and, if in sync, commit to application) are available as ready-made services through various distributed transaction coordinators, but often it’s easier to use the existing data access in the application. In that case, the new data path to MongoDB runs in parallel, and a simple redundant checkpoint (which the application logic would have had for the legacy path anyway) is expanded for this purpose.

**Step 3: Coding**

Coding against the new domain data model, with the flexible MongoDB document model as the underlying base, will immediately impact business logic and application development. The operative word is immediately. As the data gets unlocked with the initial persistence of the code object to the MongoDB collection, the developer is simultaneously able to code based on business requirements.

Developers will no longer be hindered by the references and requirements of object mappers. As the objects are represented through the MongoDB idiomatic drivers, each programming object resides directly in the data collection; in reverse, any changes to the business logic object will be naturally represented (code-free) in the MongoDB collection.

A single blog post can't resolve all open questions and edge cases. Each application, client, and data interface is unique. Databases possess historic technical debt and implicit assumptions that become lost in generations of developers over time. “Do not touch this section—not sure what it does but last time we tried all hell broke loose…” is often-heard advice around the organizational water cooler. But the key lesson?
There are many different templates available, and very simple methods for quickly getting to significant success. For example, a German client, who was stuck in a combination of IBM DB2 (mainframe and distributed) with a significant Hadoop footprint, was amazed when they realized they could “lift” their data one microservice at a time. This resulted in business requirements shifting from “impossible to do” for some requested queries to “completed in under one second” within a single week of a proof-of-concept.

This case is not an exception. Changes like these are made daily, reinforcing Mark Twain’s sage advice that “The secret of getting ahead is getting started.”

**Step 4: Innovation**

As the migration from the legacy environment continues, innovation will be the new focus. The unlocking of previously siloed data allows immediate coupling of real-time data with machine learning platforms for various purposes: e.g. scoring for financial decision-making, personalization for retail, or optimization of production processes in the IoT context.

New applications and solutions can easily be created on top of the unleashed data, even with various programming languages, direct real-time dashboards created with MongoDB Charts, and different paradigms (again, MongoDB’s idiomatic drivers do magic!). At this time, the discussion with the product owners in your squads and tribes (trying to be real modern here) begins with the questions “What is the highest priority component to change?” and “What function is required to enable this change?”

Is it worth waiting much longer? The real question is: why did we all not start sooner? It’s time to begin integrating the list of features you always dreamed of having, but never dared to pursue. The MongoDB team is here to help you get started. Reach out today and let’s discuss the best path forward.