MongoDB Blog

Articles, announcements, news, updates and more

AWS and MongoDB: Partners in Reliable, Resilient Cloud Environments

Security is increasingly critical for application development. As the volume of applications developed, distributed, used, and patched over networks rapidly expands, so, too, do cyberattacks and data breaches, many of which happen at the web application layer. As more organizations move to the cloud, it’s imperative for customers to know who’s responsible for what when it comes to security. Understanding these roles and responsibilities is crucial for ensuring cloud workloads remain secure and available. MongoDB and AWS are working together to simplify and strengthen data security for our customers so they can focus on developing great applications and user experiences. For more information on shared responsibility, read the first blog in this series.

Shared responsibility in the cloud

Back when most IT environments lived on premises, the responsibility of securing the systems and networked devices fell squarely on the owner of the assets — usually the business owner or a managed service provider. Today, with the prevalence of cloud applications, hybrid environments, and pay-as-you-go services, it is often not clear who's responsible for what when it comes to securing those environments, services, and the data they contain. For this reason, the shared responsibility model of cloud security has emerged. Under the shared responsibility model, some security responsibilities fall on the business, some on public cloud providers, and some on the vendors of the cloud services being used. When you deploy a MongoDB Atlas database on AWS, the database is created on infrastructure operated, managed, and controlled by AWS, from the host operating system and virtualization layer down to the physical security of the AWS data centers. MongoDB is responsible for the security and availability of the services we offer — and for everything within the scope of our responsibilities as a SaaS vendor. Customers are responsible for the security of everything above the application layer — accounts, identities, devices, and data — plus the management of the guest operating system, including updates and security patches; associated application software; and the configuration of the AWS-provided security group firewall. (See Figure 1.)

Figure 1. Shared responsibility when using MongoDB Atlas.

Strategic partners in data solutions

MongoDB Chief Information Security Officer Lena Smart delivered a keynote at AWS re:Inforce, an event where security experts offered tips and best practices for securing workloads in the cloud, and was also interviewed by theCUBE. Smart noted how MongoDB and AWS are working together to enable our joint customers to focus more on business objectives while having confidence in the cloud services and infrastructure they get from us. "You want to worry less about security so that you can focus on application development, performance, availability, business continuity, data management, and access," Smart said. "As the CISO of MongoDB, these concerns are also my top concerns as we work to better serve our global customer base. And we are very appreciative of the opportunity to do this in lockstep with AWS." Jenny Brinkley, Director, AWS Security, agrees that customers stand to benefit through the shared responsibility model. "The shared responsibility model is a huge reason why more customers are deploying in the cloud," Brinkley said. "AWS, combined with marketplace services like MongoDB Atlas, help relieve the customer's operational burden so they can focus on driving their businesses forward."

Smart's appearance at the event is just one example of how MongoDB and AWS are working together to deliver scalable data intelligence solutions for enterprise data in the cloud, reduce risk for cloud-native tools, and enable our joint customers to achieve compliance and protect their sensitive data. Thanks to our strategic partnership, organizations around the globe and across a wide range of industries — from banking and airlines to insurance and e-commerce — are better able to discover, manage, protect, and get more value from their regulated, sensitive, and personal data across their data landscape. MongoDB Atlas is trusted by organizations with highly sensitive workloads because it is secure by default. We're constantly innovating with new, breakthrough technologies, like our industry-first queryable encryption, which allows customers to run rich, expressive queries on fully randomized encrypted data, improving both the development process and the user experience. MongoDB Atlas is designed to be secure by default. Try it for free. MongoDB Atlas (Pay as You Go) is now available in AWS Marketplace - try it today.

August 11, 2022
Applied

4 Common Misconceptions About Security That Hackers Hope You Don't Know

I’ve always thought hacking was harder than it looks on TV. But after two years exploring the world of ethical hacking, I’ve learned that almost anyone can hack almost anything if they have sufficient knowledge and think creatively. Practically every week, we hear about another data breach at a large organization or a new vulnerability in a widely used software package. Many of these exploits are the result of misconceptions about security, which often lead to security misconfigurations. In this post, I'll cover the four most common security misconceptions and explain how hackers leverage them to execute attacks. I'll also explain how MongoDB can help you protect your data against these attacks.

Watch Dawid Esterhuizen's MongoDB World 2022 presentation, Hack the MongoDB Planet!

1. NoSQL Injection

The first rule of secure web development is: Never trust user input. This brings me to the first common misconception: MongoDB doesn’t use SQL, so I don’t have to worry about SQL injection. I’ve spoken with countless developers who have this false sense of security. The truth is, as NoSQL databases gain an increasing share of the database market, attackers are starting to pay attention, and new exploit tools are starting to emerge, like NoSQLmap (a NoSQL version of SQLmap), which allows hackers to perform automated NoSQL injections on vulnerable applications and endpoints. But enough theory. Let’s start hacking (ethically). Shown below is an example of a vulnerable application endpoint. The application uses req.query.user and req.query.pass, which were provided by the user, as the values for name and password. The query string takes the username and password, builds a query, and checks to see if the user exists.

app.get('/', async (req, res) => {
    query = {name: req.query.user, password: req.query.pass}
    user = await client.db("secret_bank").collection("users").findOne(query)
    if (user){
        res.send('Success')
    } else {
        res.send('Failed')
    }
})

Vulnerable application endpoint.

This is a fairly basic authentication method, but one that can be easily exploited with some command line tools. If you enter the correct username and password information, you can see the presence of strings on the server. (I’ve added some extra logging so you can see what happens on the server.)

Normal login.

The username and password are confirmed on the server and everything seems normal. MongoDB queries are usually built with objects and arrays. The question from an attacker’s point of view is, can they inject those onto the server? One of the interesting and not-very-well-known features of URL query parameters is that if you add a square bracket to a parameter in the URL, it is converted into an array. As you can see from this example, I added the password, notRight, as an array on the server.

Add square brackets to the parameter to see the presence of an array.

Say you’re an attacker and you want to inject a list of two values. You just have to do it twice with the same parameter and you’ll get an array list of values on the server.

If you add square brackets to the same parameter twice in the command line, you'll get an array list of values on the server.

This would definitely be useful for an attacker seeking to inject NoSQL into the application. The next question an attacker might ask is, can they inject an object on the server? In any programming language, if you set a field name inside the square brackets you get an object. Here, you see the same with an object on the server.
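To make the mechanics concrete, here is a minimal sketch of what such a request could look like from the attacker's side. The URL, port, and parameter names are assumptions carried over from the example endpoint above, not details of any real deployment.

import requests

# Hypothetical target: the vulnerable endpoint shown above, assumed to be on localhost:3000.
base = "http://localhost:3000/"

# A square-bracketed parameter is parsed by the server's query-string parser into an
# object, so `pass[$gt]=` arrives as {"$gt": ""} and the query built on the server
# effectively becomes: {name: "admin", password: {$gt: ""}}
resp = requests.get(base + "?user=admin&pass[$gt]=")
print(resp.text)  # prints "Success" if the operator object slipped through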
If you set the field name inside square brackets, you'll see the object on the server.

So the attacker can now send objects to the server, but how do they exploit that? In an SQL injection, the most basic authentication bypass you can do is quote (") or one equals one (1=1) double dash. This query results in true every time.

Bypass login in SQL.

The SQL query above will evaluate if the user is equal to admin, which in our case is true. Then it checks if the password is equal to an empty string (due to our quote), or if one is equal to one, which is always true. The query will therefore always return the ID of the username if it exists. There are a few NoSQL options that will do the same. If your password is actually set, you can run a query that checks if it's greater than nothing, which it always is, and use that to bypass authentication.

Similar to an SQL injection, there are a few NoSQL queries hackers can run to bypass application authentication.

This is a vulnerability in the application, not the database. The application assumes that the user is sending a string for the password and user. If the application forces the password and username to a string with .toString(), no matter what input the user gives, it is not parsed. A lot of this is pretty obvious, but let's take it a step further. Shown below are a few lines of Python script that iterate over some characters, play a little with regex, and then do some blind NoSQL injection. By doing that, I'll demonstrate how we can pull some data out of the database.

import string
import requests

baseURI = "http://localhost:3000/?user=admin&pass[$regex]=^"
chars = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)
found = False
foundString = ""
count = 0
while not found:
    currentCharacter = chars[count]
    # Test whether the password starts with the characters recovered so far plus one more.
    url = baseURI + foundString + currentCharacter
    if requests.get(url).text == "Success":
        count = 0
        foundString += currentCharacter
        # Anchor the regex with "$" to check whether the full password has been recovered.
        url = baseURI + foundString + "$"
        found = requests.get(url).text == "Success"
    else:
        count += 1

Once we kick off the exploit, the script continuously iterates over every character.

This Python script performs a blind NoSQL injection to extract a password.

The main objective here is to extract the clear text password, since that was our vulnerable parameter in this case. This exploit is only about a dozen lines of code that took about 10 minutes to write. With NoSQLmap, this could be even easier. Now, I know all you Java and C# developers are out there thinking your applications are safe. You'll be surprised to know that you're still vulnerable. All it takes is a very interesting approach to query building by parsing JSON strings as queries. In 2021, a quite popular open source chat server identified this exploit, using regex blind injection. The attackers in this case pulled out the reset tokens for users. Once they were able to reset a user’s password, they escalated the user’s privileges, making them an admin. Once an admin, they could set up integration, which is a fancy word for running code on your server. And once that happens, you've been pwned. So it bears repeating: Never trust any input from any user, ever!
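Before moving on, here is a minimal sketch of the defensive version of that lookup, written in Python with PyMongo: user input is coerced to strings and matched with $eq, so injected operator objects like {"$gt": ""} are treated as literal values rather than query operators. The connection string, collection, and field names are assumptions carried over from the example above.

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string

def authenticate(req_user, req_pass):
    # Coerce whatever arrived in the request to plain strings and use $eq,
    # so injected arrays/objects cannot change the shape of the query.
    query = {
        "name": {"$eq": str(req_user)},
        "password": {"$eq": str(req_pass)},  # a real app should store and compare password hashes
    }
    return client["secret_bank"]["users"].find_one(query) is not None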
2. Social engineering attacks

People and businesses get hacked every day, and most security attacks and data breaches use social engineering. Great hackers will only let you know about their presence when they want you to know about it. This brings me to the second common misconception: I don't need authentication because the database will always be running on a private network. Authentication is the most basic security feature. Whether you're using SCRAM-SHA, LDAP, x509, or Kerberos, if you do not lock the front door, someone will gain access, and social engineering is the preferred method. All it takes is someone innocently clicking on a malicious file, which results in a macro trojan that gives the attacker access to the network. Once they're in your network, they can scan for open ports, then test if authentication is enabled for the database.

Network scan.

Once they find a database without authentication, you've been pwned.

You've been pwned.

It's that simple.

3. TLS and network traffic

Let's say you have now enabled authentication, and your data is behind a lock and a strong password (at least 16 characters using both uppercase and lowercase letters, numbers, and symbols). You have even moved credentials into a secrets vault to stop attackers from reading them out of the source code or config files. You're safe, right? Well, this brings me to the third common misconception: We don't need TLS; MongoDB uses a binary protocol. Unfortunately, with the popularity of bring-your-own-device options, employees often install software from unofficial sources without any permission or oversight from IT. Vulnerabilities in third-party software can easily lead to unauthorized access to the network by a malicious actor. Once again, they can perform a scan and check to see if authentication is set up, and it is.

Checking authentication.

At this point, they can attempt a brute force attack, but that's a lot of work, and hackers prefer easier approaches. Instead, they start sniffing the network and find multiple packets for the MongoDB database.

Sniffing the network.

They can intercept the traffic and send it to their own machine. Once they have the data on their machine, and if they know their tools well (Tshark or Wireshark would help), they can access a PCAP file. And if they output it as JSON, they can use jq to manipulate it. The example below shows BSON as hexadecimal code.

BSON dump.

The CyberChef tool has a decoder for BSON.

Decoding BSON.

No username is required. And once again, you've been pwned. The lesson here is that TLS should always be used to ensure any data that is transferred between two systems is encrypted and, even if intercepted, cannot be decrypted without the private key.

Retrieving files after they've been deleted

So now you have authentication with strong passwords, the application is safe, the network is safe, and you’re using encryption at rest. You’re totally secure, right? Nope. This brings me to the fourth common misconception: Encryption at rest will protect me against hackers. A 2014 study found that 78% of drives that were sold for reuse after decommissioning still had data on them, 23% of those still had social security numbers, and 21% had financial information. This is a problem. If you're using a keyfile for encryption at rest, hackers can retrieve those files from an old, discarded drive or one that was not securely wiped. And once again, you've been pwned. The best defense against this sort of vulnerability is to always use a KMIP (Key Management Interoperability Protocol) server for encryption at rest.

Malicious insiders

It's important to remember that if an attacker gains entry to your system while it's running, they can get ahold of your data, no matter what system you're using. This brings me to an attack vector that is one of the biggest risks for businesses right now: malicious insiders. According to a recent study, 70% of organizations are seeing an increase in insider attacks.
It can take up to 200 days to identify these attacks. Around the globe, 35% of businesses have been affected. Imagine a trusted employee accepts a job with one of your competitors and, in the process of accepting a position with them, decides to take your database with them. They don't want to use their credentials to access it for fear of being detected. They could dump the memory and get what's in the cache, but that's a lot of work. They could also dump the memory and look for the encryption keys, which is also a lot of work. Alternatively, they could just use the keyfile and dump the data. And once again, you've been pwned.

Running host with KMIP (Key Management Interoperability Protocol).

Insider attacks are some of the hardest to protect against because of the level of knowledge and access that employees have. If it seems like no one is safe, that's what I'm trying to show you. If someone makes it onto your server while it's running, all bets are off, especially if they have root access. Client-side field-level encryption, however, encrypts the data on the application server before it's sent to the database — not even your DBAs can read the data. The keys that are being used are also stored in a key management system (KMS). Client-side field-level encryption should be used for all sensitive information.

Secure by default

Let's face it, security configuration is a full-time job, and you already have one of those. MongoDB Atlas provides a database that is secure by default. At a minimum, it always requires a username and password, it always requires you to use TLS, and there are network and IP access lists that further restrict access to the database. MongoDB Atlas uses a zero-trust design that complies with all the major regulatory frameworks organizations are subject to. When you use client-side field-level encryption for applications outside of MongoDB Atlas and put your data inside of Atlas, which is secure by default, you've added yet another layer between malicious insiders and your data. Try a preview version of Queryable Encryption to encrypt data end-to-end and query on randomly encrypted data. Try MongoDB Atlas for Free Today

August 10, 2022
Applied

MongoDB and IIoT: Data Streaming With Kafka

Event streaming has become a cornerstone of the industrial internet of things (IIoT) because it allows people to unleash the power of real-time operational data to drive applications and analytics. In this article, we share how MongoDB Atlas helps you move data seamlessly from the MQTT protocol into MongoDB time series collections using the Apache Kafka MQTT source and MongoDB sink connectors deployed in a cloud environment. Read the first and second articles in this four-part series on MongoDB and IIoT. Data streaming is the second step in our framework for end-to-end data integration in the manufacturing sector. The “connect” step of this framework deals with establishing an interface for interaction with IoT devices. The methodology discussed in this blog was developed and tested using a model factory created by Fischertechnik, but these steps are applicable to any environment that uses the standard MQTT protocol. All the source code for this project, along with a detailed deployment guide, can be found on our public GitHub repository.

Figure 1. Step 2 of the end-to-end data integration framework.

The challenge of collecting data

On the shop floor, devices and components are continuously generating data related to their activity and environmental conditions at regular time intervals, typically known as time series data. In our factory model production line, there are a variety of sensors collecting data about temperature, pressure, humidity, brightness, camera positions, device/inventory status, and movements. This data is vital to monitor the health and effectiveness of factory equipment and its ability to continue to function without failure. The resulting datasets are often huge and must be efficiently stored and analyzed to detect anomalies or provide insight into overall equipment efficiency. With the advent of powerful event streaming platforms like Apache Kafka — and the wide variety of connectors for all sorts of protocols — it has become increasingly simple to handle the consolidation and export of real-time data feeds. However, dealing with such large volumes of data comes with added challenges regarding scalable storage, cost implications, and data archiving. This is where MongoDB’s time series collections come into play. Time series collections are a distinct type of MongoDB collection, optimized to efficiently store and process time series data by leveraging clustered indexes, columnar compression, and aggregation pipeline stages to facilitate real-time analytics. Learn more about time series collections on our tutorial page.

Dream team: MQTT + Kafka + MongoDB

Our recipe for collecting real-time sensor data (using the MQTT protocol) combines an MQTT source connector developed by Confluent and a native MongoDB sink connector deployed in a containerized environment.

Figure 2. The components of the data streaming methodology.

In this instance, we used a similar stack that includes Kafka Connect, a Kafka broker, and ZooKeeper deployed as containers in a single Docker Compose file. This setup can be deployed locally, on a serverless backend, or even on Confluent Cloud. In our case, we have it deployed on an AWS EC2 Linux instance. Read our tutorial on how to set up a Kafka development environment with MongoDB connectors. Here’s a brief explanation of what each container does in this environment:

ZooKeeper: Acts as a centralized controller that manages and organizes all the Kafka brokers.
Kafka broker: Allows Kafka consumers to fetch messages by topic, partition, and offset. Kafka brokers can create a Kafka cluster by sharing information between each other.

Kafka Connect: Serves as the runtime environment where you can configure connectors to ingest data into Kafka topics, making the data available for stream processing with low latency.

It is worth noting that Kafka allows any number of sink and source connectors to be created in its environment as long as there are no server resource restrictions. Once the development environment is set up, all the necessary parameters are configured in the source and sink connectors.

The source connector

The source connector allows the Kafka broker to subscribe to MQTT topics. It serves to map the MQTT topics that contain the desired data parameters to a chosen Kafka topic. For simplicity, we’ve used Confluent’s MQTT source connector, which supports any kind of MQTT broker connection (self-hosted or otherwise). We’ve also used a managed MQTT service from HiveMQ as our remote broker. In the sample source connector configuration below, we’ve streamed sensor readings from multiple MQTT topics on the factory to a single Kafka topic called sensors using a string list of MQTT topics. We added the necessary access details to the remote broker from which Kafka will consume messages from the MQTT topic and save them as JSON values. Mapping several MQTT topics to the same Kafka topic does not affect the performance of the connector.

{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "ssl://<REMOTE BROKER ADDRESS>:8883",
    "mqtt.username": "<REMOTE BROKER CLIENT>",
    "mqtt.password": "<REMOTE BROKER CLIENT PASSWORD>",
    "mqtt.topics": "i/ldr,i/bme680,i/cam",
    "kafka.topic": "sensors",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "confluent.topic.bootstrap.servers": "broker:9092",
    "confluent.license": "",
    "topic.creation.enable": true,
    "topic.creation.default.replication.factor": -1,
    "topic.creation.default.partitions": -1
  }
}

Figure 3. Sensor readings from multiple MQTT topics are streamed to a single Kafka topic.

The sink connector

While the source connector specifies the location from which data is retrieved, the sink connector specifies the destination to which data is sent. We used the MongoDB Kafka Sink Connector, which allowed us to connect to a MongoDB Atlas cluster with the right access information and choose which database and collection the streaming data was stored in. To receive the brightness readings captured in the source connector, the topics property in this connector must be set to match the name of the kafka.topic property in the former.
{ "name": "mongodb-sink", "config": { "connector.class":"com.mongodb.kafka.connect.MongoSinkConnector", "tasks.max":1, "topics":"sensors", "connection.uri":"mongodb+srv://user:password@address.mongodb.net/database?retryWrites=true&w=majority", "database":"<database name>", "collection":"<collection name>", "key.converter":"org.apache.kafka.connect.storage.StringConverter", "value.converter":"org.apache.kafka.connect.json.JsonConverter", "value.converter.schemas.enable":"false", "timeseries.timefield":"ts", "timeseries.timefield.auto.convert":"true", "timeseries.timefield.auto.convert.date.format":"yyyy-MM-dd'T'HH:mm:ss'Z'", "transforms": "RenameField,InsertTopic", "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value", "transforms.RenameField.renames": "h:humidity, p:pressure, t:temperature”, "transforms.InsertTopic.type":"org.apache.kafka.connect.transforms.InsertField$Value", "transforms.InsertTopic.topic.field":"Source" }} Figure 4. &nbsp; The converter properties instruct the connector on how to translate data from Kafka. The converter properties in Figure 4 instruct the connector on how to translate data from Kafka. This configuration also automatically creates a time series collection in the requested database using the timeseries.timefield properties, which allowed us to choose which field in the original MQTT message qualifies as the timestamp and auto-convert that to a MongoDB-compatible date/time value format. Find out more about the configurable properties of our Kafka connectors in our detailed documentation . Smooth sailing with MongoDB Atlas Once the connectors have been configured and launched, Kafka listens on the mapped topics for any change events and translates this to documents in a time series collection. As long as the Kafka environment is running and connection with the MQTT broker remains unbroken, the time series collection is updated in real time and highly compressed (often more than 90%) to accommodate the continuous influx of data. See a sample of the time series collection we created in Figure 5. Figure 5. &nbsp; Streamed data saved in a MongoDB Atlas time series collection. As the expectations of consumability vary across organizations and personas, the underlying data structure can be further tailored for different use cases by using materialized views and simple aggregations. Read the first and second articles in this four-part series on MongoDB and IIoT. Since time series data is hardly ever changed and “cools down” over time, storing it in a hot data tier can become costly. To optimize costs, MongoDB Atlas provides Online Archive , which allows you to configure filter criteria to trigger automatic offloading of “cold” data to cheaper storage while maintaining its queryability. Once you start to receive accurate real-time data from the factory floor, a world of opportunity opens up in terms of getting insights from the collected data. In our next post, we will show you how to leverage the rest of the MongoDB Atlas product suite to run analytics on operational data, including using Atlas Charts for instant seamless data visualizations (see Figure 6). Figure 6. &nbsp; A sample dashboard created from factory sensor data in Atlas Charts. All the source code used in this project, along with a detailed deployment guide, is available on our public Github repo . Feel free to clone it and play around with configuring Kafka and its connectors. 
Most of the principles discussed in this post are applicable to any device that uses MQTT as a communication protocol. To learn more, watch our webinar session to see the full reference architecture, get tips for configuring your Kafka connectors, and see a live demonstration. If you have questions about other communication protocols or would like to consult with someone from our team about your project, please contact us.

August 10, 2022
Applied

How MongoDB Protects Against Supply Chain Vulnerabilities

Software supply chain vulnerabilities became national news in late 2020 with the discovery of the SolarWinds cyberattack. A year later, as if to put an exclamation point on the issue, the Log4j security flaw was discovered. Before these incidents, cybersecurity headlines typically focused on ransomware and phishing attacks, and organizations responded by increasing defensive measures, expanding network security beyond the perimeter, and mandating security awareness training. Protecting organizations from supply chain vulnerabilities, however, is a more complex undertaking.

Download Supply Chain Security in MongoDB's Software Development Life Cycle

Transparency and testing

Few organizations have complete transparency into the software supply chain. The software supply chain includes all components — third-party dependencies, open source scripts, contractors, and other miscellaneous components and drivers — directly involved in developing an application. When dealing with a dozen or more vendors, applications, and service providers, it's hard to know all the elements that comprise your organization's software supply chain. As a backend solutions provider with open source roots, MongoDB is keenly aware of the need for security and transparency in the software supply chain. Long before supply chain vulnerabilities became national news, we implemented numerous safeguards to ensure the security of our products throughout the software development life cycle (SDLC). For example, in the planning stage, we look at our software from an attacker's perspective by trying to find ways to bypass authentication and gain unauthorized access. In the sprint stage, we conduct thousands of CPU hours of tests every week, and we run builds on thousands of compute nodes 24/7 on different combinations of every major hardware platform, operating system, and software language. And in the deployment stage, we perform hundreds of hours of automated testing to ensure correctness on every source code commit. We also invite the MongoDB Community and other third parties to submit reports of bugs found in our products, both open source and enterprise packages. Finally, we conduct periodic bug hunts with rewards for community members who contribute by improving a release.

Securing third-party software

The area that organizations have the least visibility into is perhaps the use of third-party libraries. Almost all applications use software that was written by someone else. According to some industry estimates, third-party libraries make up between 30% and 90% of typical applications. At MongoDB, all third-party libraries are evaluated and vetted by the security team before being incorporated into MongoDB products. We also use security tools to scan source code, identify known security vulnerabilities, and test against government benchmarks like Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE), as well as private-entity frameworks like the SANS Institute’s list of software vulnerabilities. If we identify a vulnerability, we use the IETF Responsible Vulnerability Disclosure Process to evaluate and mitigate the issue, communicate with our user base, and perform a postmortem assessment. Details are also published to the MongoDB Alerts page along with release notes and a description of fixes.

Using SBOMs

To encourage even more transparency within the software supply chain, we've been at the forefront of the push for a software bill of materials (SBOM, pronounced “S-Bomb”). A software bill of materials is a list of ingredients used by an application, including all the libraries and components that make up the application, whether they are third-party, commercial off-the-shelf (COTS), or open source. By providing visibility into all of the individual components and dependencies, SBOMs are seen as a critical tool for improving software supply chain security. MongoDB’s CISO, Lena Smart, recently conducted a panel discussion with a handful of cybersecurity experts on the need for SBOMs in the wake of President Joe Biden’s executive order on supply chain security. Vulnerabilities in software will always exist, and the determination of malicious actors means that some of those vulnerabilities will be exploited. MongoDB believes that secure digital experiences start with secure software development. That means having the proper controls in place, continuously probing for weaknesses, and maintaining transparency in the CI/CD pipeline. For more detailed information, download our white paper Supply Chain Security in MongoDB's Software Development Life Cycle.

August 9, 2022
Applied

4 Critical Features for a Modern Payments System

The business systems of many traditional banks rely on solutions that are decades old. These systems, which are built on outdated, inflexible relational databases, prevent traditional banks from competing with industry disruptors and those already adopting more modern approaches. Such outdated systems are ill equipped to handle one of the core offerings that customers expect from banks today — instantaneous, cashless, digital payments. The relational database management systems (RDBMSes) at the core of these applications require breaking data structures into a complex web of tables. Originally, this tabular approach was necessary to minimize memory and storage footprints, but as hardware has become cheaper and more powerful, those advantages have become less relevant. Instead, the complexity of this model results in data management and programmatic access issues. In this article, we’ll look at how a document database can simplify complexity and provide the scalability, performance, and other features required in modern business applications.

Document model

To stay competitive, many financial institutions will need to update their foundational data architecture and introduce a data platform that enables a flexible, real-time, and enriched customer experience. Without this, new apps and other services won’t be able to deliver significant value to the business. A document model eliminates the need for an intricate web of related tables. Adding new data to a document is relatively easy and quick, since it can be done without the usually lengthy reorganization that RDBMSes require. What makes a document database different from a relational database?

Intuitive data model simplifies and accelerates development work.
Flexible schema allows modification of fields at any time, without disruptive migrations.
Expressive query language and rich indexing enhance query flexibility.
Universal JSON standard lets you structure data to meet application requirements.
Distributed approach improves resiliency and enables global scalability.

With a document database, there is no need for complicated multi-level joins for business objects, such as a bill or even a complex financial derivative, which often require object-relational mapping with complex stored procedures (see the sketch below). Such stored procedures, which are written in custom languages, not only increase the cognitive load on developers but also are fiendishly hard to test. Missing automated tests present a major impediment to the adoption of agile software development methods.
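To make the contrast concrete, here is a hedged sketch of how a payment might be modeled and retrieved as a single document using PyMongo. The connection string, collection, and field names are illustrative assumptions, not a prescribed schema.

from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
payments = client["bank"]["payments"]              # hypothetical database and collection

# One business object, one document: payer, payee, and line items live together,
# so no multi-table joins or stored procedures are needed to reassemble the payment.
payments.insert_one({
    "paymentId": "PAY-000123",
    "status": "SETTLED",
    "amount": {"value": 125.50, "currency": "EUR"},
    "payer": {"iban": "DE02...", "name": "A. Customer"},
    "payee": {"iban": "FR76...", "name": "Example Merchant"},
    "items": [{"reference": "INV-88", "description": "Invoice 88", "value": 125.50}],
    "createdAt": datetime.now(timezone.utc),
})

# The whole payment comes back in a single round trip.
print(payments.find_one({"paymentId": "PAY-000123"}))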
Required features

Let’s look at four critical features that modern applications require for a successful overhaul of payment systems and how MongoDB can help address those needs.

1. Scalability

Modern applications must operate at scales that were unthinkable just a few years ago, both in transaction volume and in the number of development and test environments needed to support rapid development. Evolving consumer trends have also put higher demands on payment systems. Not only has the number of transactions increased, but the responsive experiences that customers expect have increased the query load, and data volumes are growing super-linearly. The fully transactional RDBMS model is ill suited to support this level of performance and scale. Consequently, most organizations have created a plethora of caching layers, data warehouses, and aggregation and consolidation layers that create complexity, consume valuable developer time and cognitive load, and increase costs. To work efficiently, developers also need to be able to quickly create and tear down development and test environments, and this is only possible by leveraging the cloud. Traditional RDBMSes, however, are ill suited for cloud deployment. They are very sensitive to network latency, as business objects spread across multiple tables can only be retrieved through multiple sequential queries. MongoDB provides the scalability and performance that modern applications require. MongoDB’s developer data platform also ensures that the same data is available for use with other frequent consumption patterns like time series and full-text search. Thus, there is no need for custom replication code between the operational and analytical datastores.

2. Resiliency

Many existing payment platforms were designed and architected when networking was expensive and slow. They depend on high-quality hardware with low redundancy for resilience. Not only is this approach very expensive, but redundancy alone can never match the resiliency of a distributed system. At the core of MongoDB’s developer data platform is MongoDB Atlas, the most advanced cloud database service on the market. MongoDB Atlas can run in any cloud, or even across multiple clouds, and offers 99.995% uptime. That allowance is far less downtime than is typically needed just to apply necessary security updates to a monolithic legacy database system.

3. Locality and global coverage

Modern computing demands are at once ubiquitous and highly localized. Customers expect to be able to view their cash balances wherever they are, but client secrecy and data availability rules set strict guardrails on where data can be hosted and processed. The combination of geo-sharding, replication, and edge data addresses these problems. MongoDB Atlas in combination with MongoDB for Mobile brings these powerful tools to the developer. During the global pandemic, more consumers than ever have begun using their smartphones as payment terminals. To enable these rich functions, data must be held at the edge. Developing the synchronization of that data is difficult, however, and not a differentiator for financial institutions. MongoDB for Mobile, in combination with MongoDB’s geo-sharding capability on Atlas, offloads this complexity from the developer.

4. Diverse workloads and workload isolation

As more services and opportunities are developed, the demand to use the same data for multiple purposes is growing. Although legacy systems are well suited to support functions such as double-entry accounting, when the same information has to be served up to a customer portal, the central credit engine, or an AI/ML algorithm, the limits of relational databases become obvious. These limitations have resulted in developers following what is often called “best-of-breed” practices. Under this approach, data is replicated from the transactional core to a secondary, read-only datastore based on technology that is better suited to the particular workload. Typical examples are transactional data stores being copied nightly into data lakes to be available for AI/ML modelers. The additional hardware and licensing costs for this replication are not prohibitive, but the complexity of the replication, the synchronization, and the complicated semantics introduced by batch dumps slow down development and increase both development and maintenance costs. Often, three or more different technologies are necessary to facilitate the usage patterns.
With its developer data platform, MongoDB has integrated this replication, eliminating all the complexity for developers. When a document is updated in the transactional datastore, MongoDB automatically makes it available for full-text search and time series analytics. The pace of change in the payments industry shows no signs of slowing. To stay competitive, it’s vital that you reassess your technology architecture. MongoDB Atlas is emerging as the technology of choice for many financial services firms that want to free their data, empower developers, and embrace disruption. Replacing legacy relational databases with a modern document database is a key step toward enhancing agility, controlling costs, better addressing consumer expectations, and achieving compliance with new regulations. Learn more by downloading our white paper “Modernize Your Payment Systems.”

August 8, 2022
Applied

Navigating the Future of Data Sovereignty With MongoDB

There are 2.5 quintillion bytes of data created every day, and more and more of that data is being stored in a public cloud. The rise of cloud data storage brings with it a focus on data sovereignty. Governments and industry regulatory bodies are cracking down on protecting user data. At any given time, organizations must know where their data is located, replicated, and stored — as well as how it is collected and processed, prioritizing personal data privacy all along the way.

The challenge of GDPR compliance

A PwC survey found that 92% of U.S. companies consider GDPR a top data protection priority, and rightly so, as there is pressure from both governments and citizens to protect user data. A recent Vormetric survey found that 85% of American consumers said that if significant personal consequences resulted from their information being compromised as part of a breach, they’d take their business elsewhere. Without a strong handle on data sovereignty, organizations risk millions of dollars in regulatory fees for mishandling data, loss of brand credibility, and distrust from customers.

Where to start with data sovereignty

Creating a proper structure for data sovereignty can be complex, and as big data gets bigger, so will the breadth and depth of regulations. The GDPR of today may not resemble the GDPR of tomorrow, and more laws continue to be rolled out at the federal, state, and industry levels. GDPR, while the most notable, is not the only data regulation policy that businesses must consider. California has rolled out the California Consumer Privacy Act, and numerous countries have similar laws in place to protect consumer data and regulate how data is managed, including Japan, India, Egypt, and Australia. As these regulations continue to be introduced, organizations will need to keep pace to avoid damage to their businesses. Major considerations that impact data sovereignty include:

Process: How is your company going to maintain compliance for data sovereignty with efficiency?
Infrastructure: Is a legacy infrastructure holding you back from being able to easily comply with data regulations?
Scaling: Is your data architecture agile enough to meet regulations quickly as they grow in breadth and complexity?
Cost: Are you wasting time and money by leveraging manual processes to adhere to governmental regulations and risking hefty fees attached to noncompliance?
Penalties: Are your business leaders fully aware of the costs associated with noncompliance? GDPR violations can exact up to €10 million (an average of 2% to 4% of organizational revenue) in penalties.

Learn more about strong controls for critical data privacy at our upcoming webinar on queryable encryption.

Managing data sovereignty with MongoDB Atlas

MongoDB enables you to easily comply with most data privacy regulations. MongoDB Atlas, our cloud database as a service, includes intuitive security features and privacy controls, including:

Queryable encryption: Revolutionary to the industry and currently in preview with MongoDB 6.0, queryable encryption enables encryption of sensitive data from the client side, stored as fully randomized, encrypted data on the database server side. This feature delivers the utmost in security without sacrificing performance, ensuring that even the most critical and sensitive workloads are safe and performant in a public cloud.

MongoDB Atlas global clusters: It is no longer sustainable or advantageous to build applications across geographic areas and jurisdictions. Doing so requires more infrastructure, more maintenance, more management, and, in turn, more complexity and more resources exhausted. Atlas global clusters allow organizations with distributed applications to geographically partition a fully managed deployment in a few clicks and control the distribution and placement of their data with sophisticated policies that can be easily generated and changed. This means that not only can your organization achieve compliance with regulations containing data residency requirements more easily, but you can also reduce overhead at the same time.

Virtual private clouds (VPCs): Each MongoDB Atlas project is provisioned into its own VPC, thereby isolating your data and underlying systems from other MongoDB Atlas users. This allows businesses to meet data sovereignty requirements while staying highly available within each region. Each shard of data will have multiple nodes that automatically and transparently fail over for zero downtime, all within the same jurisdiction. Being able to meet data residency requirements is another big technical challenge made simple with MongoDB Atlas. Further, businesses can connect Atlas VPCs to customer infrastructure via private networking (including private endpoints and VPC peering) for increased security.

IP whitelists: IP whitelists allow you to specify a specific range of IP addresses against which access will be granted, delivering granular control over data.

Client-side field-level encryption (CSFLE): This feature dramatically reduces the risk of unauthorized access or disclosure of sensitive data. Fields are encrypted before they leave your application, protecting them everywhere: in motion over the network, in database memory, at rest in storage and backups, and in system logs.

Dig deeper into data sovereignty

To learn more about strong controls for critical data privacy, join MongoDB’s webinar on August 24, 2022. Our experts will focus on queryable encryption, the industry’s first encrypted search scheme, and how, with MongoDB Atlas, your data is protected with preconfigured security features for authentication, authorization, encryption, and more. Register for our queryable encryption webinar on August 22, 2022.

August 3, 2022
Applied

Introducing the Ability to Independently Scale Analytics Node Tiers for MongoDB Atlas

We’re excited to announce analytics node tiers for MongoDB Atlas! Analytics node tiers provide greater control and flexibility by allowing you to customize the exact infrastructure you need for your analytics workloads.

Analytics node tiers provide control and flexibility

Until now, analytics nodes in MongoDB’s Atlas clusters have used the same cluster tier as all other nodes. However, operational and analytical workloads can vary greatly in terms of resource requirements. Analytics node tiers allow you to enhance the performance of your analytics workloads by choosing the best tier size for your needs. This means you can choose an analytics node tier larger or smaller than the operational nodes in your cluster. This added level of customization ensures you achieve the performance required for both transactional and analytical queries — without the need to over- or under-provision your entire cluster for the sake of the analytical workload. Analytics node tiers are available in both Atlas and Atlas for Government. A standard replica set contains a primary node for reads and writes and two secondary nodes that are read only. Analytics nodes provide an additional read-only node that is dedicated to analytical reads.

Choose a higher or lower analytics node tier based on your analytics needs

Teams with large user bases using their BI dashboards may want to increase their analytics node tiers above that of their operational nodes. Choosing a higher tier can be useful when you have many users or require more memory to serve analytics needs. Scaling up the entire cluster tier would be costly, but scaling up just your analytics node tiers helps optimize the cost. Teams with inconsistent needs may want to decrease their analytics node tier below that of their operational nodes. The ability to set a lower tier gives you flexibility and cost savings when you have fewer users or analytics are not your top priority. With analytics node tiers, you get more discretion and control over how you manage your analytics workloads by choosing the appropriately sized tier for your analytics needs. Get started today by setting up a new cluster or adding an analytics node tier to any existing cluster. Check out our documentation to learn more.
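As context for how those dedicated analytics nodes are reached, analytical reads are typically routed to them with read preference tags at the driver level. The snippet below is a minimal sketch using PyMongo; the connection string, database, and collection names are placeholders, and it assumes the Atlas convention of tagging analytics nodes with nodeType:ANALYTICS.

from pymongo import MongoClient
from pymongo.read_preferences import Secondary

# Placeholder Atlas connection string.
client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net/")

# Route this handle's queries to nodes tagged as analytics nodes,
# keeping analytical reads off the operational primary and secondaries.
analytics_db = client.get_database(
    "sales",  # hypothetical database name
    read_preference=Secondary(tag_sets=[{"nodeType": "ANALYTICS"}]),
)

pipeline = [{"$group": {"_id": "$region", "total": {"$sum": "$amount"}}}]
for row in analytics_db.orders.aggregate(pipeline):  # hypothetical collection
    print(row)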

August 3, 2022
Updates

Video: Canva's Lessons From Scaling MongoDB Atlas to 10 Billion Documents Across 100 Nodes

Running complex, global, and mission-critical infrastructure at scale is difficult, and anyone who has done it for any length of time usually has a few gnarly lessons to share. At MongoDB World in June 2022, we were lucky enough to feature someone who had done just that. Michael Pearson, software engineering team lead at Canva, gave a talk titled “10 Billion Documents: How Canva Scaled MongoDB to 100 Nodes.” I’ve had the pleasure of working alongside Pearson and his team for almost a year now, and his presentation focused on some of the massive challenges (and lessons) they’ve faced over the last two years as they have scaled into tens of terabytes of data and tens of billions of documents. I’m writing this blog to share a few highlights, but I’d recommend everyone check out the original talk in full:

A tricky problem

For the uninitiated, Canva is a visual communication platform that empowers its users to design anything and publish anywhere. Or, as Pearson explained in his talk, “Canva is a really simple way to create beautiful designs and presentations.” Canva’s mission is to empower the world to design, and more than 85 million people in over 190 countries use the platform every month. As you can imagine, this presents a huge data challenge — and opportunity. Canva holds more than 10 billion designs and receives up to 30,000 document requests per second. The success of the platform comes down to providing a fantastic user experience every time, and to do that they need to present their customers with the right data at the right time. “This could be a really tricky problem for a database platform, particularly for a business based in Sydney with many users on the other side of the world,” said Pearson. MongoDB Atlas supports the Document Service, which enables opening, creating, updating, or deleting any design on Canva. The Document Service is critical for every single user — if the Document Service is down, then Canva’s users can’t design. But before we get too far into things, we should probably start with why Canva started using MongoDB in the first place.

Flexibility to support rapidly changing requirements

Michael Pearson, software engineering team lead at Canva.

“Canva was launched to the world back in 2013, when MongoDB was very new to the scene,” explains Pearson. “I'm not sure if there were any other databases that would have been up for the challenge.” From those earliest days, MongoDB's flexible document model was the perfect fit for Canva's increasingly complex designs and document types. “The flexibility that MongoDB gave us in those early days was instrumental to our success. As the Canva platform evolved, we were throwing new schema and new features at it. MongoDB would just handle it.”

“Its continued innovation and problem-solving means MongoDB remains as valuable to us today as it was in 2012.”
Michael Pearson, software engineering team lead at Canva

At the same time, it was essential that Canva’s engineering team stayed focused on building Canva rather than spending time managing the data platform. With that in mind, Canva chose to run MongoDB as a service. After trying out multiple options, they went with MongoLabs and, in 2019, following MongoDB's acquisition of MongoLabs, Canva migrated onto MongoDB Atlas, running on AWS, where they remain to this day.

Ten years of relative bliss

“Before 2021, we had a very hands-off approach to how we used MongoDB,” said Pearson. “MongoDB just handled it. We didn't have to think about it at all.” That's incredible, right?
Think about it — for nearly a decade the team barely had to think about their data layer and could spend their time working on new features and making the actual product better for its millions of users around the world. It's what every developer wants. Eventually, though, Canva’s own success created certain challenges around scaling. With the stratospheric increase in growth, the load on the Document Service also continued to increase. MongoDB’s ability to scale horizontally through the use of sharding was critical to overcoming initial scale challenges, something that traditional database management systems would have struggled to achieve, said Pearson. With MongoDB, sharding distributes or partitions data across multiple machines — useful when no single machine can handle large workloads. In due course, though, some attributes of Canva’s workload presented a new challenge. Said Pearson: “We were unique in that we have one cluster with one collection with a million chunks. Our documents are fairly large, given our product has evolved over the years and we put more and more stuff into our documents.” In other words, Canva does many updates to relatively large documents, and by mid-2021 the surge in traffic was causing issues. “Our peak traffic caused three main problems: inability to run the balancer, latency issues, and a disk usage pretty much at capacity,” Pearson explained. “A really ineffective cache caused a really high write load to our cluster. This was causing downstream failures.” Pearson discussed some of the tactical solutions the company took. “Disabling the balancer immediately brought us back to service, but now we knew that there was something wrong with that cluster and we couldn’t operate without the balancer,” said Pearson. “We also noticed that the number of chunks in our cluster had skyrocketed, from around 400,000 to just over a million.”

Getting to the root of the problem

The professional services team at MongoDB discovered that “metadata updates were causing anywhere from a one-second to five-minute stalls in the cluster.” Going from 400,000 chunks to a million chunks, at the rate of a minute for each change, was not optimal. There were three things to address with that cluster: reduce the number of chunks, reduce the disk contention, and reduce the size of documents. “With regard to reducing the number of chunks, we just took any contiguous chunks on a shard and merged them unconditionally,” said Pearson. “This was tooling built in collaboration with MongoDB.” After three months of merging chunks, Canva saw massive improvements in its cluster’s performance. A failure rate during reboot of around 4% dwindled to less than 1% during maintenance operations. Further, to address latency spikes and full-disk capacity, the team formulated a six-step plan to move from network-based storage volumes to locally attached disks. This has proved a huge success. “We were able to run the balancer. Our large spikes in latency were pretty much all gone, and our disk usage was almost at zero,” Pearson said. He continued: “The key takeaway for me is that sharding is great, but it's never a silver bullet. I don't think we would have caught these issues so quickly without such a thorough incident review process and such a close working relationship with MongoDB.”

What was learned?

After presenting all of that information, Pearson closed out the presentation with a few key lessons.
For anyone interested in running infrastructure at a massive scale, they are simple and worth taking note of:

Take advantage of the flexible document model to accelerate your pace of development.
Ensure chunks are distributed uniformly across the cluster in a consistent size.
Maintain a thorough incident review process and include your trusted partners (such as MongoDB).

Reliability is an essential part of Canva’s engineering practice, and prolonged service disruptions were particularly upsetting not only for engineers but for Canva’s global users. Pearson is glad to report that Canva has seen a turnaround in the number of incidents impacting its Document Service. This has freed the document team to shift focus back to shipping features and ensuring every user has a flawless experience using Canva. Interested in joining Canva as it pursues its mission to empower the world to design? Canva is looking for a software engineer to join its Core Data team. Want to take advantage of the flexible document model to accelerate your pace of development? Learn more about MongoDB Atlas.

August 2, 2022
Applied

Tools for Implementing Zero Trust Security With MongoDB

The practice of protecting IT environments from unauthorized access used to be centered on perimeter security — the strategy of securing the perimeter but allowing unrestricted access inside it. As users became increasingly mobile and IT assets became increasingly dispersed, however, the notion of a network perimeter became obsolete. That strategy has now been replaced by the concept of zero trust. In a zero trust environment, the perimeter is assumed to have been breached. There are no trusted users, and no user or device gains trust simply because of its physical or network location. Every user, device, and connection must be continually verified and audited.

MongoDB offers several tools and features for integrating our products into a zero trust environment, including:

- Security by default
- Multiple forms of authentication
- TLS and SSL encryption
- X.509 security certificates
- Role-based access control (RBAC)
- Database authentication logs
- Encryption for data at rest, in flight, and in use

For government customers, MongoDB Atlas for Government is FedRAMP-ready.

Security by default

MongoDB Atlas clusters do not allow any connectivity to the internet when they’re first spun up. Each dedicated MongoDB Atlas cluster is deployed in a unique virtual private cloud (VPC) configured to prohibit inbound access. (Free and shared clusters do not support VPCs.) The only way to access these clusters is through the MongoDB Atlas interface. Users can configure IP access lists to allow certain addresses to attempt to authenticate to the database. Without being included on such a list, application servers are unable to access the database. Even the person who sets up the clusters needs to add their IP address to the access list. To find out more about the security measures that protect our cloud-based database, MongoDB Atlas, and the rules governing employee access, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment.

Authentication

Customers have several options for allowing users to authenticate themselves to a database, including a username and password, LDAP proxy authentication, and Kerberos authentication. All forms of MongoDB support transport layer security (TLS) and SCRAM authentication; both are turned on by default and cannot be disabled. Traffic from clients to MongoDB Atlas is authenticated and encrypted in transit, and traffic between a customer’s internally managed MongoDB nodes is also authenticated and encrypted in transit using TLS. For passwordless authentication, MongoDB offers two options for using X.509 certificates. The first option, called “easy,” auto-generates the certificates needed to authenticate database users. The “advanced” option is for organizations that already use X.509 certificates and already have a certificate management infrastructure; it can be combined with LDAPS for authorization. Access infrastructure can only be reached via bastion hosts and by users for whom senior management has approved backend access. These hosts require multifactor authentication and are configured to require SSH keys — not passwords.

Logging and auditing

MongoDB supports a wide variety of auditing strategies, making it easier to monitor your zero trust environment to ensure that it remains in force and encompasses your database. Administrators can configure MongoDB to log all actions or apply filters to capture only specific events, users, or roles.
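Those filters are ordinary MongoDB query documents evaluated against audit events. A minimal sketch, assuming auditing is enabled on MongoDB Enterprise or Atlas, might restrict the log to authentication attempts and authorization checks; the exact filter should be tailored to your own policy.

```python
import json

# An audit filter is a query document over audit events. This one keeps only
# authentication attempts and authorization checks; additional fields (for
# example, restricting authCheck events to failures) can narrow it further.
audit_filter = {"atype": {"$in": ["authenticate", "authCheck"]}}

# The resulting JSON string would be supplied through the auditLog.filter
# configuration setting (or the --auditFilter startup option).
print(json.dumps(audit_filter))
```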
Role-based auditing lets you log and report activities by specific role, such as userAdmin or dbAdmin, coupled with any roles inherited by each user, rather than having to extract activity for each individual administrator. This approach makes it easier for organizations to enforce end-to-end operational control and maintain the insight necessary for compliance and reporting. The audit log can be written to multiple destinations in a variety of formats, such as to the console and syslog (in JSON) and to a file (JSON or BSON). It can then be loaded into MongoDB and analyzed to identify relevant events.

Encryption

MongoDB also lets you encrypt data in flight, at rest, or even, with field-level encryption and queryable encryption, in use. For data in motion, all versions of MongoDB support TLS and SSL encryption. For data at rest, MongoDB supports AES-256 encryption, and it can also be configured for FIPS compliance. To encrypt data while it is in use, MongoDB offers client-side field-level encryption, which can be implemented to safeguard data even from database administrators and vendors who otherwise would have access to it. Securing data with client-side field-level encryption allows you to move to managed services in the cloud with greater confidence: the database only works with encrypted fields, and organizations control their own encryption keys rather than having the database provider manage them. This additional layer of security enforces an even more fine-grained separation of duties between those who use the database and those who administer and manage it. (A brief sketch of the client-side encryption workflow appears at the end of this article.) MongoDB Atlas exclusively offers queryable encryption, which allows customers to run rich, expressive queries on fully randomized encrypted data with efficiency, improving both the development process and the user experience, so organizations can confidently store sensitive data and meet compliance requirements.

Zero trust and MongoDB

MongoDB is well suited for use within a zero trust environment. It is secure by default and has developed industry-leading capabilities in key areas such as access, authorization, and encryption. Used together, these features help protect the database from outside attackers and from internal users who otherwise could gain an unauthorized level of access. For more detailed information about security features in MongoDB, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment.
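As a rough illustration of the client-side field-level encryption model described above (queryable encryption has its own, separate setup), the sketch below explicitly encrypts a single field with PyMongo's ClientEncryption helper. It assumes the optional pymongocrypt dependency is installed and uses a throwaway local master key purely for demonstration; a production deployment would use a KMS-managed key, and the connection string, namespaces, and field names are illustrative.

```python
import os
from bson.binary import UuidRepresentation
from bson.codec_options import CodecOptions
from pymongo import MongoClient
from pymongo.encryption import Algorithm, ClientEncryption

client = MongoClient()  # placeholder connection string

# A local 96-byte master key for illustration only; use AWS KMS, Azure Key Vault,
# GCP KMS, or KMIP in production so the provider never holds your keys.
kms_providers = {"local": {"key": os.urandom(96)}}

client_encryption = ClientEncryption(
    kms_providers,
    "encryption.__keyVault",  # namespace where data encryption keys are stored
    client,
    CodecOptions(uuid_representation=UuidRepresentation.STANDARD),
)
data_key_id = client_encryption.create_data_key("local")

# Encrypt the sensitive field on the client; the server only ever sees ciphertext.
encrypted_ssn = client_encryption.encrypt(
    "123-45-6789",
    Algorithm.AEAD_AES_256_CBC_HMAC_SHA_512_Deterministic,
    key_id=data_key_id,
)
client.hr.employees.insert_one({"name": "Ada", "ssn": encrypted_ssn})
```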

August 2, 2022
Applied

Integrating Support for Non-Volatile Memory Into WiredTiger

Intel Optane DC Persistent Memory is a non-volatile memory (NVRAM) product that resembles both storage and memory and can be used as either. Like storage, Optane NVRAM retains data after a crash or power outage. Like memory, it sits on the memory bus and can be accessed by the CPU using load/store instructions. In certain scenarios, its access latency even approaches that of dynamic random access memory (DRAM). At MongoDB, we have been thinking about how to use NVRAM in the storage engine. It can be seen as an extension of volatile DRAM, but a denser and cheaper one. In pursuit of this goal, we extended our storage engine, WiredTiger, with a volatile NVRAM cache that retains frequently used file blocks. In this article, we share our experience, describe the lessons learned, and evaluate the costs and benefits of this approach.

How to use NVRAM in the storage stack

Optane NVRAM can act as both storage and memory. The persistent memory fabric itself can be packaged as a solid-state drive (SSD), as in Optane SSDs, or as a dual-inline memory module (DIMM) that looks almost like its DRAM counterpart and lives in the same type of slot on the motherboard. Even when NVRAM is packaged as a non-volatile DIMM (NVDIMM), we can ask the operating system to present it as a block device, put a file system on top, and use it just like regular storage. Broadly speaking, there are three ways to use NVRAM:

- As regular storage
- As persistent memory
- As an extension to volatile memory

NVRAM as storage

Using NVRAM as regular storage can deliver superior throughput (compared to SSD) for read-dominant workloads, but this approach hinders write-dominant workloads because of Optane NVRAM’s limited write throughput (see the section “Performance properties of Optane NVRAM”). In any case, both the price and density of NVRAM are closer to those of DRAM than to those of SSD, so using it as storage is not recommended.

NVRAM as persistent memory

Imagine that all your data structures live in memory and that you never have to worry about saving them to files. They are just there, even after you quit your application or if it suffers a crash. Although this setup sounds simple, in practice it is still challenging to program for this model. If your system crashes and you would like to be able to find your data after restart, you need to name it. A variable name is not sufficient, because it is not unique; thus, you have to restructure your code to make sure your data has persistent identifiers. The Persistent Memory Development Kit (PMDK) provides APIs for that. A more difficult problem is surviving a crash. Your program may crash in the middle of a logical operation on a data structure. For example, suppose you are inserting an item into a linked list, and you have set the source pointer, but the crash occurs before setting the destination pointer. Upon restart, you’ll end up with corrupted data. To make matters worse, even if the logical operation had completed before the crash, the data might have been written only to CPU caches but not persisted to the memory itself. One solution is to wrap memory operations in transactions; however, programming transactional memory is notoriously difficult. Another solution is to use prepackaged data structures and APIs, but if you are looking to create your own highly optimized data structures, you must implement your own logging and recovery or other mechanisms that protect your data similarly to transactions.
NVRAM as an extension of volatile memory

Somewhat counterintuitively, this option involves disregarding the persistence of NVRAM and using it as a volatile extension of DRAM. Why would you want to do that? Suppose you have a fixed budget to buy extra memory for your system. You can either afford N GB of DRAM or about M*N GB of NVRAM — that’s because NVRAM is denser and cheaper per byte than DRAM (about three times cheaper, at the time of writing). Depending on your application, you might be better off in terms of performance/$$ if you buy additional NVRAM, as opposed to DRAM. In support of this use case, Intel provides a hardware mechanism, called Memory Mode, which treats NVRAM as “regular” system memory and uses DRAM as its cache. In other words, the hardware will do its best to place frequently used data structures in DRAM, and the rest will reside in NVRAM. The beauty of this mechanism is that it requires absolutely no changes to applications. The downside is that it may perform worse than a custom solution for certain workloads (see section “How NVCache affects performance”). Our solution is a custom-built volatile cache that resides in NVRAM.

Our architecture

Our NVRAM cache (or NVCache) is a component of the MongoDB storage engine WiredTiger. For persistent storage, WiredTiger organizes data into blocks, where keys and values are efficiently encoded and (optionally) compressed and encrypted. For fast query of its B+-tree data structure, WiredTiger transforms blocks into pages, where keys/values are decoded and indexed. It keeps pages in its DRAM page cache.

Figure 1. The architecture of NVCache.

Figure 1 shows the architecture of NVCache. NVCache is the new component, and the rest are part of WiredTiger. NVCache sits next to the block manager, which is the code responsible for reading/writing the data from/to persistent storage. Let’s look at each path in turn.

Read path: If the page cache cannot locate the searched-for data, it issues a read to the block manager (1). The block manager checks whether the block is present in NVCache (2), accesses it from NVCache if it is (3), and reads it from disk if it is not (4). The block manager then transforms the block into a page, decrypting and decompressing it if needed, and then hands it over to the page cache (5). It also notifies NVCache that it has read a new block, and NVCache then has the discretion to accept it (6). NVCache stores the blocks in the same format as they are stored on disk (e.g., compressed or encrypted if those configuration options were chosen).

Write path: The write path differs from the read path in that WiredTiger does not modify disk blocks in place. It writes updates into in-memory data structures and then converts them into new pages, which would be sent to disk either during eviction from the page cache or during a checkpoint (7). When the block manager receives a new page, it converts it into a new block, writes the block to storage (8), and informs NVCache (9). NVCache then has the discretion to accept it. Obsolete blocks are eventually freed, at which time the block manager instructs NVCache to invalidate cached copies (10). To avoid running out of space, NVCache periodically evicts less-used blocks; the eviction thread runs once a second.

Overall, this design is straightforward, but making it performant was a challenge. As expected with brand new storage or memory devices, the software must cater to their unique performance properties.
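The numbered steps of the read path can be condensed into a small conceptual sketch. This is illustrative Python, not WiredTiger's actual implementation (which is written in C); the object and function names are invented.

```python
def read_block(block_id, nvcache, disk, codec):
    """Conceptual NVCache read path; all names here are illustrative."""
    block = nvcache.lookup(block_id)        # steps 2-3: check the NVRAM-resident cache
    if block is None:
        block = disk.read(block_id)         # step 4: miss, read from persistent storage
        nvcache.offer(block_id, block)      # step 6: the cache may or may not admit it
    # Step 5: blocks are kept on disk and in NVCache in the same (possibly
    # compressed/encrypted) format, so they are decoded into a page either way.
    return codec.decode_to_page(block)
```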
In the next section, we focus on the performance properties of Optane NVRAM and explain how we adapted our cache to play along.

Performance properties of Optane NVRAM

In low-bandwidth scenarios, the access latency of Optane NVRAM approaches that of DRAM. A small read takes about 160 to 300 nanoseconds, depending on whether it is part of a sequential or a random access pattern [1]; a read from DRAM takes about 90 nanoseconds [3]. Small writes are as fast as in DRAM [3] because the data only has to reach the memory controller, where it will be automatically persisted in case of a power loss. In high-bandwidth scenarios, we usually look at throughput. Sequential read throughput is about 6 GB/s for a single NVDIMM [1,2] and scales linearly as you add more memory modules. (A single 2nd Generation Intel Xeon Scalable processor can support up to six NVDIMMs.) The write throughput is more limited: We observed up to 0.6 GB/s on a single NVDIMM [2], and others observed up to 2.3 GB/s [3]. Again, if your workload writes to different NVDIMMs, the throughput will scale with the number of modules in your system. A somewhat troublesome observation about write throughput is that it scales negatively as you add more threads. Write throughput peaks at one or two concurrent threads and then drops as more threads are added [2,3]. More importantly, we were surprised to find that, on Optane NVRAM, the presence of writers disproportionately affects the throughput of readers.

Figure 2. Read throughput in presence of concurrent writer threads.

Figure 2 shows how the throughput of eight reader threads drops as more concurrent writers are added. Although this effect is present on both DRAM and NVRAM (and certainly on other storage devices), on Optane NVRAM it is much more pronounced: performance of reads will suffer in the presence of writes. This important observation drove the design of our NVCache.

Throttling writes in caches for Optane NVRAM

For a cache to be useful, it must contain popular data. The duties of admitting fresh data and expunging the old fall on cache admission and eviction policies, respectively. Both admission and eviction generate writes, and, because writes hurt the performance of reads on Optane, admission and eviction will interfere with the performance of cache retrievals (which involve reads). Thus, we have a trade-off: On one hand, admission and eviction are crucial to making the cache useful. On the other hand, the write operations they generate will hamper the performance of data retrievals, making the cache less performant. To resolve this tension, we introduced the Overhead Bypass (OBP) metric, which is the ratio of writes (block insertions and deletions) to reads (block lookups) applied to the cache:

OBP = (blocks_inserted + blocks_deleted) / blocks_looked_up

Keeping this ratio under a threshold allows us to limit the overhead of writes. Intuitively, blocks_looked_up correlates with the benefit of using the cache, whereas the sum of blocks_inserted and blocks_deleted correlates with the cost. NVCache throttles admission and eviction to keep this ratio under 10%. (Our source code is available in the WiredTiger public GitHub repository.) Without OBP, the sheer overhead of data admission and eviction was quite substantial. To measure this overhead in its purest form, we experimented with workloads that do not stand to benefit from any extra caching, such as those with small datasets that fit into the OS buffer cache (in DRAM) or those that perform so many writes that they quickly invalidate any cached data.
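In code form, the throttle amounts to a ratio check before the cache performs any admission or eviction. The sketch below is a simplified Python rendering of the OBP idea with invented names and the 10% threshold mentioned above; the production logic lives in WiredTiger's C sources.

```python
OBP_THRESHOLD = 0.10  # keep cache-maintenance writes under 10% of lookups

class ObpThrottle:
    """Simplified Overhead Bypass accounting; names are illustrative."""

    def __init__(self):
        self.blocks_inserted = 0
        self.blocks_deleted = 0
        self.blocks_looked_up = 0

    def record_lookup(self):
        self.blocks_looked_up += 1

    def may_write(self):
        # OBP = (inserted + deleted) / looked_up; refuse admission or eviction
        # whenever the write overhead would exceed the threshold.
        lookups = max(self.blocks_looked_up, 1)
        obp = (self.blocks_inserted + self.blocks_deleted) / lookups
        return obp < OBP_THRESHOLD
```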
We found that using NVCache without the OBP feature caused these workloads to run up to two times slower than without the cache. Introducing the OBP completely eliminated the overhead and enabled the workloads that do stand to benefit from extra caching to enjoy better performance.

How NVCache affects performance

In this section, we look in detail at the performance of workloads with large datasets that stand to benefit from an additional cache.

Experimental system: The following experiments were performed on a Lenovo ThinkSystem SR360 with two Intel Xeon Gold 5218 CPUs. Each CPU has 16 hyper-threaded cores. The system has two Intel Optane persistent memory modules of 126 GB each. For storage, we used an Intel Optane P4800X SSD. We configured our system with only 32 GB of DRAM to make sure that extra memory in the form of NVRAM would be called for. We present data for the widely used YCSB benchmarks [4,5] (Table 1), although we also performed analysis with our in-house benchmarks and reached similar conclusions.

Table 1. Characteristics of YCSB benchmarks.

The following charts compare the throughput of YCSB with NVCache, with Intel Memory Mode (MM), and with OpenCAS [6] — a kernel implementation of an NVRAM-resident cache from Intel. OpenCAS was configured in write-around mode, which was the best option for limiting the harmful effect of writes [7]. Figures 3a-c show the data in configurations using 63 GB, 126 GB, and 252 GB of NVRAM, respectively.

Figure 3. Throughput of YCSB under Memory Mode (MM), OpenCAS, and NVCache relative to running with DRAM only.

We make the following three observations:

- OpenCAS delivers no performance benefit from extra NVRAM. It achieves a similar or better read hit rate than NVCache but also makes two orders of magnitude more writes to NVRAM, probably because it does not throttle the rate of admission. Writes interfere with the performance of reads, which is probably why this cache delivers no performance benefits.
- When the dataset size exceeds NVRAM capacity, NVCache provides substantially better performance than Memory Mode. As shown in Figure 3a, NVCache outperforms Memory Mode by between 30% (for YCSB-B) and 169% (for YCSB-C). Furthermore, Memory Mode hurts YCSB-A’s update throughput by about 18% relative to the DRAM-only baseline, while NVCache does not.
- Memory Mode performs comparably to NVCache when NVRAM is ample. With 252 GB of NVRAM, all datasets comfortably fit into NVRAM. Two factors explain why NVCache loses its edge over MM with ample NVRAM: (1) For NVCache, the marginal utility of additional NVRAM is small after 126 GB; the NVCache hit rate grows by about 20% when we increase NVRAM size from 63 GB to 126 GB, but only by another 5% if we increase it from 126 GB to 252 GB. (2) While MM allows the kernel buffer cache to expand into NVRAM, NVCache confines it to DRAM, which is also used by WiredTiger’s page cache. Contention for DRAM limits performance.

Overall, the benefit of a custom NVRAM cache solution is that it provides better performance than Memory Mode for large workloads. The disadvantage is that it requires new software, whereas MM can be used without any changes to applications.

Performance and cost

In this section, we explore the trade-offs of using Optane NVRAM as a volatile extension of DRAM versus just using more DRAM. To that end, we take a fixed memory budget of 96 GB and vary the fraction satisfied by DRAM and NVRAM as shown in Table 2.
Table 2. Budget of memory configurations containing both DRAM and NVRAM relative to DRAM only. We use an NVRAM-to-DRAM price ratio of 0.38 [8].

Figure 4. Performance per dollar as the amount of NVRAM increases and the amount of DRAM decreases (in YCSB workloads).

Figure 4 shows the performance of YCSB under these configurations, normalized to using 96 GB of DRAM and divided by the cost ratio in column 3. In other words, these are performance/$ numbers relative to the DRAM-only configuration. In these experiments, we used only NVCache to manage NVRAM, as it performed comparably to or better than the other options. Positive numbers mean that the performance decreased less than the memory cost. Read-only or read-mostly workloads that benefit from the NVCache experience a positive gain, as expected. Although in most cases performance predictably drops as the amount of DRAM decreases, YCSB-C in the configuration with 64 GB of NVRAM and 32 GB of DRAM performs better than it does with 96 GB of DRAM — so we decrease the system cost and improve performance in absolute terms. This occurs because, beyond 32 GB of DRAM, the utility of additional memory (and a larger page cache) is considerably smaller than the loss in performance due to a smaller NVCache. YCSB-A, whose write intensity prevents it from deriving benefit from any additional caching, suffers an overall loss in terms of performance/$: its performance drops at a steeper rate than the memory cost as we decrease the amount of DRAM. We conclude that NVRAM is a cost-effective way of reducing memory cost while balancing the impact on performance for read-dominant workloads. At the same time, even a modest presence of writes can render NVRAM unprofitable relative to DRAM.

References

[1] J. Izraelevitz, et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv:1903.05714.
[2] S. Fedorova. We Replaced an SSD with Storage Class Memory. Here is What We Learned. The MongoDB Engineering Journal.
[3] J. Yang, et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. USENIX Conference on File and Storage Technologies (FAST 2020).
[4] Yahoo! Cloud Serving Benchmark, Git repo.
[5] B.F. Cooper, et al. Benchmarking Cloud Serving Systems with YCSB. SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing.
[6] Open Cache Acceleration Software.
[7] Open CAS Linux — Admin Guide.
[8] H.T. Kassa, et al. Improving Performance of Flash Based Key-value Stores Using Storage Class Memory as a Volatile Memory Extension. USENIX Annual Technical Conference (USENIX ATC '21).

July 25, 2022
Engineering Blog

Connected Data: How IoT Will Save Healthcare and Why MongoDB Matters

Over the next decade, healthcare systems around the world will face a two-fold challenge: Delivering higher quality care while managing rising costs, and doing so for increasingly larger populations of patients. For decades, healthcare systems have operated predominantly with traditional fee-for-service models, in which reimbursements are given to providers based on services rendered. Value-based healthcare, in contrast, attempts to lower the cost of care by keeping patients healthier longer through more effective and efficient use of healthcare systems. This article — Part 2 of our series on connected healthcare data — looks at how IoT, with support from MongoDB, can help meet future healthcare challenges.

Read Part 1 of this series on connected healthcare data.

Increased demand

It's expected that by 2050, 22% of the world's population will be over 60 years old. This adds increased pressure to the goals of optimizing both patient outcomes and healthcare spend, because there are more people within healthcare systems than ever before. And, as these patients live longer, they experience more chronic conditions and, therefore, require more care. Constraints on the ability to graduate enough doctors and nurses to meet this surge of healthcare demand suggest that innovation will be needed to provide adequate supply. Additionally, many healthcare services are delivered in an exam or hospital room, where patient vitals and observations are captured, a chart is reviewed, and medications and treatments are ordered. According to a recent study from the Annals of Internal Medicine, providers spend more than 16 minutes per encounter on these tasks alone. Observation and data collection in healthcare is critical to identifying and subsequently adjusting treatment pathways; however, the process is heavily reliant on in-person visits.

How IoT will save healthcare

Global adoption of the Internet of Things (IoT) is soaring across numerous industries. In fact, healthcare is forecasted to be the second largest industry in value for IoT by 2030. IoT offers the ability to remotely monitor patients via wearables and connected devices. It provides the means to collect data beyond the patient exam or hospital room and can help providers deliver care outside of traditional, in-person methods. With this power to collect more information, more often, and do so with fewer patient encounters, IoT plays a role in solving the two-fold challenge of delivering better quality of care for increasingly larger populations of patients. A patient wearing a smartwatch, for example, may be able to stream heart rate and oxygen saturation levels during real-world activities to an electronic healthcare record, where the data can be aggregated and summarized for a physician to review, or even for a machine-learning algorithm to periodically interrogate. IoT devices can help collect more data, more often, to help providers deliver more meaningful, timely, and impactful healthcare recommendations and treatments to patients. Through this added value, IoT can further the benefits of telemedicine and promote the idea of “care anywhere,” in which healthcare is not directly tied to or dependent upon in-person encounters.

Challenges of healthcare data on the move

What challenges face developers when it comes to capturing and leveraging data from healthcare IoT devices?
Four significant capabilities top the list, and we will look at each in turn:

- Scalable and efficient storage
- Global coverage and data synchronization
- Interoperability
- Security and privacy

Scalable and efficient storage

IoT devices can produce massive volumes of continuous data. In fact, market intelligence provider International Data Corporation (IDC) predicts that IoT devices alone will produce 74.9 ZB of data by 2025, from a staggering 55.9 billion devices. A cloud-based developer data platform is critical to support these kinds of massive data volumes, which may also exhibit unpredictable peaks in workloads. Additionally, as is the case for many IoT use cases, often only the most recent data is used for analysis. In this scenario, the ability to automatically archive raw and historical data to more cost-effective storage, while still being able to query it when needed, is ideal. MongoDB’s Atlas Online Archive lets developers do just that, with minimal setup and configuration required, as shown in Figure 1.

Figure 1. MongoDB automates data tiering while keeping it queryable with Atlas Online Archive.

Not all databases are ready to deal with the massive, continuous data generated by IoT devices. Sensor data is typically collected with high frequency, which may mean high concurrency of writes, unpredictable workload peaks, and the need for dynamic scalability. Additionally, IoT data is almost by definition time-series data, meaning it typically comes with a timestamp that allows following the evolution of a parameter through time, at regular or irregular time intervals. Storing time-series data efficiently at scale can be difficult; in fact, specialized time-series databases exist to tackle exactly these workloads. And storing the data is only one side of the challenge. Another is running analytics as the data is collected, such as discovering heart anomalies and sending alerts to the patient in real time. Using a specialized time-series database solves some of these challenges but also introduces new ones:

- Learning the nuances of working with a niche platform, slowing development cycles.
- Building and maintaining ETL pipelines to move and merge data across different platforms.
- Integrating, securing, and maintaining an additional database platform, thereby increasing operational overhead.

MongoDB's time series collections let you automatically optimize your schema and deployment for high storage efficiency and low-latency queries, without the need for an additional, niche database (a short example appears at the end of this section). Additionally, MongoDB integrates time-series data with operational data and analytics capabilities in one unified environment with built-in scalability, delivering the performance your IoT applications need while simplifying your architecture.

Global coverage and data synchronization

For many IoT scenarios, users are effectively on the move: They go to work, they go shopping, and they get on planes to see the new beautiful shiny star on top of Barcelona's Sagrada Família. With all of this mobility, they might lose connectivity for a few minutes or even hours. Tracking their health effectively in real time is not just a nice feature; it may be mandatory. Using MongoDB’s Atlas Device Sync, developers can easily deploy IoT applications that seamlessly handle drops in connectivity, without missing critical write operations for the most important data workloads.
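As a concrete illustration of the time series collections mentioned above, the sketch below creates a collection for wearable heart-rate readings with PyMongo. It assumes MongoDB 5.0 or later; the connection, database, collection, and field names are illustrative.

```python
from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["healthcare"]  # placeholder connection string and database

# Create a time series collection optimized for timestamped sensor readings,
# expiring raw measurements after 30 days (tiering such as Online Archive
# could retain older history more cheaply).
db.create_collection(
    "heart_rate",
    timeseries={
        "timeField": "ts",        # required: the timestamp of each measurement
        "metaField": "device",    # per-series metadata that rarely changes
        "granularity": "seconds",
    },
    expireAfterSeconds=30 * 24 * 3600,
)

db.heart_rate.insert_one({
    "ts": datetime.now(timezone.utc),
    "device": {"patient_id": "P-1234", "model": "smartwatch-x"},
    "bpm": 72,
    "spo2": 0.98,
})
```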
Interoperability

Most IoT devices use proprietary protocols and operating systems, which seriously limit interoperability. The IoT industry advocates the use of standard communication protocols such as MQTT but, as of this writing, there is no single industry standard. Custom solutions exist that serve one single type of sensor and/or healthcare provider, but these solutions tend to suffer from interoperability challenges when interlinking data across different healthcare networks. As discussed in our first post, sharing healthcare data across different participants of the healthcare ecosystem requires standards such as JSON-based FHIR, which is key to mitigating healthcare fragmentation (see the sketch at the end of this article for how a FHIR resource maps onto a MongoDB document). Learn how we used MongoDB and MQTT to "listen" and "talk" remotely to an IoT-powered facility. Downloadable code available.

Security and privacy

Given its sensitive and personal nature (and relatively easy monetization through theft), health data is especially appealing to bad actors. The number of security incidents impacting healthcare systems is sobering. According to a report by CrowdStrike, 82% of health systems experienced some form of IoT cyberattack in 2020. With IoT proliferation on the rise, the need for the highest level of security, at both the application and database levels, becomes non-negotiable. Unsurprisingly, McKinsey cites interoperability, security, and privacy as major headwinds for IoT adoption, especially in healthcare.

How MongoDB supports IoT challenges

Here's how MongoDB helps developers bring IoT applications to market faster, capability by capability:

Scalable and efficient storage: High availability and scalability are built in via replication and native sharding. Online Archive automatically archives aged data to fully managed cloud object storage, so you can optimize cost and performance without sacrificing data accessibility. Time series collections automatically optimize your schema for high storage efficiency, low-latency queries, and real-time analytics.

Global coverage and data synchronization: MongoDB Atlas is a global, multi-cloud platform that lets your apps run anywhere in the world. Atlas Device Sync solves conflict resolution and keeps your data up to date across devices, users, and your backend, regardless of connectivity.

Interoperability: The document model provides a flexible schema and maps exactly to the objects that developers work with in their code. Different industry communication standards are being built over JSON, such as FHIR, which is a natural fit for MongoDB's document model.

Security and privacy: Thanks to MongoDB Client-Side Field Level Encryption, data is encrypted in motion, in memory, and at rest. Queryable Encryption allows running expressive queries on fully randomized encrypted data. MongoDB provides the strongest levels of data privacy and security for regulated workloads.

MongoDB Atlas takes care of the backend, removing friction from the development process and simplifying your technology stack, so you can focus on building differentiating features for your applications. Atlas is a developer data platform that supports a broad array of use cases, from operational to transactional through analytical workloads.
Atlas also offers the following:

- Ability to service more of the data lifecycle: enabling development teams to seamlessly analyze, transform, and move data while reducing reliance on batch processes or ETL jobs.
- Built on a modern data model: aligning to the way developers think and code.
- Integrated: delivering an elegant developer experience.

Figure 2. Atlas is a developer data platform built on three pillars: the document model, a unified interface for different data use cases, and a multi-cloud, enterprise-ready foundation.

MongoDB for IoT-powered healthcare apps

IoT, and specifically wearables, will play a major role in solving the two-fold challenge of delivering better quality care for increasingly larger populations of patients. The soaring adoption of wearables is accelerating the need for a developer data platform that helps software delivery teams build and manage health applications with:

- Scalable and efficient storage
- Global coverage and data synchronization
- Interoperability
- Security and privacy

MongoDB Atlas is a developer data platform designed to manage the heavy lifting for you, providing an elegant developer experience and unifying a broad range of data workloads with world-class privacy and security features. Read Part 1 of this series on connected healthcare data, and learn more about MongoDB Atlas and the healthcare industry.
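Because FHIR resources are JSON, they map naturally onto MongoDB documents, which is central to the interoperability point made earlier. The fragment below stores and queries a heavily trimmed FHIR Observation using PyMongo; the connection details, namespaces, and values are made up for illustration.

```python
from pymongo import MongoClient

observations = MongoClient()["ehr"]["observations"]  # placeholder names

# A simplified FHIR R4 Observation resource stored as-is, with no mapping layer.
observations.insert_one({
    "resourceType": "Observation",
    "status": "final",
    "code": {"coding": [{"system": "http://loinc.org", "code": "8867-4",
                         "display": "Heart rate"}]},
    "subject": {"reference": "Patient/P-1234"},
    "effectiveDateTime": "2022-07-25T09:30:00Z",
    "valueQuantity": {"value": 72, "unit": "beats/minute"},
})

# Query by the same nested structure the FHIR standard defines.
latest = (
    observations.find({"subject.reference": "Patient/P-1234",
                       "code.coding.code": "8867-4"})
    .sort("effectiveDateTime", -1)
    .limit(1)
)
```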

July 25, 2022
Applied

Mobile Edge Computing, Part 1: Delivering Data Faster with Verizon 5G Edge and MongoDB

As you’ve probably heard, 5G is changing everything, and it’s unlocking new opportunities for innovators in one sector after another. By pairing the power of 5G networks with intelligent software, customers are beginning to embrace the next generation of industry, such as powering the IoT boom, enhancing smart factory operations, and more. But how can companies that are leveraging data for daily operations start using data for innovation? In this article series, we’ll explore how the speed, throughput, reliability, and responsiveness of the Verizon network, paired with the sophistication of the next-generation MongoDB developer data platform, are poised to transform industries including manufacturing, agriculture, and automotive.

Mobile edge computing: The basics

Companies everywhere are facing a new cloud computing paradigm that combines the best experiences of hyperscaler compute and storage with the topological proximity of 5G networks. Mobile edge computing, or MEC, introduces a new mode of cloud deployment whereby enterprises can run applications — through virtual machines, containers, or Kubernetes clusters — within the 5G network itself, across both public and private networks. Before we dive in, let’s define a few key terms:

- What is mobile edge computing? The ability to deploy compute and storage closer to the end user.
- What is public mobile edge computing? Compute and storage deployed within the carrier's data centers.
- What is private mobile edge computing? Compute and storage provisioned on premises.

Verizon 5G Edge, Verizon’s mobile edge compute portfolio, takes these concepts from theoretical to practical. By creating a unified compute mesh across both public and private networks, Verizon 5G Edge produces a seamless exchange of data and stateful workloads — a simultaneous deployment of both public and private MEC best characterized as a hybrid MEC. In this article, we’ll primarily focus on public MEC deployment. Although MEC vastly increases the flexibility of data usage by both practitioners and end users, the technology is not without its challenges, including:

- Deployment: Given a dynamic fleet of devices, in an environment with 20-plus edge zones across both public and private MEC, to which edge(s) should the application be deployed?
- Orchestration: For Day 2 operations and beyond, what set of environmental changes — be it in the cloud, on the network, or on the device(s) — should trigger a change to my edge environment?
- Edge discovery: Throughout the application lifecycle, for a given connected device, which edge(s) is the optimal endpoint for connection?

Fortunately for developers, Verizon has developed a suite of network APIs tailored to answer these questions. From edge discovery and network performance to workload orchestration and network management, Verizon has drastically simplified the level of effort required to build resilient, highly available applications at the network edge, without the undifferentiated heavy lifting previously required.

Edge discovery API workflow

Using the Verizon edge discovery API, customers can let Verizon manage the complexity of maintaining the service registry as well as identifying the optimal endpoint for a given mobile device. In other words, with the edge discovery API workflow in place of self-implemented latency tests, a single request-response is all that is needed to identify the optimal endpoint, as shown in Figure 1.

Figure 1. A single request-response is used to identify the optimal endpoint.
Although this API addresses the challenges of service discovery, routing, and some advanced deployment scenarios, other challenges exist outside the scope of the underlying network APIs. In the case of stateful workloads, for example, how might you manage the underlying data generated by your device fleet? Should all of the data live at the edge, or should it be replicated to the cloud? What about replication to the other edge endpoints? Using the suite of MongoDB services coupled with Verizon 5G Edge and its network APIs, we will describe popular reference architectures for data across the hybrid edge.

Delivering data with MongoDB

Through Verizon 5G Edge, developers can now deploy the parts of their application that require low latency at the edge of 4G and 5G networks using the same APIs, tools, and functionality they use today, while seamlessly connecting back to the rest of their application and the full range of cloud services running in a cloud region. For many of these use cases, however, a persistent storage layer is required that extends beyond the native storage and database capabilities of the hyperscalers at the edge. Given the number of different edge locations where an application can be deployed and consumers can connect, ensuring that the appropriate data is available at the edge is critical. It is also important to note that where consumers are mobile (e.g., vehicles), the optimal edge location can vary. At the same time, keeping a complete copy of the entire dataset at every edge location to cater to this scenario is neither desirable nor practical, due to the potentially large volumes of data being managed and the multi-edge data synchronization challenges that would be introduced.

The Atlas solution

The solution requires an instantaneous and comprehensive overview of the dataset stored in the cloud while synchronizing only required data to dedicated edge data stores on demand. For many cases, such as digital twins, this synchronization needs to be bi-directional and may potentially include conflict resolution logic. For others, a simpler unidirectional data sync would suffice. These requirements call for a next-gen data platform with the power to simplify data management while also delivering data in an instant. MongoDB Atlas is the ideal solution for the central, cloud-based datastore. Atlas provides organizations with a fully managed, elastically scalable application data platform upon which to build modern applications. MongoDB Atlas can be deployed simultaneously across any of the three major cloud providers (Amazon Web Services, Microsoft Azure, and Google Cloud Platform) and is a natural choice to act as the central data hub in an edge or multi-edge architecture, because it enables diverse data to be ingested, persisted, and served in ways that support a growing variety of use cases. Central to MongoDB Atlas is the MongoDB database, which combines a flexible document-based model with advanced querying and indexing capabilities. Atlas is, however, more than just the MongoDB database; it includes many other components to power advanced applications with diverse data requirements, such as native search capabilities, real-time analytics, BI integration, and more. Read the next post in this blog series to explore the real-world applications and innovations being powered by mobile edge computing.
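To make the edge discovery workflow concrete, here is a rough sketch of its general shape in Python. The endpoint URL, request fields, and response format are placeholders rather than the actual Verizon 5G Edge Discovery API contract, and the MongoDB connection details are equally illustrative; consult Verizon's API documentation for the real interface.

```python
import requests
from pymongo import MongoClient

# Hypothetical edge discovery call: given a device identifier, ask the network
# which edge zone is currently optimal. URL, payload, and response shape are
# invented placeholders standing in for the real Verizon API.
resp = requests.post(
    "https://edge-discovery.example.verizon.com/v1/optimal-endpoint",
    json={"deviceId": "device-001", "serviceProfile": "low-latency-db"},
    headers={"Authorization": "Bearer <token>"},
    timeout=5,
)
edge_host = resp.json()["optimalEndpoint"]  # e.g., an edge-zone hostname

# Write latency-sensitive data to the edge datastore, while a central
# MongoDB Atlas cluster remains the system of record in the cloud region.
edge_db = MongoClient(f"mongodb://{edge_host}:27017")["telemetry"]
edge_db.readings.insert_one({"deviceId": "device-001", "speed_kph": 87.5})
```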

July 21, 2022
Applied
