
Hear From the MongoDB World 2022 Diversity Scholars

The MongoDB Diversity Scholarship program is an initiative to elevate and support members of underrepresented groups in technology across the globe. Scholars receive complimentary access to the MongoDB World developer conference in New York, on-demand access to MongoDB University to prepare for free MongoDB certification, and mentorship via an exclusive discussion group. This year at MongoDB World, our newest cohort of scholars got the opportunity to interact with company leadership at a luncheon and to share their experiences in a public panel discussion at the Community Café. Hear from some of the 2022 scholars, in their own words.

Rebecca Hayes, System Analyst at Alliance for Safety and Justice

I did an internal transition from managing Grants/Contracts to IT and just finished a data science certificate (Python, Unix/Linux, SQL) through my community college. My inspiration for pursuing STEM was wanting to understand how reality is represented in systems and how data science can be used to change the world.

What was your most impactful experience as part of the Diversity Scholarship?

Most impactful were the conversations I had with other attendees at the conference. I talked to people from all sectors who were extremely knowledgeable and passionate about shaping the future of databases. The opportunity to hear from MongoDB leaders and then understand how the vision behind the product was being implemented made me feel inspired for my future in STEM.

How has the MongoDB World conference inspired you in your learning or your career path?

MongoDB World inspired me to understand the real-world applications of databases. I left knowing what's possible with a product like MongoDB and the limits of SQL and traditional databases. After the conference, I wrote this article on Medium reflecting on what I learned.

What is your advice to colleagues pursuing STEM and/or on a similar path as you?

Embrace what makes you unique. Just because things take time doesn't mean they won't happen. When learning programming and data science, think about how your work relates to the real world and share those thoughts with others. Seek out new perspectives, stay true to yourself, and keep an open mind.

Delphine Nyaboke, Junior Software Engineer at Sendy

I am passionate about energy in general. My final year project was on solar mini-grid design and interconnection. I have a mission of being at the intersection of energy and AI. What inspired me to get into tech is the ability to solve societal problems without necessarily waiting for someone else to do it for you. This can be either in energy or by code.

What was your most impactful experience as part of the Diversity Scholarship?

My most impactful experience, apart from attending and listening in on the keynotes, was attending the breakout sessions. They had lovely topics full of learnings and inspiration, including Engineering Culture at MongoDB; Be a Community Leader; Principles of Data Modeling for MongoDB; and Be Nice, But Not Too Nice, just to mention a few.

How has the MongoDB World conference inspired you in your learning or your career path?

MongoDB World has inspired me to keep upskilling and stay competitive in handling databases, which is a key skill for a backend engineer like myself. I will continue taking advantage of the MongoDB University courses and on-demand courses available thanks to the scholarship.

What is your advice to colleagues pursuing STEM and/or on a similar path as you?
STEM is a challenging yet fun field. If you're tenacious enough, the rewards will trickle in soon enough. Get a community to be around, discuss what you're going through together, be a mentor, get a mentor, and keep pushing forward. We need like-minded individuals in our society even in this fourth industrial revolution, and we are not leaving anyone behind.

Video: Watch the panel in its entirety

Raja Adil, Student at Cal Poly SLO

Currently, I am a software engineer intern at Salesforce. I started teaching myself software development when I was a junior in high school during the COVID-19 pandemic, and from there I started doing projects and gaining as much technical experience as I could through internships. Before the pandemic, I took my first computer science class, which was taught in C#. At first, I hated it as it looked complex. Slowly, I started to enjoy it more and more, and during the pandemic I started learning Python on my own. I feel blessed to have found my path early in my career.

What was your most impactful experience as part of the Diversity Scholarship?

My most impactful experience was the network and friends I made throughout the four days I was in New York for MongoDB World. I also learned a lot about the power of MongoDB, as opposed to relational databases, which I often use in my projects.

How has the MongoDB World conference inspired you in your learning or your career path?

The MongoDB World conference was amazing and has inspired me a ton in my learning path. I definitely want to learn even more about MongoDB as a database, and in terms of a career path, I would love to intern at MongoDB as a software engineer down the line.

What is your advice to colleagues pursuing STEM and/or on a similar path as you?

My advice would be to network as much as you can and simply make cool projects that others can use.

Evans Asuboah, Stetson University

I am an international student from Ghana. I was born and raised by my dad, who is a cocoa farmer, and my mum, who is a teacher. I got into tech miraculously, because my country's educational system matches majors to students according to their final high school grades. Initially, I wanted to do medicine, but I was offered computer science. I realized that computer science could actually be the tool to help my community and also use the knowledge to help my dad on the farm.

What was your most impactful experience as part of the Diversity Scholarship?

The breakout room sessions. As scholars, we had the chance to talk to MongoDB employees, and the knowledge and experiences changed my thoughts and increased my desire to persevere. I have learned never to stop learning and not to give up.

How has the MongoDB World conference inspired you in your learning or your career path?

Meeting these amazing people, connecting with the scholars, being at the workshops, and talking to the startups at the booths has made me realize the sky is the limit. I dare to dream and believe until I see the results.

What is your advice to colleagues pursuing STEM and/or on a similar path as you?

1. Explore MongoDB; 2. You are the only one between you and your dream; 3. Take the initiative and meet people; 4. Never stop learning.

Daniel Erbynn, Drexel University

I love traveling and exploring new places.
I am originally from Ghana, and I got the opportunity to participate in a summer program after high school called Project ISWEST, which introduced me to coding and computer science through building a pong game and building an Arduino circuit to program traffic lights. This made me excited about programming and the possibilities of solving problems in the tech space.

What was your most impactful experience as part of the Diversity Scholarship?

My most impactful experience was meeting with other students and professionals in the industry, learning from them, making lifelong connections, and getting the opportunity to learn about MongoDB through the MongoDB University courses.

How has the MongoDB World conference inspired you in your learning or your career path?

This conference has inspired me to learn more about MongoDB and seek more knowledge about cloud technology.

What is your advice to colleagues pursuing STEM and/or on a similar path as you?

Don't be afraid to reach out to people you want to learn from, and create projects you are passionate about.

Build your skills with MongoDB University's free courses and certifications. Join our developer community to stay up-to-date with the latest information and announcements.

August 12, 2022

AWS and MongoDB: Partners in Reliable, Resilient Cloud Environments

Security is increasingly critical for application development. While the volume of applications developed, distributed, used, and patched over networks is rapidly expanding, so, too, are cyberattacks and data breaches, many of which happen at the web application layer. As more organizations move to the cloud, it’s imperative for customers to know who’s responsible for what when it comes to security. Understanding these roles and responsibilities is crucial for ensuring cloud workloads remain secure and available. MongoDB and AWS are working together to simplify and strengthen data security for our customers so they can focus on developing great applications and user experiences. For more information on shared responsibility, read the first blog in this series.

Shared responsibility in the cloud

Back when most IT environments lived on premises, the responsibility of securing the systems and networked devices fell squarely on the owner of the assets — usually the business owner or a managed service provider. Today, with the prevalence of cloud applications, hybrid environments, and pay-as-you-go services, it is often not clear who's responsible for what when it comes to securing those environments, services, and the data they contain. For this reason, the shared responsibility model of cloud security has emerged. Under the shared responsibility model, some security responsibilities fall on the business, some on public cloud providers, and some on the vendors of the cloud services being used.

When you deploy a MongoDB Atlas database on AWS, the database is created on infrastructure operated, managed, and controlled by AWS, from the host operating system and virtualization layer down to the physical security of the AWS data centers. MongoDB is responsible for the security and availability of the services we offer — and for everything within the scope of our responsibilities as a SaaS vendor. Customers are responsible for the security of everything above the application layer — accounts, identities, devices, and data — plus the management of the guest operating system, including updates and security patches; associated application software; and the configuration of the AWS-provided security group firewall. (See Figure 1.)

Figure 1. Shared responsibility when using MongoDB Atlas.

Strategic partners in data solutions

MongoDB Chief Information Security Officer Lena Smart delivered a keynote at AWS re:Inforce, an event where security experts offered tips and best practices for securing workloads in the cloud, and was also interviewed by theCUBE. Smart noted how MongoDB and AWS are working together to enable our joint customers to focus more on business objectives while having confidence in the cloud services and infrastructure they get from us. "You want to worry less about security so that you can focus on application development, performance, availability, business continuity, data management, and access," Smart said. "As the CISO of MongoDB, these concerns are also my top concerns as we work to better serve our global customer base. And we are very appreciative of the opportunity to do this in lockstep with AWS."

Jenny Brinkley, Director, AWS Security, agrees that customers stand to benefit through the shared responsibility model. "The shared responsibility model is a huge reason why more customers are deploying in the cloud," Brinkley said. "AWS, combined with marketplace services like MongoDB Atlas, help relieve the customer's operational burden so they can focus on driving their businesses forward."

Smart's appearance at the event is just one example of how MongoDB and AWS are working together to deliver scalable data intelligence solutions for enterprise data in the cloud, reduce risk for cloud-native tools, and enable our joint customers to achieve compliance and protect their sensitive data. Thanks to our strategic partnership, organizations around the globe and across a wide range of industries — from banking and airlines to insurance and e-commerce — are better able to discover, manage, protect, and get more value from their regulated, sensitive, and personal data across their data landscape. MongoDB Atlas is trusted by organizations with highly sensitive workloads because it is secure by default. We're constantly innovating with new, breakthrough technologies, like our industry-first queryable encryption, which allows customers to run rich, expressive queries on fully randomized encrypted data, improving both the development process and the user experience.

MongoDB Atlas is designed to be secure by default. Try it for free. MongoDB Atlas (Pay as You Go) is now available in AWS Marketplace — try it today.

August 11, 2022

New in Atlas Search: Improve Content Recommendations With “More Like This”

We’re proud to announce the release of More Like This, a key MongoDB Atlas Search feature that allows developers to easily build more relevant and engaging experiences for their end users. With the moreLikeThis operator, you can display documents that are similar to a result document. In this article, we’ll explain how it works and how you can get started using this new feature.

Content recommendation done easily

People who use travel booking apps, streaming services, and e-commerce websites are likely familiar with “Frequently Bought With,” “Similar Products,” or “You Might Also Enjoy” sections in their search experiences — in other words, content recommendation that guides them toward new or related products to buy, movies to stream, recipes to make, or news articles to read (among other things). Instead of building and tuning a recommendation engine to provide this functionality, developers can create engaging, browsable search experiences by defining a similarity threshold between documents to surface relevant documents.

How it works

Under the hood, the moreLikeThis search operator extracts the most representative terms from one or more reference documents and returns a set of similar documents. The representative terms are selected based on term frequency-inverse document frequency (TF-IDF): a term's frequency within a given document weighted by the inverse of its frequency across the corpus, so terms that are common in the document but rare in the corpus count most. Atlas Search indexes term frequency by default, which means there is less up-front configuration required when compared with other search solutions. Additionally, developers have the ability to define what constitutes sufficient similarity for their use cases, with control over variables such as the number of query terms selected and the minimum and maximum document frequency thresholds.

Use cases

An example use case might look like this: An online bookstore wants to upsell users who have reached the checkout stage with similar books. On the checkout page, the user is served with a More Like This query result in the form of an “Other Books You Might Like” section that contains an array of book titles based on multiple fields in the document (e.g., title, publisher, genre, author). More Like This can be applied to use cases like e-commerce, content management systems, application search, or anywhere you want to share more relevant content with your users to drive deeper engagement.

For more examples of how to configure More Like This, refer to our examples in the Docs. To learn how to get started with More Like This, refer to our documentation. For real-world Atlas Search implementation examples, go to our Developer Center.
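For illustration, here is a minimal mongosh sketch of what the bookstore query above could look like. It assumes a books collection with a default Atlas Search index and illustrative field names (title, genres, author); none of these names come from the article itself.

// Hedged sketch: recommend books similar to the one already in the cart.
// Collection, index, and field names are assumptions for illustration.
const inCart = db.books.findOne({ title: "Moby-Dick" })

db.books.aggregate([
  {
    $search: {
      index: "default",
      moreLikeThis: {
        // "like" accepts one or more reference documents; the operator
        // extracts their most representative terms and returns similar documents.
        like: { title: inCart.title, genres: inCart.genres, author: inCart.author }
      }
    }
  },
  { $match: { _id: { $ne: inCart._id } } },  // drop the reference book itself
  { $limit: 5 },
  { $project: { title: 1, author: 1 } }
])

The $match stage keeps the book the user already has out of the "Other Books You Might Like" list; the similarity tuning knobs mentioned above are covered in the documentation.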

August 10, 2022

4 Common Misconceptions About Security That Hackers Hope You Don't Know

I’ve always thought hacking was harder than it looks on TV. But after two years exploring the world of ethical hacking, I’ve learned that almost anyone can hack almost anything if they have sufficient knowledge and think creatively. Practically every week, we hear about another data breach at a large organization or a new vulnerability in a widely used software package. Many of these exploits are the result of misconceptions about security, which often lead to security misconfigurations. In this post, I'll cover the four most common security misconceptions and explain how hackers leverage them to execute attacks. I'll also explain how MongoDB can help you protect your data against these attacks. Watch Dawid Esterhuizen's MongoDB World 2022 presentation, Hack the MongoDB Planet!

1. NoSQL injection

The first rule of secure web development is: Never trust user input. This brings me to the first common misconception: MongoDB doesn’t use SQL, so I don’t have to worry about SQL injection. I’ve spoken with countless developers who have this false sense of security. The truth is, as NoSQL databases gain an increasing share of the database market, attackers are starting to pay attention, and new exploit tools are starting to emerge, like NoSQLmap (a NoSQL version of SQLmap), which allows hackers to perform automated NoSQL injections on vulnerable applications and endpoints.

But enough theory. Let’s start hacking (ethically). Shown below is an example of a vulnerable application endpoint. The application uses req.query.user and req.query.pass, which were provided by the user, as the values for name and password. It builds a query from the username and password and checks whether a matching user exists.

// Vulnerable application endpoint: user-controlled values are passed
// straight into the query filter.
app.get('/', async (req, res) => {
    const query = { name: req.query.user, password: req.query.pass }
    const user = await client.db("secret_bank").collection("users").findOne(query)
    if (user) {
        res.send('Success')
    } else {
        res.send('Failed')
    }
})

Vulnerable application endpoint.

This is a fairly basic authentication method, but one that can be easily exploited with some command line tools. If you enter the correct username and password, you can see them arrive as strings on the server. (I’ve added some extra logging so you can see what happens on the server.)

Normal login.

The username and password are confirmed on the server and everything seems normal. MongoDB queries are usually built with objects and arrays. The question from an attacker’s point of view is: can they inject those onto the server? One of the interesting and not-very-well-known features of URL query parameters is that if you add square brackets to a parameter in the URL, it is converted into an array. As you can see from this example, I added the password, notRight, as an array on the server.

Add square brackets to the parameter to see the presence of an array.

Say you’re an attacker and you want to inject a list of two values. You just have to do it twice with the same parameter and you’ll get an array list of values on the server.

If you add square brackets to the same parameter twice in the command line, you'll get an array list of values on the server.

This would definitely be useful for an attacker seeking to inject NoSQL into the application. The next question an attacker might ask is: can they inject an object on the server? In any programming language, if you set a field name inside the square brackets, you get an object. Here, you see the same with an object on the server.
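To make the bracket trick concrete, here is a hedged sketch (Node.js with Express, whose default query-string parser is the qs library) showing how such parameters arrive at the server. The route and port are illustrative and not part of the original demo:

// Hypothetical illustration: Express's extended query parser (qs) turns
// bracket notation in the URL into arrays and objects.
//   GET /probe?user=admin&pass[]=a&pass[]=b  ->  req.query.pass = [ 'a', 'b' ]
//   GET /probe?user=admin&pass[$gt]=         ->  req.query.pass = { '$gt': '' }
const express = require('express')
const probe = express()

probe.get('/probe', (req, res) => {
  console.log(typeof req.query.pass, req.query.pass)
  res.send('ok')
})

probe.listen(3001)

Passed straight into the findOne filter shown earlier, a value like { $gt: '' } matches any user whose stored password sorts above the empty string, which is exactly the greater-than-nothing bypass described below.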
If you set the field name inside square brackets, you'll see the object on the server.

So the attacker can now send objects to the server, but how do they exploit that? In an SQL injection, the most basic authentication bypass is the classic quote, OR 1=1, double-dash payload (" OR 1=1 --), which makes the query evaluate to true every time.

Bypass login in SQL.

The SQL query above will evaluate whether the user is equal to admin, which in our case is true. Then it checks whether the password is equal to an empty string (due to our quote), or whether one is equal to one, which is always true. The query will therefore always return the ID of the username if it exists. There are a few NoSQL options that will do the same. If your password is actually set, you can run a query that checks if it's greater than nothing, which it always is, and use that to bypass authentication.

Similar to an SQL injection, there are a few NoSQL queries hackers can run to bypass application authentication.

This is a vulnerability in the application, not the database. The application assumes that the user is sending a string for the password and user. If the application forces the password and username to a string with .toString(), no matter what input the user gives, it is not parsed. A lot of this is pretty obvious, but let's take it a step further. Shown below are a few lines of Python script that iterate over some characters, play a little with regex, and then do some blind NoSQL injection. By doing that, I'll demonstrate how we can pull some data out of the database.

import requests
import string

# Blind NoSQL injection: recover the admin password one character at a time
# by abusing a $regex match on the password field.
baseURI = "http://localhost:3000/?user=admin&pass[$regex]=^"
chars = list(string.ascii_lowercase + string.ascii_uppercase + string.digits)
count = 0
foundString = ""
found = False

while not found:
    currentCharacter = chars[count]
    url = baseURI + foundString + currentCharacter
    if requests.get(url).text == "Success":
        # The prefix matches; keep the character and test for a full match.
        count = 0
        foundString += currentCharacter
        url = baseURI + foundString + "$"
        found = requests.get(url).text == "Success"
    else:
        count += 1

Once we kick off the exploit, the script continuously iterates over every character.

This Python script performs a blind NoSQL injection to extract a password.

The main objective here is to extract the clear text password, since that was our vulnerable parameter in this case. This exploit is 13 lines of code that took about 10 minutes to write. With NoSQLmap, this could be even easier. Now, I know all you Java and C# developers are out there thinking your applications are safe. You'll be surprised to know that you're still vulnerable. All it takes is a very interesting approach to query building by parsing JSON strings as queries. In 2021, an exploit of exactly this kind, using regex blind injection, was identified in a quite popular open source chat server. The attackers in this case pulled out the reset tokens for users. Once they were able to reset a user’s password, they escalated the user’s privileges, making them an admin. Once an admin, they could set up an integration, which is a fancy word for running code on your server. And once that happens, you've been pwned. So it bears repeating: Never trust any input from any user, ever!

2. Social engineering attacks

People and businesses get hacked every day, and most security attacks and data breaches use social engineering. Great hackers will only let you know about their presence when they want you to know about it. This brings me to the second common misconception: I don't need authentication because the database will always be running on a private network. Authentication is the most basic security feature.
Whether you're using SCRAM-SHA, LDAP, x.509, or Kerberos, if you do not lock the front door, someone will gain access, and social engineering is the preferred method. All it takes is someone innocently clicking on a malicious file, which results in a macro trojan that gives the attacker access to the network. Once they're in your network, they can scan for open ports, then test whether authentication is enabled for the database.

Network scan.

Once they find a database without authentication, you've been pwned. It's that simple.

3. TLS and network traffic

Let's say you have now enabled authentication, and your data is now behind a lock and a strong password (at least 16 characters using both uppercase and lowercase letters, numbers, and symbols). You have even moved credentials into a secrets vault to stop attackers from reading them out of the source code or config files. You're safe, right? Well, this brings me to the third common misconception: We don't need TLS, MongoDB uses a binary protocol. Unfortunately, with the popularity of bring-your-own-device options, employees often install software from unofficial sources without any permission or oversight from IT. Vulnerabilities in third-party software can easily lead to unauthorized access to the network by a malicious actor. Once again, they can perform a scan and check to see if authentication is set up, and it is.

Checking authentication.

At this point, they can attempt a brute force attack, but that's a lot of work, and hackers prefer easier approaches. Instead, they start sniffing the network and find multiple packets for the MongoDB database.

Sniffing the network.

They can intercept the traffic and send it to their own machine. Once they have the data on their machine, and if they know their tools well (Tshark or Wireshark would help), they can open the capture as a PCAP file. And if they output it as JSON, they can use jq to manipulate it. The example below shows BSON as hexadecimal code.

BSON dump.

The CyberChef tool has a decoder for BSON.

Decoding BSON.

No username is required. And once again, you've been pwned. The lesson here is that TLS should always be used to ensure any data that is transferred between two systems is encrypted and, even if intercepted, cannot be decrypted without the private key.

Retrieving files after they've been deleted

So now you have authentication with strong passwords, the application is safe, the network is safe, and you're using encryption at rest. You're totally secure, right? Nope. This brings me to the fourth common misconception: Encryption at rest will protect me against hackers. A 2014 study found that 78% of drives that were sold for reuse after decommissioning and allegedly being wiped still had data on them; 23% of those still had associated social security numbers, and 21% had financial information. Needless to say, this is a problem. If encryption at rest relies on a keyfile, a hacker who gets hold of an old, discarded drive, or one that was not securely wiped, can recover the keyfile along with the data. And once again, you've been pwned. The best defense against this sort of vulnerability is to always use a key management interoperability protocol (KMIP) server for encryption at rest.

Malicious insiders

It's important to remember that if an attacker gains entry to your system while it's running, they can get ahold of your data, no matter what system you're using. This brings me to an attack vector that is one of the biggest risks for businesses right now: malicious insiders.
According to a recent study, 70% of organizations are seeing an increase in insider attacks, and it can take up to 200 days to identify one. Around the globe, 35% of businesses have been affected. Imagine a trusted employee accepts a job with one of your competitors and, in the process, decides to take your database with them. They don't want to use their credentials to access it for fear of being detected. They could dump the memory and get what's in the cache, but that's a lot of work. They could also dump the memory and look for the encryption keys, which is also a lot of work. Alternatively, they could just use the keyfile and dump the data. And once again, you've been pwned.

Running host with KMIP.

Insider attacks are some of the hardest to protect against because of the level of knowledge and access that employees have. If it seems like no one is safe, that's what I'm trying to show you. If someone makes it onto your server while it's running, all bets are off, especially if they have root access. That is where client-side field-level encryption comes in: it encrypts the data on the application server before it's sent to the database — not even your DBAs can read the data. The keys that are being used are also held in a key management system (KMS). Client-side field-level encryption should be used for all sensitive information.

Secure by default

Let's face it, security configuration is a full-time job, and you already have one of those. MongoDB Atlas provides a database that is secure by default. At a minimum, it always requires a username and password, it always requires you to use TLS, and there are network and IP access lists that further restrict access to the database. MongoDB Atlas uses a zero-trust design that complies with all the major regulatory frameworks organizations are subject to. When you use client-side field-level encryption in applications outside of MongoDB Atlas and put your data inside Atlas, which is secure by default, you've added yet another layer between malicious insiders and your data. Try a preview version of Queryable Encryption to encrypt data end-to-end and query on randomly encrypted data. Try MongoDB Atlas for free.

August 10, 2022

MongoDB and IIoT: Data Streaming With Kafka

Event streaming has become a cornerstone of the industrial internet of things (IIoT) because it allows people to unleash the power of real-time operational data to drive applications and analytics. In this article, we share how MongoDB Atlas helps you move data seamlessly from the MQTT protocol into MongoDB time series collections using the Apache Kafka MQTT source and MongoDB sink connectors deployed in a cloud environment. Read the first and second articles in this four-part series on MongoDB and IIoT.

Data streaming is the second step in our framework for end-to-end data integration in the manufacturing sector. The “connect” step of this framework deals with establishing an interface for interaction with IoT devices. The methodology discussed in this blog was developed and tested using a model factory created by Fischertechnik, but these steps are applicable to any environment that uses the standard MQTT protocol. All the source code for this project, along with a detailed deployment guide, can be found on our public GitHub repository.

Figure 1. Step 2 of the end-to-end data integration framework.

The challenge of collecting data

On the shop floor, devices and components are continuously generating data related to their activity and environmental conditions at regular time intervals, typically known as time series data. In our factory model production line, there are a variety of sensors collecting data about temperature, pressure, humidity, brightness, camera positions, device/inventory status, and movements. This data is vital to monitor the health and effectiveness of factory equipment and its ability to continue to function without failure. The resulting datasets are often huge and must be efficiently stored and analyzed to detect anomalies or provide insight into overall equipment efficiency. With the advent of powerful event streaming platforms like Apache Kafka — and the wide variety of connectors for all sorts of protocols — it has become increasingly simple to handle the consolidation and export of real-time data feeds. However, dealing with such large volumes of data comes with added challenges regarding scalable storage, cost implications, and data archiving. This is where MongoDB’s time series collections come into play. Time series collections are a distinct type of MongoDB collection, optimized to efficiently store and process time series data by leveraging clustered indexes, columnar compression, and aggregation pipeline stages to facilitate real-time analytics. Learn more about time series collections on our tutorial page.

Dream team: MQTT + Kafka + MongoDB

Our recipe for collecting real-time sensor data (using the MQTT protocol) combines an MQTT source connector developed by Confluent and a native MongoDB sink connector deployed in a containerized environment.

Figure 2. The components of the data streaming methodology.

In this instance, we used a similar stack that includes Kafka Connect, a Kafka broker, and ZooKeeper deployed as containers in a single Docker Compose file. This setup can be deployed locally, on a serverless backend, or even on Confluent Cloud. In our case, we have it deployed on an AWS EC2 Linux instance. Read our tutorial on how to set up a Kafka development environment with MongoDB connectors. Here’s a brief explanation of what each container does in this environment:

ZooKeeper: Acts as a centralized controller that manages and organizes all the Kafka brokers.
Kafka broker: Allows Kafka consumers to fetch messages by topic, partition, and offset. Kafka brokers can create a Kafka cluster by sharing information between each other.

Kafka Connect: Serves as the runtime environment where you can configure connectors to ingest data into Kafka topics, making the data available for stream processing with low latency.

It is worth noting that Kafka allows any number of sink and source connectors to be created in its environment as long as there are no server resource restrictions. Once the development environment is set up, all the necessary parameters are configured in the source and sink connectors.

The source connector

The source connector allows the Kafka broker to subscribe to MQTT topics. It serves to map the MQTT topics that contain the desired data parameters to a chosen Kafka topic. For simplicity, we’ve used Confluent’s MQTT source connector, which supports any kind of MQTT broker connection (self-hosted or otherwise). We’ve also used a managed MQTT service from HiveMQ as our remote broker. In the sample source connector configuration below, we’ve streamed sensor readings from multiple MQTT topics on the factory to a single Kafka topic called sensors using a string list of MQTT topics. We added the necessary access details to the remote broker from which Kafka will consume messages from the MQTT topics and save them as JSON values. Mapping several MQTT topics to the same Kafka topic does not affect the performance of the connector.

{
  "name": "mqtt-source",
  "config": {
    "connector.class": "io.confluent.connect.mqtt.MqttSourceConnector",
    "tasks.max": "1",
    "mqtt.server.uri": "ssl://<REMOTE BROKER ADDRESS>:8883",
    "mqtt.username": "<REMOTE BROKER CLIENT>",
    "mqtt.password": "<REMOTE BROKER CLIENT PASSWORD>",
    "mqtt.topics": "i/ldr,i/bme680,i/cam",
    "kafka.topic": "sensors",
    "value.converter": "org.apache.kafka.connect.converters.ByteArrayConverter",
    "confluent.topic.bootstrap.servers": "broker:9092",
    "confluent.license": "",
    "topic.creation.enable": true,
    "topic.creation.default.replication.factor": -1,
    "topic.creation.default.partitions": -1
  }
}

Figure 3. Sensor readings from multiple MQTT topics are streamed to a single Kafka topic.

The sink connector

While the source connector specifies the location from which data is retrieved, the sink connector specifies the destination to which data is sent. We used the MongoDB Kafka Sink Connector, which allowed us to connect to a MongoDB Atlas cluster with the right access information and choose which database and collection the streaming data was stored in. To receive the brightness readings captured in the source connector, the topics property in this connector must be set to match the name of the kafka.topic property in the former.

{
  "name": "mongodb-sink",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "tasks.max": 1,
    "topics": "sensors",
    "connection.uri": "mongodb+srv://user:password@address.mongodb.net/database?retryWrites=true&w=majority",
    "database": "<database name>",
    "collection": "<collection name>",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "false",
    "timeseries.timefield": "ts",
    "timeseries.timefield.auto.convert": "true",
    "timeseries.timefield.auto.convert.date.format": "yyyy-MM-dd'T'HH:mm:ss'Z'",
    "transforms": "RenameField,InsertTopic",
    "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.RenameField.renames": "h:humidity,p:pressure,t:temperature",
    "transforms.InsertTopic.type": "org.apache.kafka.connect.transforms.InsertField$Value",
    "transforms.InsertTopic.topic.field": "Source"
  }
}

Figure 4. The converter properties instruct the connector on how to translate data from Kafka.

The converter properties in Figure 4 instruct the connector on how to translate data from Kafka. This configuration also automatically creates a time series collection in the requested database using the timeseries.timefield properties, which allowed us to choose which field in the original MQTT message qualifies as the timestamp and auto-convert that to a MongoDB-compatible date/time value format. Find out more about the configurable properties of our Kafka connectors in our detailed documentation.

Smooth sailing with MongoDB Atlas

Once the connectors have been configured and launched, Kafka listens on the mapped topics for any change events and translates this to documents in a time series collection. As long as the Kafka environment is running and the connection with the MQTT broker remains unbroken, the time series collection is updated in real time and highly compressed (often more than 90%) to accommodate the continuous influx of data. See a sample of the time series collection we created in Figure 5.

Figure 5. Streamed data saved in a MongoDB Atlas time series collection.

As the expectations of consumability vary across organizations and personas, the underlying data structure can be further tailored for different use cases by using materialized views and simple aggregations. Read the first and second articles in this four-part series on MongoDB and IIoT.

Since time series data is hardly ever changed and “cools down” over time, storing it in a hot data tier can become costly. To optimize costs, MongoDB Atlas provides Online Archive, which allows you to configure filter criteria to trigger automatic offloading of “cold” data to cheaper storage while maintaining its queryability. Once you start to receive accurate real-time data from the factory floor, a world of opportunity opens up in terms of getting insights from the collected data. In our next post, we will show you how to leverage the rest of the MongoDB Atlas product suite to run analytics on operational data, including using Atlas Charts for instant, seamless data visualizations (see Figure 6).

Figure 6. A sample dashboard created from factory sensor data in Atlas Charts.

All the source code used in this project, along with a detailed deployment guide, is available on our public GitHub repo. Feel free to clone it and play around with configuring Kafka and its connectors.
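To give a flavor of the "simple aggregations" mentioned above, here is a hedged mongosh sketch that rolls the raw sensor stream up into per-minute averages and persists the result as a materialized view. The database, collection, and field names (factory, sensors, ts, temperature, humidity) are assumptions for illustration and do not come from the project repository:

// Hedged sketch: per-minute roll-up of streamed sensor data into a
// materialized view that dashboards can query cheaply.
db.getSiblingDB("factory").sensors.aggregate([
  { $group: {
      _id: { $dateTrunc: { date: "$ts", unit: "minute" } },
      avgTemperature: { $avg: "$temperature" },
      avgHumidity: { $avg: "$humidity" }
  } },
  // $merge upserts into the target collection, so the view stays current
  // each time the pipeline is re-run.
  { $merge: { into: "sensors_by_minute", whenMatched: "replace" } }
])

Re-running the pipeline on a schedule (for example, from an Atlas scheduled trigger) keeps the roll-up fresh without touching the raw time series collection.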
Most of the principles discussed in this post are applicable to any device that uses MQTT as a communication protocol. To learn more, watch our webinar session to see the full reference architecture, get tips for configuring your Kafka connectors, and see a live demonstration. If you have questions about other communication protocols or would like to consult with someone from our team about your project, please contact us.

August 10, 2022

How MongoDB Protects Against Supply Chain Vulnerabilities

Software supply chain vulnerabilities became national news in late 2020 with the discovery of the SolarWinds cyberattack. A year later, as if to put an exclamation point on the issue, the Log4j security flaw was discovered. Before these incidents, cybersecurity headlines typically focused on ransomware and phishing attacks, and organizations responded by increasing defensive measures, expanding network security beyond the perimeter, and mandating security awareness training. Protecting organizations from supply chain vulnerabilities, however, is a more complex undertaking. Download Supply Chain Security in MongoDB's Software Development Life Cycle

Transparency and testing

Few organizations have complete transparency into the software supply chain. The software supply chain includes all components — third-party dependencies, open source scripts, contractors, and other miscellaneous components and drivers — directly involved in developing an application. When dealing with a dozen or more vendors, applications, and service providers, it's hard to know all the elements that comprise your organization's software supply chain. As a backend solutions provider with open source roots, MongoDB is keenly aware of the need for security and transparency in the software supply chain. Long before supply chain vulnerabilities became national news, we implemented numerous safeguards to ensure the security of our products throughout the software development life cycle (SDLC). For example, in the planning stage, we look at our software from an attacker's perspective by trying to find ways to bypass authentication and gain unauthorized access. In the sprint stage, we conduct thousands of CPU hours of tests every week, and we run builds on thousands of compute nodes 24/7 on different combinations of every major hardware platform, operating system, and software language. And in the deployment stage, we perform hundreds of hours of automated testing to ensure correctness on every source code commit. We also invite the MongoDB Community and other third parties to submit reports of bugs found in our products, both open source and enterprise packages. Finally, we conduct periodic bug hunts with rewards for community members who contribute by improving a release.

Securing third-party software

The area that organizations have the least visibility into is perhaps the use of third-party libraries. Almost all applications use software that was written by someone else. According to some industry estimates, third-party libraries make up between 30% and 90% of typical applications. At MongoDB, all third-party libraries are evaluated and vetted by the security team before being incorporated into MongoDB products. We also use security tools to scan source code, identify known security vulnerabilities, and test against government benchmarks like Common Vulnerabilities and Exposures (CVE) and Common Weakness Enumeration (CWE), as well as private-entity frameworks like the SANS Institute's list of software vulnerabilities. If we identify a vulnerability, we use the IETF Responsible Vulnerability Disclosure Process to evaluate and mitigate the issue, communicate with our user base, and perform a postmortem assessment. Details are also published to the MongoDB Alerts page along with release notes and a description of fixes.

Using SBOMs

To encourage even more transparency within the software supply chain, we've been at the forefront of the push for a software bill of materials (SBOM, pronounced "S-Bomb").
A software bill of materials is a list of ingredients used by an application, including all the libraries and components that make up that application, whether they are third-party, commercial off-the-shelf (COTS), or open source. By providing visibility into all of the individual components and dependencies, SBOMs are seen as a critical tool for improving software supply chain security. MongoDB's CISO, Lena Smart, recently conducted a panel discussion with a handful of cybersecurity experts on the need for SBOMs in the wake of President Joe Biden's executive order on supply chain security.

Vulnerabilities in software will always exist, and the determination of malicious actors means that some of those vulnerabilities will be exploited. MongoDB believes that secure digital experiences start with secure software development. That means having the proper controls in place, continuously probing for weaknesses, and maintaining transparency in the CI/CD pipeline. For more detailed information, download our white paper Supply Chain Security in MongoDB's Software Development Life Cycle.

August 9, 2022

4 Critical Features for a Modern Payments System

The business systems of many traditional banks rely on solutions that are decades old. These systems, which are built on outdated, inflexible relational databases, prevent traditional banks from competing with industry disruptors and those already adopting more modern approaches. Such outdated systems are ill-equipped to handle one of the core offerings that customers expect from banks today — instantaneous, cashless, digital payments. The relational database management systems (RDBMSes) at the core of these applications require breaking data structures into a complex web of tables. Originally, this tabular approach was necessary to minimize memory and storage footprints. But as hardware has become cheaper and more powerful, these advantages have become less relevant. Instead, the complexity of this model results in data management and programmatic access issues. In this article, we'll look at how a document database can simplify complexity and provide the scalability, performance, and other features required in modern business applications.

Document model

To stay competitive, many financial institutions will need to update their foundational data architecture and introduce a data platform that enables a flexible, real-time, and enriched customer experience. Without this, new apps and other services won't be able to deliver significant value to the business. A document model eliminates the need for an intricate web of related tables. Adding new data to a document is relatively easy and quick since it can be done without the usually lengthy reorganization that RDBMSes require. (An illustrative example of a payment modeled as a single document is sketched below.)

What makes a document database different from a relational database?

Intuitive data model simplifies and accelerates development work.
Flexible schema allows modification of fields at any time, without disruptive migrations.
Expressive query language and rich indexing enhance query flexibility.
Universal JSON standard lets you structure data to meet application requirements.
Distributed approach improves resiliency and enables global scalability.

With a document database, there is no need for complicated multi-level joins for business objects, such as a bill or even a complex financial derivative, which often require object-relational mapping with complex stored procedures. Such stored procedures, which are written in custom languages, not only increase the cognitive load on developers but also are fiendishly hard to test. Missing automated tests present a major impediment to the adoption of agile software development methods.

Required features

Let's look at four critical features that modern applications require for a successful overhaul of payment systems and how MongoDB can help address those needs.

1. Scalability

Modern applications must operate at scales that were unthinkable just a few years ago, in relation both to transaction volume and to the number of development and test environments needed to support rapid development. Evolving consumer trends have also put higher demands on payment systems. Not only has the number of transactions increased, but the responsive experiences that customers expect have increased the query load, and data volumes are growing super-linearly. The fully transactional RDBMS model is ill suited to support this level of performance and scale. Consequently, most organizations have created a plethora of caching layers, data warehouses, and aggregation and consolidation layers that create complexity, consume valuable developer time and cognitive load, and increase costs.
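Before moving on to development and test environments, here is the illustrative payment document referenced above. It is a hedged sketch only; the field names and values are assumptions and not a MongoDB or industry-standard payment schema (the IBANs are the usual documentation examples):

// Hedged sketch: one payment, with its parties, amount, and lifecycle events,
// held in a single document instead of rows spread across many tables.
{
  "_id": "pay_000123",
  "status": "SETTLED",
  "amount": { "value": NumberDecimal("149.95"), "currency": "USD" },
  "debtor": { "name": "Acme Retail Ltd", "iban": "DE89370400440532013000" },
  "creditor": { "name": "Jane Doe", "iban": "GB29NWBK60161331926819" },
  "events": [
    { "type": "INITIATED", "at": ISODate("2022-08-08T09:21:04Z") },
    { "type": "SCREENED", "at": ISODate("2022-08-08T09:21:05Z"), "result": "PASS" },
    { "type": "SETTLED", "at": ISODate("2022-08-08T09:21:06Z") }
  ]
}

Adding a new attribute, say a field for a new regulatory reference, means writing it into new documents as they arrive, with no table migration and no multi-level join needed to reassemble the payment later.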
To work efficiently, developers also need to be able to quickly create and tear down development and test environments, and this is only possible by leveraging the cloud. Traditional RDBMSes, however, are ill suited for cloud deployment. They are very sensitive to network latency, as business objects spread across multiple tables can only be retrieved through multiple sequential queries. MongoDB provides the scalability and performance that modern applications require. MongoDB's developer data platform also ensures that the same data is available for use with other frequent consumption patterns like time series and full-text search. Thus, there is no need for custom replication code between the operational and analytical datastore.

2. Resiliency

Many existing payment platforms were designed and architected when networking was expensive and slow. They depend on high-quality hardware with low redundancy for resilience. Not only is this approach very expensive, but it can never match the resiliency of a distributed system. At the core of MongoDB's developer data platform is MongoDB Atlas, the most advanced cloud database service on the market. MongoDB Atlas can run in any cloud, or even across multiple clouds, and offers 99.995% uptime. This downtime is far less than is typically needed just to apply necessary security updates to a monolithic legacy database system.

3. Locality and global coverage

Modern computing demands are at once ubiquitous and highly localized. Customers expect to be able to view their cash balances wherever they are, but client secrecy and data availability rules set strict guardrails on where data can be hosted and processed. The combination of geo-sharding, replication, and edge data addresses these problems. MongoDB Atlas in combination with MongoDB for Mobile brings these powerful tools to the developer. During the global pandemic, more consumers than ever have begun using their smartphones as payment terminals. To enable these rich functions, data must be held at the edge. Developing the synchronization of that data is difficult, however, and not a differentiator for financial institutions. MongoDB for Mobile, together with MongoDB's geo-sharding capability on Atlas cloud, offloads this complexity from the developer.

4. Diverse workloads and workload isolation

As more services and opportunities are developed, the demand to use the same data for multiple purposes is growing. Although legacy systems are well suited to support functions such as double-entry accounting, when the same information has to be served up to a customer portal, the central credit engine, or an AI/ML algorithm, the limits of relational databases become obvious. These limitations have resulted in developers following what is often called "best-of-breed" practices. Under this approach, data is replicated from the transactional core to a secondary, read-only datastore based on technology that is better suited to the particular workload. Typical examples are transactional data stores being copied nightly into data lakes to be available for AI/ML modelers. The additional hardware and licensing costs for this replication are not prohibitive, but the complexity of the replication and synchronization, and the complicated semantics introduced by batch dumps, slow down development and increase both development and maintenance costs. Often, three or more different technologies are necessary to facilitate the usage patterns.
With its developer data platform, MongoDB has integrated this replication, eliminating all the complexity for the developers. When a document is updated in the transactional datastore, MongoDB will automatically make it available for full-text search and time series analytics. The pace of change in the payments industry shows no signs of slowing. To stay competitive, it’s vital that you reassess your technology architecture. MongoDB Atlas is emerging as the technology of choice for many financial services firms that want to free their data, empower developers, and embrace disruption. Replacing legacy relational databases with a modern document database is a key step toward enhancing agility, controlling costs, better addressing consumer expectations, and achieving compliance with new regulations. Learn more by downloading our white paper “Modernize Your Payment Systems."

August 8, 2022

Navigating the Future of Data Sovereignty With MongoDB

There are 2.5 quintillion bytes of data created every day, and more and more of that data is being stored in a public cloud. The rise of cloud data storage brings with it a focus on data sovereignty. Governments and industry regulatory bodies are cracking down on protecting user data. At any given time, organizations must know where their data is located, replicated, and stored — as well as how it is collected and processed, prioritizing personal data privacy all along the way.

The challenge of GDPR compliance

A PwC survey found that 92% of U.S. companies consider GDPR a top data protection priority, and rightly so, as there is pressure from both governments and citizens to protect user data. A recent Vormetric survey found that 85% of American consumers said that if significant personal consequences resulted from their information being compromised as part of a breach, they'd take their business elsewhere. Without a strong handle on data sovereignty, organizations are risking millions of dollars in regulatory fees for mishandling data, loss of brand credibility, and distrust from customers.

Where to start with data sovereignty

Creating a proper structure for data sovereignty can be complex, and as big data gets bigger, so will the breadth and depth of regulations. The GDPR of today may not resemble the GDPR of tomorrow, and more laws continue to be rolled out at the federal, state, and industry levels. GDPR, while the most notable, is not the only data regulation policy that businesses must consider. California has rolled out the California Consumer Privacy Act, and numerous countries have similar laws in place to protect consumer data and regulate how data is managed, including Japan, India, Egypt, and Australia. As these regulations continue to be introduced, organizations will need to keep pace to avoid damage to their businesses. Major considerations that impact data sovereignty include:

Process: How is your company going to maintain compliance for data sovereignty with efficiency?
Infrastructure: Is a legacy infrastructure holding you back from being able to easily comply with data regulations?
Scaling: Is your data architecture agile enough to meet regulations quickly as they grow in breadth and complexity?
Cost: Are you wasting time and money by leveraging manual processes to adhere to governmental regulations and risking hefty fees attached to noncompliance?
Penalties: Are your business leaders fully aware of the costs associated with noncompliance? GDPR violations can exact up to €10 million (an average of 2% to 4% of organizational revenue) in penalties.

Learn more about strong controls for critical data privacy at our upcoming webinar on queryable encryption.

Managing data sovereignty with MongoDB Atlas

MongoDB enables you to easily comply with most data privacy regulations. MongoDB Atlas, our cloud database as a service, includes intuitive security features and privacy controls, including:

Queryable encryption: Revolutionary to the industry and currently in preview with MongoDB 6.0, queryable encryption enables encryption of sensitive data from the client side, stored as fully randomized, encrypted data on the database server side. This feature delivers the utmost in security without sacrificing performance, ensuring that even the most critical and sensitive workloads are safe and performant in a public cloud.

MongoDB Atlas global clusters: It is no longer sustainable or advantageous to build applications across geographic areas and jurisdictions.
Doing so requires more infrastructure, more maintenance, more management, and, in turn, more complexity and more resources exhausted. Atlas global clusters allow organizations with distributed applications to geographically partition a fully managed deployment in a few clicks and control the distribution and placement of their data with sophisticated policies that can be easily generated and changed. This means that not only can your organization achieve compliance with regulations containing data residency requirements more easily, but you can also reduce overhead at the same time.

Virtual private clouds (VPCs): Each MongoDB Atlas project is provisioned into its own VPC, thereby isolating your data and underlying systems from other MongoDB Atlas users. This allows businesses to meet data sovereignty requirements while staying highly available within each region. Each shard of data will have multiple nodes that automatically and transparently fail over for zero downtime, all within the same jurisdiction. Being able to meet data residency requirements is another big technical challenge made simple with MongoDB Atlas. Further, businesses can connect Atlas VPCs to customer infrastructure via private networking (including private endpoints and VPC peering) for increased security.

IP whitelists: IP whitelists allow you to specify a range of IP addresses against which access will be granted, delivering granular control over data.

Client-side field-level encryption (CSFLE): This feature dramatically reduces the risk of unauthorized access or disclosure of sensitive data. Fields are encrypted before they leave your application, protecting them everywhere: in motion over the network, in database memory, at rest in storage and backups, and in system logs.

Dig deeper into data sovereignty

To learn more about strong controls for critical data privacy, join MongoDB's webinar on August 24, 2022. Our experts will focus on queryable encryption, the industry's first encrypted search scheme, and how, with MongoDB Atlas, your data is protected with preconfigured security features for authentication, authorization, encryption, and more. Register for our queryable encryption webinar on August 22, 2022.

August 3, 2022

Introducing the Ability to Independently Scale Analytics Node Tiers for MongoDB Atlas

We’re excited to announce analytics node tiers for MongoDB Atlas! Analytics node tiers provide greater control and flexibility by allowing you to customize the exact infrastructure you need for your analytics workloads.

Analytics node tiers provide control and flexibility

Until now, analytics nodes in MongoDB’s Atlas clusters have used the same cluster tier as all other nodes. However, operational and analytical workloads can vary greatly in terms of resource requirements. Analytics node tiers allow you to enhance the performance of your analytics workloads by choosing the best tier size for your needs. This means you can choose an analytics node tier larger or smaller than the operational nodes in your cluster. This added level of customization ensures you achieve the performance required for both transactional and analytical queries — without the need to over- or under-provision your entire cluster for the sake of the analytical workload. Analytics node tiers are available in both Atlas and Atlas for Government.

A standard replica set contains a primary node for reads and writes and two secondary nodes that are read only. Analytics nodes provide an additional read-only node that is dedicated to analytical reads.

Choose a higher or lower analytics node tier based on your analytics needs

Teams with large user bases using their BI dashboards may want to increase their analytics node tiers above that of their operational nodes. Choosing a higher tier can be useful when you have many users or require more memory to serve analytics needs. Scaling up the entire cluster tier would be costly, but scaling up just your analytics node tiers helps optimize the cost.

Teams with inconsistent needs may want to decrease their analytics node tier below that of their operational nodes. The ability to set a lower tier gives you flexibility and cost savings when you have fewer users or analytics are not your top priority.

With analytics node tiers, you get more discretion and control over how you manage your analytics workloads by choosing the appropriately sized tier for your analytics needs. Get started today by setting up a new cluster or adding an analytics node tier to any existing cluster. Check out our documentation to learn more.
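For context on how applications typically direct traffic to these nodes: Atlas tags analytics nodes with a replica set tag, so analytical reads can be routed through the connection string without changing query code. The sketch below uses PyMongo with a placeholder SRV URI; confirm the exact tag usage against the Atlas documentation for your cluster before relying on it.

```python
from pymongo import MongoClient

# Placeholder Atlas SRV URI. readPreferenceTags routes secondary reads to the
# dedicated analytics node(s), which Atlas tags with nodeType:ANALYTICS.
analytics_uri = (
    "mongodb+srv://user:pass@cluster0.example.mongodb.net/"
    "?readPreference=secondary"
    "&readPreferenceTags=nodeType:ANALYTICS"
)

operational_client = MongoClient("mongodb+srv://user:pass@cluster0.example.mongodb.net/")
analytics_client = MongoClient(analytics_uri)

# Transactional traffic goes to the primary; long-running aggregations go to
# the analytics node tier and do not compete with operational reads.
operational_client.shop.orders.insert_one({"sku": "A-1", "qty": 2})

pipeline = [{"$group": {"_id": "$sku", "total": {"$sum": "$qty"}}}]
for row in analytics_client.shop.orders.aggregate(pipeline):
    print(row)
```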

August 3, 2022

Video: Canva's Lessons From Scaling MongoDB Atlas to 10 Billion Documents Across 100 Nodes

Running complex, global, and mission-critical infrastructure at scale is difficult, and anyone who has done it for any length of time usually has a few gnarly lessons to share. At MongoDB World in June 2022, we were lucky enough to feature someone who had done just that. Michael Pearson, software engineering team lead at Canva, gave a talk titled “10 Billion Documents: How Canva Scaled MongoDB to 100 Nodes.” I’ve had the pleasure of working alongside Pearson and his team for almost a year now, and his presentation focused on some of the massive challenges (and lessons) they’ve faced over the last two years as they have scaled into tens of terabytes of data and tens of billions of documents. I’m writing this blog to give a few highlights, but I’d recommend everyone check out the original talk in full.

A tricky problem

For the uninitiated, Canva is a visual communication platform that empowers its users to design anything and publish anywhere. Or, as Pearson explained in his talk, “Canva is a really simple way to create beautiful designs and presentations.” Canva’s mission is to empower the world to design, and more than 85 million people in over 190 countries use the platform every month. As you can imagine, this presents a huge data challenge — and opportunity. Canva holds more than 10 billion designs and receives up to 30,000 document requests per second. The success of the platform comes down to providing a fantastic user experience every time, and to do that they need to present their customers with the right data at the right time. “This could be a really tricky problem for a database platform, particularly for a business based in Sydney with many users on the other side of the world,” said Pearson.

MongoDB Atlas supports the Document Service, which enables opening, creating, updating, or deleting any design on Canva. The Document Service is critical for every single user — if the Document Service is down, then Canva’s users can’t design. But before we get too far into things, we should probably start with why Canva started using MongoDB in the first place.

Flexibility to support rapidly changing requirements

“Canva was launched to the world back in 2013, when MongoDB was very new to the scene,” explains Pearson. “I'm not sure if there were any other databases that would have been up for the challenge.” From those earliest days, MongoDB's flexible document model was the perfect fit for Canva's increasingly complex designs and document types. “The flexibility that MongoDB gave us in those early days was instrumental to our success. As the Canva platform evolved, we were throwing new schema and new features at it. MongoDB would just handle it.”

“Its continued innovation and problem-solving means MongoDB remains as valuable to us today as it was in 2012.”
Michael Pearson, software engineering team lead at Canva

At the same time, it was essential that Canva’s engineering team stayed focused on building Canva rather than spending time managing the data platform. With that in mind, Canva chose to run MongoDB as a service. After trying out multiple options, they went with MongoLabs and, in 2019, following MongoDB's acquisition of MongoLabs, Canva migrated onto MongoDB Atlas, running on AWS, where they remain to this day.

Ten years of relative bliss

“Before 2021, we had a very hands-off approach to how we used MongoDB,” said Pearson. “MongoDB just handled it. We didn't have to think about it at all." That's incredible, right?
Think about it — for nearly a decade the team barely had to think about their data layer and could spend their time working on new features and making the actual product better for its millions of users around the world. It's what every developer wants.

Eventually, though, Canva’s own success created certain challenges around scaling. With the stratospheric increase in growth, the load on the Document Service also continued to increase. MongoDB’s ability to scale horizontally through the use of sharding was critical to overcoming initial scale challenges, something that traditional database management systems would have struggled to achieve, said Pearson. With sharding, data is distributed or partitioned across multiple machines — useful when no single machine can handle a large workload.

In due course, though, some attributes of Canva’s workload presented a new challenge. Said Pearson: “We were unique in that we have one cluster with one collection with a million chunks. Our documents are fairly large, given our product has evolved over the years and we put more and more stuff into our documents.” In other words, Canva makes many updates to relatively large documents, and by mid-2021 the surge in traffic was causing issues. “Our peak traffic caused three main problems: inability to run the balancer, latency issues, and a disk usage pretty much at capacity,” Pearson explained. “A really ineffective cache caused a really high write load to our cluster. This was causing downstream failures."

Pearson discussed some of the tactical solutions the company took. “Disabling the balancer immediately brought us back to service, but now we knew that there was something wrong with that cluster and we couldn’t operate without the balancer,” said Pearson. “We also noticed that the number of chunks in our cluster had skyrocketed, from around 400,000 to just over a million.”

Getting to the root of the problem

The professional services team at MongoDB discovered that “metadata updates were causing anywhere from one-second to five-minute stalls in the cluster.” Going from 400,000 chunks to a million chunks, with each change stalling the cluster for up to a minute, was not optimal. There were three things to address with that cluster: reduce the number of chunks, reduce that disk contention, and reduce the size of documents. “With regard to reducing the number of chunks, we just took any contiguous chunks on a shard and merged them unconditionally,” said Pearson. “This was tooling built in collaboration with MongoDB.”

After three months of merging chunks, Canva saw massive improvements in its cluster’s performance. The failure rate during maintenance reboots dwindled from around 4% to less than 1%. Further, to address latency spikes and full-disk capacity, the team formulated a six-step plan to move from network-based storage volumes to locally attached disks. This has proved a huge success. “We were able to run the balancer. Our large spikes in latency were pretty much all gone, and our disk usage was almost at zero,” Pearson said. He continued: "The key takeaway for me is that sharding is great, but it's never a silver bullet. I don't think we would have caught these issues so quickly without such a thorough incident review process and such a close working relationship with MongoDB."

What was learned?

After presenting all of that information, Pearson closed out the presentation with a few key lessons.
For anyone interested in running infrastructure at a massive scale, they are simple and worth taking note of:

Take advantage of the flexible document model to accelerate your pace of development.
Ensure chunks are distributed uniformly across the cluster in a consistent size.
Maintain a thorough incident review process and include your trusted partners (such as MongoDB).

Reliability is an essential part of Canva’s engineering practice, and prolonged service disruptions were particularly upsetting not only for engineers but for Canva’s global users. Pearson is glad to report that Canva has seen a turnaround in the number of incidents impacting its Document Service. This has freed the document team to shift focus back to shipping features and ensuring every user has a flawless experience using Canva.

Interested in joining Canva as it pursues its mission to empower the world to design? Canva is looking for a software engineer to join its Core Data team. Want to take advantage of the flexible document model to accelerate your pace of development? Learn more about MongoDB Atlas.
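For readers curious what the "merge contiguous chunks unconditionally" idea described above can look like in practice, here is a hedged PyMongo sketch run against a mongos. It is not the tooling Canva and MongoDB built: the namespace is hypothetical, and the config.chunks schema differs across server versions (keyed by "ns" before 5.0, by collection "uuid" afterwards), so treat it as an illustration of the approach rather than a maintenance script.

```python
from pymongo import MongoClient

# Connect to a mongos for a sharded cluster (placeholder URI).
client = MongoClient("mongodb://mongos.example.net:27017")

ns = "canva_like_app.documents"  # hypothetical sharded namespace

# Inspect chunk metadata. Adjust the filter for your server version: pre-5.0
# config.chunks uses "ns"; 5.0+ keys chunks by the collection's "uuid".
chunks = list(
    client.config.chunks.find({"ns": ns}).sort([("shard", 1), ("min", 1)])
)

# Merge each pair of contiguous chunks that live on the same shard.
for left, right in zip(chunks, chunks[1:]):
    if left["shard"] == right["shard"] and left["max"] == right["min"]:
        client.admin.command("mergeChunks", ns, bounds=[left["min"], right["max"]])
```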

August 2, 2022

Tools for Implementing Zero Trust Security With MongoDB

The practice of protecting IT environments from unauthorized access used to be centered on perimeter security — the strategy of securing the perimeter but allowing unrestricted access inside it. As users became increasingly mobile and IT assets became increasingly dispersed, however, the notion of a network perimeter became obsolete. That strategy has now been replaced by the concept of zero trust. In a zero trust environment, the perimeter is assumed to have been breached. There are no trusted users, and no user or device gains trust simply because of its physical or network location. Every user, device, and connection must be continually verified and audited.

MongoDB offers several tools and features for integrating our products into a zero trust environment, including:

Security by default
Multiple forms of authentication
TLS and SSL encryption
X.509 security certificates
Role-based access control (RBAC)
Database authentication logs
Encryption for data at rest, in flight, and in use

For government customers, MongoDB Atlas for Government is FedRAMP-ready.

Security by default

MongoDB Atlas clusters do not allow for any connectivity to the internet when they’re first spun up. Each dedicated MongoDB Atlas cluster is deployed in a unique virtual private cloud (VPC) configured to prohibit inbound access. (Free and shared clusters do not support VPCs.) The only way to access these clusters is through the MongoDB Atlas interface. Users can configure IP access lists to allow certain addresses to attempt to authenticate to the database. Without being included on such a list, application servers are unable to access the database. Even the person who sets up the clusters needs to add their IP address to the access list. To find out more about the security measures that protect our cloud-based database, MongoDB Atlas, and the rules governing employee access, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment.

Authentication

Customers have several options to allow users to authenticate themselves to a database, including a username and password, LDAP proxy authentication, and Kerberos authentication. All forms of MongoDB support transport layer security (TLS) and SCRAM authentication. They are turned on by default and cannot be disabled. Traffic from clients to MongoDB Atlas is authenticated and encrypted in transit, and traffic between a customer’s internally managed MongoDB nodes is also authenticated and encrypted in transit using TLS.

For passwordless authentication, MongoDB offers two different options to support the use of X.509 certificates. The first option, called “easy,” auto-generates the certificates needed to authenticate database users. The “advanced” option is for organizations already using X.509 certificates and that already have a certificate management infrastructure. The advanced option can be combined with LDAPS for authorization. Access infrastructure can only be reached via bastion hosts and by users for whom senior management has approved backend access. These hosts require multifactor authentication and are configured to require SSH keys — not passwords.

Logging and auditing

MongoDB supports a wide variety of auditing strategies, making it easier to monitor your zero trust environment to ensure that it remains in force and encompasses your database. Administrators can configure MongoDB to log all actions or apply filters to capture only specific events, users, or roles.
Role-based auditing lets you log and report activities by specific role, such as userAdmin or dbAdmin, coupled with any roles inherited by each user, rather than having to extract activity for each individual administrator. This approach makes it easier for organizations to enforce end-to-end operational control and maintain the insight necessary for compliance and reporting. The audit log can be written to multiple destinations in a variety of formats, such as to the console and syslog (in JSON) and to a file (JSON or BSON). It can then be loaded into MongoDB and analyzed to identify relevant events.

Encryption

MongoDB also lets you encrypt data in flight, at rest, or even, with field-level encryption and queryable encryption, in use. For data in motion, all versions of MongoDB support TLS and SSL encryption. For data at rest, MongoDB supports AES-256 encryption, and it can also be configured for FIPS compliance.

To encrypt data when it is in use, MongoDB offers client-side field-level encryption, which can be implemented to safeguard data even from database administrators and vendors who otherwise would have access to it. Securing data with client-side field-level encryption allows you to move to managed services in the cloud with greater confidence. The database only works with encrypted fields, and organizations control their own encryption keys, rather than having the database provider manage them. This additional layer of security enforces an even more fine-grained separation of duties between those who use the database and those who administer and manage it.

MongoDB Atlas exclusively offers queryable encryption, which allows customers to run rich expressive queries on fully randomized encrypted data with efficiency, improving both the development process and user experience. Organizations are able to protect their business by confidently storing sensitive data and meeting compliance requirements.

Zero trust and MongoDB

MongoDB is optimally suited for use within a zero trust environment. MongoDB is secure by default and has developed industry-leading capabilities in key areas such as access, authorization, and encryption. Used together, these features help protect the database from outside attackers and internal users who otherwise could gain an unauthorized level of access. For more detailed information about security features in MongoDB, read our whitepaper, MongoDB: Capabilities for Use in a Zero Trust Environment.
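As one concrete example of the passwordless X.509 option mentioned above, here is a minimal sketch of connecting with TLS and a client certificate using PyMongo. The hostname and certificate path are placeholders; the “easy” and “advanced” certificate options described above are configured on the Atlas side, not in driver code.

```python
from pymongo import MongoClient

# Placeholder cluster address and certificate path (client cert + private key).
client = MongoClient(
    "mongodb+srv://cluster0.example.mongodb.net/",
    tls=True,
    tlsCertificateKeyFile="/etc/ssl/mongo-client.pem",
    authMechanism="MONGODB-X509",
)

# The user is identified by the certificate subject, so no password is sent.
print(client.admin.command("connectionStatus")["authInfo"])
```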

August 2, 2022

Integrating Support for Non-Volatile Memory Into WiredTiger

Intel Optane DC Persistent Memory is a non-volatile memory (NVRAM) product that resembles both storage and memory and can be used as either. Like storage, Optane NVRAM retains data after a crash or power outage. Like memory, it sits on the memory bus and can be accessed by the CPU using load/store instructions. In certain scenarios, its access latency even approaches that of dynamic random access memory (DRAM).

At MongoDB, we have been thinking about how to use NVRAM in the storage engine. It can be seen as an extension of volatile DRAM, but a denser and cheaper one. In pursuit of this goal, we extended our storage engine, WiredTiger, with a volatile NVRAM cache that retains frequently used file blocks. In this article, we share our experience, describe the lessons learned, and evaluate the costs and benefits of this approach.

How to use NVRAM in the storage stack

Optane NVRAM can act as both storage and memory. The persistent memory fabric itself can be packaged as a solid-state drive (SSD), as in Optane SSDs, or as a dual-inline memory module (DIMM) that looks almost like its DRAM counterpart and lives in the same type of slot on the motherboard. Even when NVRAM is packaged as a non-volatile DIMM (NVDIMM), we can ask the operating system to present it as a block device, put a file system on top, and use it just like regular storage. Broadly speaking, there are three ways to use NVRAM:

As regular storage
As persistent memory
As an extension to volatile memory

NVRAM as storage

Using NVRAM as regular storage can deliver superior throughput (compared to SSD) for read-dominant workloads, but this approach hinders write-dominant workloads because of Optane NVRAM’s limited write throughput (see the section “Performance properties of Optane NVRAM”). In any case, both the price and density of NVRAM are closer to those of DRAM than to those of SSD, so using it as storage is not recommended.

NVRAM as persistent memory

Imagine that all your data structures live in memory and that you never have to worry about saving them to files. They are just there, even after you quit your application or if it suffers a crash. Although this setup sounds simple, in practice, it is still challenging to program for this model. If your system crashes and you would like to be able to find your data after restart, you need to name it. A variable name is not sufficient, because it is not unique; thus, you have to restructure your code to make sure your data has persistent identifiers. The Persistent Memory Development Kit (PMDK) provides APIs for that.

A more difficult problem is surviving a crash. Your program may crash in the middle of a logical operation on a data structure. For example, suppose you are inserting an item into a linked list, and you have set the source pointer, but the crash occurs before setting the destination pointer. Upon restart, you’ll end up with corrupted data. To make matters worse, even if the logical operation had completed before the crash, the data might have been written only to CPU caches but not persisted to the memory itself. One solution is to wrap memory operations in transactions; however, programming transactional memory is notoriously difficult. Another solution is to use prepackaged data structures and APIs, but if you are looking to create your own highly optimized data structures, you must implement your own logging and recovery or other mechanisms that protect your data similarly to transactions.
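To make the “present it as a block device, put a file system on top” option above concrete, the sketch below memory-maps a file on a filesystem assumed to be backed by an NVDIMM (the DAX mount path is hypothetical). It also illustrates the persistence caveat just described: an ordinary store may sit in CPU caches or kernel buffers until it is explicitly flushed. This is a Python-level illustration, not PMDK.

```python
import mmap
import os

# Hypothetical file on a filesystem backed by an NVDIMM (e.g., an fsdax mount).
path = "/mnt/pmem0/example.bin"
size = 4096

fd = os.open(path, os.O_CREAT | os.O_RDWR)
os.ftruncate(fd, size)

with mmap.mmap(fd, size) as buf:
    buf[0:5] = b"hello"   # a plain store: may still be sitting in CPU caches
    buf.flush(0, size)    # msync: ask the OS to make the range durable
os.close(fd)
```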
NVRAM as an extension of volatile memory

Somewhat counterintuitively, this option involves disregarding the persistence of NVRAM and using it as a volatile extension of DRAM. Why would you want to do that? Suppose you have a fixed budget to buy extra memory for your system. You can either afford N GB of DRAM or about M*N GB of NVRAM — that’s because NVRAM is denser and cheaper per byte than DRAM (about three times cheaper, at the time of writing). Depending on your application, you might be better off in terms of performance/$$ if you buy additional NVRAM, as opposed to DRAM.

In support of this use case, Intel provides a hardware mechanism, called Memory Mode, which treats NVRAM as “regular” system memory and uses DRAM as its cache. In other words, the hardware will do its best to place frequently used data structures in DRAM, and the rest will reside in NVRAM. The beauty of this mechanism is that it requires absolutely no changes to applications. The downside is that it may perform worse than a custom solution for certain workloads (see section “How NVCache affects performance”). Our solution is a custom-built volatile cache that resides in NVRAM.

Our architecture

Our NVRAM cache (or NVCache) is a component of the MongoDB storage engine WiredTiger. For persistent storage, WiredTiger organizes data into blocks, where keys and values are efficiently encoded and (optionally) compressed and encrypted. For fast query of its B+-tree data structure, WiredTiger transforms blocks into pages, where keys/values are decoded and indexed. It keeps pages in its DRAM page cache.

Figure 1. The architecture of NVCache.

Figure 1 shows the architecture of NVCache. NVCache is the new component, and the rest are part of WiredTiger. NVCache sits next to the block manager, which is the code responsible for reading/writing the data from/to persistent storage. Let’s look at each path in turn.

Read path: If the page cache cannot locate the searched-for data, it issues a read to the block manager (1). The block manager checks whether the block is present in NVCache (2), accesses it from NVCache if it is (3), and reads it from disk if it is not (4). The block manager then transforms the block into a page, decrypting and decompressing it if needed, and then hands it over to the page cache (5). It also notifies NVCache that it has read a new block, and NVCache then has the discretion to accept it (6). NVCache stores the blocks in the same format as they are stored on disk (e.g., compressed or encrypted if those configuration options were chosen).

Write path: The write path differs from the read path in that WiredTiger does not modify disk blocks in place. It writes updates into in-memory data structures and then converts them into new pages, which would be sent to disk either during eviction from the page cache or during a checkpoint (7). When the block manager receives a new page, it converts it into a new block, writes the block to storage (8), and informs NVCache (9). NVCache then has the discretion to accept it. Obsolete blocks are eventually freed, at which time the block manager instructs NVCache to invalidate cached copies (10). To avoid running out of space, NVCache periodically evicts less-used blocks. The eviction thread runs once a second.

Overall, this design is straightforward, but making it performant was a challenge. As expected with brand new storage or memory devices, the software must cater to their unique performance properties.
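The numbered read path above can be summarized in a few lines of code. The following is an illustrative Python sketch of the control flow only, not WiredTiger’s actual C implementation; the class and function names are invented for the illustration.

```python
class NVCache:
    """Toy stand-in for the NVRAM-resident block cache (illustration only)."""
    def __init__(self):
        self.blocks = {}

    def get(self, block_id):
        return self.blocks.get(block_id)

    def maybe_admit(self, block_id, raw):
        # The real cache throttles admission (see the OBP discussion below).
        self.blocks[block_id] = raw


def read_block(block_id, nvcache, disk, page_cache):
    """Control flow of the read path, steps (1)-(6) above."""
    raw = nvcache.get(block_id)              # (2)(3) check NVCache first
    if raw is None:
        raw = disk[block_id]                 # (4) read the block from storage
        nvcache.maybe_admit(block_id, raw)   # (6) NVCache may accept the block
    page = raw.decode()                      # (5) stand-in for decrypt/decompress/index
    page_cache[block_id] = page
    return page


disk = {7: b"key:value"}                     # pretend on-disk block
page_cache = {}
print(read_block(7, NVCache(), disk, page_cache))
```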
In the next section, we focus on these performance features and explain how we adapted our cache to play along.

Performance properties of Optane NVRAM

In low-bandwidth scenarios, the access latency of Optane NVRAM approaches that of DRAM. A small read takes about 160 to 300 nanoseconds, depending on whether it is part of a sequential or a random access pattern [1]; a read from DRAM takes about 90 nanoseconds [3]. Small writes are as fast as in DRAM [3] because the data only has to reach the memory controller, where it will be automatically persisted in case of a power loss.

In high-bandwidth scenarios, we usually look at throughput. Sequential read throughput is about 6 GB/s for a single NVDIMM [1,2] and scales linearly as you add more memory modules. (A single 2nd Generation Intel Xeon Scalable processor can support up to six NVDIMMs.) The write throughput is more limited: We observed up to 0.6 GB/s on a single NVDIMM [2], and others observed up to 2.3 GB/s [3]. Again, if your workload writes to different NVDIMMs, the throughput will scale with the number of modules in your system.

A somewhat troublesome observation about write throughput is that it scales negatively as you add more threads. Write throughput peaks at one or two concurrent threads and then drops as more threads are added [2,3]. More importantly, we were surprised to find that, on Optane NVRAM, the presence of writers disproportionately affects the throughput of readers.

Figure 2. Read throughput in presence of concurrent writer threads.

Figure 2 shows how the throughput of eight reader threads drops as more concurrent writers are added. Although this effect is present on both DRAM and NVRAM (and certainly on other storage devices), on Optane NVRAM, the effect is much more pronounced. Performance of reads will suffer in the presence of writes. This important observation drove the design of our NVCache.

Throttling writes in caches for Optane NVRAM

For a cache to be useful, it must contain popular data. The duties of admitting fresh data and expunging the old fall on cache admission and eviction policies, respectively. Both admission and eviction generate writes, and, because writes hurt the performance of reads on Optane, admission and eviction will interfere with the performance of cache retrievals (which involve reads). Thus, we have a trade-off: On one hand, admission and eviction are crucial to making the cache useful. On the other hand, the write operations that they generate will hamper the performance of data retrievals, thereby making the cache less performant.

To resolve this tension, we introduced the Overhead Bypass (OBP) metric, which is a ratio of reads and writes applied to the cache. Keeping this ratio under a threshold allowed us to limit the overhead of writes:

OBP = (blocks_inserted + blocks_deleted) / blocks_looked_up

Intuitively, blocks_looked_up correlates with the benefit of using the cache, whereas the sum of blocks_inserted and blocks_deleted correlates with the cost. NVCache throttles admission and eviction to keep this ratio under 10%. (Our source code is available in the WiredTiger public GitHub repository.)

Without OBP, the sheer overhead of data admission and eviction was quite substantial. To measure this overhead in its purest form, we experimented with workloads that do not stand to benefit from any extra caching, such as those with small datasets that fit into the OS buffer cache (in DRAM) or those that perform so many writes that they quickly invalidate any cached data.
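Before turning to those measurements, here is a small sketch of how an admission decision driven by the OBP ratio might look. The counters and the 10% threshold come from the description above; the code itself is an invented illustration, not the WiredTiger implementation.

```python
class OBPThrottle:
    """Allow cache admission/eviction only while writes stay cheap relative to reads."""

    def __init__(self, threshold=0.10):
        self.threshold = threshold
        self.inserted = 0
        self.deleted = 0
        self.looked_up = 0

    def record_lookup(self):
        self.looked_up += 1

    def record_insert(self):
        self.inserted += 1

    def record_delete(self):
        self.deleted += 1

    def obp(self):
        # OBP = (blocks_inserted + blocks_deleted) / blocks_looked_up
        return (self.inserted + self.deleted) / max(self.looked_up, 1)

    def may_write(self):
        # Admission and eviction generate NVRAM writes, which hurt concurrent
        # reads, so permit them only while OBP stays under the threshold.
        return self.obp() < self.threshold


throttle = OBPThrottle()
for _ in range(100):
    throttle.record_lookup()
if throttle.may_write():
    throttle.record_insert()   # admit a block only when OBP permits
```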
We found that using NVCache without the OBP feature caused these workloads to run up to two times slower than without the cache. Introducing the OBP completely eliminated the overhead and enabled the workloads that stand to benefit from extra caching to enjoy better performance.

How NVCache affects performance

In this section, we’ll look in detail at the performance of workloads with large datasets that stand to benefit from an additional cache.

Experimental system: The following experiments were performed on a Lenovo ThinkSystem SR360 with two Intel Xeon Gold 5218 CPUs. Each CPU has 16 hyper-threaded cores. The system has two Intel Optane persistent memory modules of 126 GB each. For storage, we used an Intel Optane P4800X SSD. We configured our system with only 32 GB of DRAM to make sure that extra memory in the form of NVRAM would be called for. We present the data with widely used YCSB benchmarks [4,5] (Table 1), although we also performed analysis with our in-house benchmarks and reached similar conclusions.

Table 1. Characteristics of YCSB benchmarks

The following charts compare the throughput of YCSB with NVCache, with Intel Memory Mode (MM), and with OpenCAS [6] — a kernel implementation of NVRAM-resident cache from Intel. OpenCAS was configured in the write-around mode, which was the best option for limiting the harmful effect of writes [7]. Figures 3a-c show the data in configurations using 63 GB, 126 GB, and 252 GB of NVRAM, respectively.

Figure 3. Throughput of YCSB under Memory Mode (MM), OpenCAS, and NVCache relative to running with DRAM only.

We make the following three observations:

OpenCAS cache delivers no performance benefit from extra NVRAM. It achieves a similar or better read hit rate than NVCache but also makes two orders of magnitude more writes to NVRAM, probably because it does not throttle the rate of admission. Writes interfere with the performance of reads, which is probably why this cache delivers no performance benefits.

When the dataset size exceeds NVRAM capacity, NVCache provides substantially better performance than Memory Mode. As shown in Figure 3a, NVCache outperforms the memory mode by between 30% (for YCSB-B) and 169% (for YCSB-C). Furthermore, the memory mode hurts YCSB-A’s update throughput by about 18% relative to the DRAM-only baseline, while NVCache does not.

Memory mode performs comparably to NVCache when NVRAM is ample. With 252 GB of NVRAM, all datasets comfortably fit into the NVRAM. Two factors explain why NVCache loses its edge over MM with ample NVRAM: (1) For NVCache, the marginal utility of additional NVRAM is small after 126 GB; NVCache hit rate grows by about 20% when we increase NVRAM size from 63 GB to 126 GB, but only by another 5% if we increase it from 126 GB to 252 GB. (2) While MM allows the kernel buffer cache to expand into NVRAM, NVCache confines it to DRAM, which is also used by WiredTiger’s page cache. Contention for DRAM limits performance.

Overall, the benefit of a custom NVRAM cache solution is that it provides better performance than the Memory Mode for large workloads. The disadvantage is that it requires new software, whereas MM can be used without any changes to applications.

Performance and cost

In this section, we explore the trade-offs of using Optane NVRAM as a volatile extension of DRAM versus just using more DRAM. To that end, we take a fixed memory budget of 96 GB and vary the fraction satisfied by DRAM and NVRAM, as shown in Table 2.
Table 2. Budget of memory configurations containing both DRAM and NVRAM relative to DRAM-only. We use an NVRAM-to-DRAM price ratio of 0.38 [8].

Figure 4. Performance per dollar as the amount of NVRAM increases and the amount of DRAM decreases (in YCSB workloads).

Figure 4 shows the performance of YCSB under these configurations normalized to using 96 GB DRAM and divided by the cost ratio in column 3. In other words, these are performance/$ numbers relative to the DRAM-only configuration. In these experiments, we used only NVCache to manage NVRAM, as it performed comparably to or better than other options. Positive numbers mean that the performance decreased less than the memory cost.

Read-only or read-mostly workloads that benefit from the NVCache experience a positive gain, as expected. Although in most cases performance predictably drops as the amount of DRAM decreases, YCSB-C in the configuration with 64 GB NVRAM and 32 GB DRAM performs better than it does with 96 GB DRAM — so we decrease the system cost and improve performance in absolute terms. This occurs because beyond 32 GB of DRAM, the utility of additional memory (and a larger page cache) is considerably smaller than the loss in performance due to a smaller NVCache. YCSB-A, whose write intensity prevents it from deriving benefits of any additional caching, suffers an overall loss in terms of performance/$. Its performance drops at a steeper rate than the memory cost as we decrease the amount of DRAM.

We conclude that NVRAM is a cost-effective method of reducing memory cost while balancing the impact on performance for read-dominant workloads. At the same time, even a modest presence of writes can render NVRAM unprofitable relative to DRAM.

References

1. J. Izraelevitz, et al. Basic Performance Measurements of the Intel Optane DC Persistent Memory Module. arXiv:1903.05714.
2. S. Fedorova. We Replaced an SSD with Storage Class Memory. Here is What We Learned. The MongoDB Engineering Journal.
3. J. Yang, et al. An Empirical Guide to the Behavior and Use of Scalable Persistent Memory. USENIX Conference on File and Storage Technologies (FAST 2020).
4. Yahoo! Cloud Serving Benchmark, Git Repo.
5. B. F. Cooper, et al. Benchmarking Cloud Serving Systems with YCSB. SoCC '10: Proceedings of the 1st ACM Symposium on Cloud Computing.
6. Open Cache Acceleration Software.
7. Open CAS Linux — Admin Guide.
8. H. T. Kassa, et al. Improving Performance of Flash Based Key-value Stores Using Storage Class Memory as a Volatile Memory Extension. USENIX Annual Technical Conference (USENIX ATC 21).

July 25, 2022