MongoDB Query API Webinar: FAQ
Last week we held a live webinar on the MongoDB Query API and our lineup of idiomatic programming language drivers. There were many great questions during the session, and in this post I want to share the most frequently asked ones with you. But first, here is a quick summary of what the MongoDB Query API is all about if you are unfamiliar with it. What is MongoDB Query API? MongoDB is built upon the document data model. The document model is designed to be intuitive, flexible, universal, and powerful. You can easily work with a variety of data, and because documents map directly to the objects in your code, it fits naturally in your app development experience. MongoDB Query API lets you work with data as code and build any class of application faster by giving you extensive query capabilities natively in any modern programming language. Whether you’re working with transactional data, looking for search capabilities, or trying to run sophisticated real-time analytics, MongoDB Query API can meet your needs. MongoDB Query API has some unique features like its expressive query language, primary and secondary indexes, powerful aggregations and transformations, on-demand materialized views, and more — enabling you to work with data of any structure, at any scale. Some key features to highlight: Indexes: To optimize any workload and query pattern, you can take advantage of a large set of index types like multi-key (for arrays), wildcard, geospatial, and more, and index any field no matter how deeply nested it is within your documents. Fully featured secondary indexes are document-optimized and include partial, unique, case-insensitive, and sparse. Aggregation Pipeline: The aggregation pipeline lets you group, transform, and analyze your data to support any class of workload. You can choose from dozens of aggregation stages and over 200 operators to build modular and expressive pipelines.
You can also use low-code tools like MongoDB Compass to drag and drop stages, examine intermediate output, and export to your programming language of choice. On-Demand Materialized Views: The powerful $merge aggregation stage allows you to combine the results of your aggregation pipeline with existing collections to update and enrich data without having to recompute your entire data set. You can output results to sharded and unsharded collections while simultaneously defining indexes on each view. Geospatial and Graph: Utilize MongoDB’s built-in ability to store and run queries against geospatial data. Use operators like $graphLookup to quickly traverse connected data sets. These are just a few of the features we highlighted in the MongoDB Query API webinar. No matter what type of application you are thinking of building or managing, MongoDB Query API can meet your needs as the needs of your users and application change. FAQs for MongoDB Query API Here are the most common questions asked during the webinar: Do we have access to the data sets presented in this webinar? Yes, you can easily create a cluster and load the sample data sets into Atlas. Instructions on how to get started are here. How can I access full-text search capabilities? Text search is a standard feature of MongoDB Atlas. You can go to cloud.mongodb.com to try it out using sample data sets. Does the VS Code plugin support aggregation? Yes, it does. You can learn more about the VS Code plugin on our docs page. If you need to pass variable values in the aggregation, say the price range from the app as an input, how would you do that? This is no different from sending a query: since you construct your aggregation in your application, you just fill in the field you want with a value or variable in your code. Is there a best-practice document on the MongoDB Query API for achieving stable performance with minimal resource usage?
Yes, we have tips and tricks on optimizing performance by utilizing indexes, filters, and tools here. Does MongoDB support the use of multiple different indexes to meet the needs of a single query? Yes, this can be accomplished by the use of compound indexes. You can learn more about it in our docs here. If you work with big data and create a collection, is it smarter to create indexes first or after the collection is filled (regarding the time to create a collection)? It is better to create the indexes first as they will take less time to create if the collection is empty, but you still have the option to create the index once the data is in the collection. MongoDB’s indexing capabilities offer several benefits: Index builds are online, so building an index has no impact on your app’s availability. You have the flexibility to add and remove indexes at any time. You can hide indexes to evaluate the impact of removing them before officially dropping them. Where do I go to learn more? Here are some resources to help you get started: the MongoDB Query API page, MongoDB University, and MongoDB Docs. You can also check out the webinar replay here.
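As a rough sketch of the FAQ answer about passing variable values into an aggregation: a pipeline is ordinary data built in application code, so a price range supplied by the app can simply be placed into a `$match` stage. The function name, the `price` and `category` field names, and the grouping stage below are hypothetical and chosen only for illustration.

```python
def price_range_pipeline(min_price, max_price):
    """Build an aggregation pipeline filtered by a caller-supplied price range.

    The field names `price` and `category` are illustrative assumptions,
    not fields from the webinar data sets.
    """
    return [
        # Only keep documents whose price falls inside the requested range.
        {"$match": {"price": {"$gte": min_price, "$lte": max_price}}},
        # Summarize the matching documents per category.
        {"$group": {
            "_id": "$category",
            "count": {"$sum": 1},
            "avgPrice": {"$avg": "$price"},
        }},
        # Largest groups first.
        {"$sort": {"count": -1}},
    ]


# The values would normally come from user input in the application.
pipeline = price_range_pipeline(10, 50)
print(pipeline[0])
```

With a driver such as PyMongo, the resulting list would be handed to `collection.aggregate(pipeline)`; because the values are inserted as data rather than concatenated into a query string, the pipeline shape stays the same for every request.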
How to Get Started with MongoDB Atlas and Confluent Cloud
Every year more and more applications are leveraging the public cloud and reaping the benefits of elastic scale and rapid provisioning. Forward-thinking companies such as MongoDB and Confluent have embraced this trend, building cloud-based solutions such as MongoDB Atlas and Confluent Cloud that work across all three major cloud providers. Companies across many industries have been leveraging Confluent and MongoDB to drive their businesses forward for years. From insurance providers gaining a customer-360 view for a personalized experience to global retail chains optimizing logistics with a real-time supply chain application, the connected technologies have made it easier to build applications with event-driven data requirements. The latest iteration of this technology partnership simplifies getting started with a cloud-first approach, ultimately improving developer productivity when building modern cloud-based applications with data in motion. Today, the MongoDB Atlas source and sink connectors are generally available within Confluent Cloud. With Confluent’s cloud-native service for Apache Kafka® and these fully managed connectors, setup of your MongoDB Atlas integration is simple. There is no need to install Kafka Connect or the MongoDB Connector for Apache Kafka, or to worry about scaling your deployment. All the infrastructure provisioning and management is taken care of for you, enabling you to focus on what brings you the most value — developing and releasing your applications rapidly. Let’s walk through a simple example of taking data from a MongoDB cluster in Virginia and writing it into a MongoDB cluster in Ireland. We will use a Python application to write fictitious data into our source cluster. Step 1: Set up Confluent Cloud First, if you’ve not done so already, sign up for a free trial of Confluent Cloud. You can then use the Quick Start for Apache Kafka using Confluent Cloud tutorial to create a new Kafka cluster.
Once the cluster is created, you need to enable egress IPs and copy the list of IP addresses. This list of IPs will be used as an IP Allow list in MongoDB Atlas. To locate this list, select “Cluster Settings” and then the “Networking” tab. Keep this tab open for future reference: you will need to copy these IP addresses into the Atlas cluster in Step 2. Step 2: Set Up the Source MongoDB Atlas Cluster For a detailed guide on creating your own MongoDB Atlas cluster, see the Getting Started with Atlas tutorial. For the purposes of this article, we have created an M10 MongoDB Atlas cluster using the AWS cloud in the us-east-1 (Virginia) data center to be used as the source, and an M10 MongoDB Atlas cluster using the AWS cloud in the eu-west-1 (Ireland) data center to be used as the sink. Once your clusters are created, you will need to configure two settings in order to make a connection: database access and network access. Network Access You have two options for allowing secure network access from Confluent Cloud to MongoDB Atlas: You can use AWS PrivateLink, or you can secure the connection by allowing only specific IP connections from Confluent Cloud to your Atlas cluster. In this article, we cover securing via IPs. For information on setting up using PrivateLink, read the article Using the Fully Managed MongoDB Atlas Connector in a Secure Environment. To accept external connections in MongoDB Atlas via specific IP addresses, launch the “IP Access List” entry dialog under the Network Access menu. Here you add all the IP addresses that were listed in Confluent Cloud from Step 1. Once all the egress IPs from Confluent Cloud are added, you can configure the user account that will be used to connect from Confluent Cloud to MongoDB Atlas. Configure user authentication in the Database Access menu. Database Access You can authenticate to MongoDB Atlas using username/password, certificates, or AWS identity and access management (IAM) authentication methods.
To create a username and password that will be used for connection from Confluent Cloud, select the “+ Add new Database User” option from the Database Access menu. Provide a username and password and make a note of this credential, because you will need it in Step 3 and Step 4 when you configure the MongoDB Atlas source and sink connectors in Confluent Cloud. Note: In this article we are creating one credential and using it for both the MongoDB Atlas source and MongoDB Atlas sink connectors. This is because both of the clusters used in this article are from the same Atlas project. Now that the Atlas cluster is created, the Confluent Cloud egress IPs are added to the MongoDB Atlas Allow list, and the database access credentials are defined, you are ready to configure the MongoDB Atlas source and MongoDB Atlas sink connectors in Confluent Cloud. Step 3: Configure the Atlas Source Now that you have two clusters up and running, you can configure the MongoDB Atlas connectors in Confluent Cloud. To do this, select “Connectors” from the menu, and type “MongoDB Atlas” in the Filters textbox. Note: When configuring the MongoDB Atlas source and MongoDB Atlas sink, you will need the connection host name of your Atlas clusters. You can obtain this host name from the MongoDB connection string. An easy way to do this is by clicking on the "Connect" button for your cluster. This will launch the Connect dialog. You can choose any of the Connect options. For purposes of illustration, if you click on “Connect using MongoDB Compass,” you will see the following: The highlighted part in the above figure is the connection hostname you will use when configuring the source and sink connectors in Confluent Cloud. Configuring the MongoDB Atlas Source Connector Selecting “MongoDbAtlasSource” from the list of Confluent Cloud connectors presents you with several configuration options.
The “Kafka Cluster credentials” choice is an API-based authentication that the connector will use for authentication with the Kafka broker. You can generate a new API key and secret by using the hyperlink. Recall that the connection host is obtained from the MongoDB connection string. Details on how to find this are described at the beginning of this section. The “Copy existing data” choice tells the connector upon initial startup to copy all the existing data in the source collection into the desired topic. Any changes to the data that occur during the copy process are applied once the copy is completed. By default, messages from the MongoDB source are sent to the Kafka topic as strings. The connector supports outputting messages in formats such as JSON and Avro. Recall that the MongoDB source connector reads change stream data as events. Change stream event metadata is wrapped in the message sent to the Kafka topic. If you want just the message contents, you can set the “Publish full document only” option to true. Note: For source connectors, the number of tasks will always be “1”; otherwise you will run the risk of duplicate data being written to the topic, because multiple workers would effectively be reading from the same change stream. To scale the source, you could create multiple source connectors and define a pipeline that looks at only a portion of the collection. This capability for defining a pipeline is not yet available in Confluent Cloud. Step 4: Generate Test Data At this point, you could run your Python data generator application and start inserting data into the Stocks.StockData collection at your source.
This will cause the connector to automatically create the topic “demo.Stocks.StockData.” To use the generator, clone the stockgenmongo folder in the above-referenced repository and launch the data generation as follows: python stockgen.py -c "< >" Where the MongoDB connection URL is the full connection string obtained from the Atlas source cluster. An example connection string is as follows: mongodb+srv://kafkauser:firstname.lastname@example.org Note: You might need to pip install pymongo and dnspython first. If you do not wish to use this data generator, you will need to create the Kafka topic first before configuring the MongoDB Atlas sink. You can do this by using the Add a Topic dialog in the Topics tab of the Confluent Cloud administration portal. Step 5: Configuring the MongoDB Atlas Sink Selecting “MongoDB Atlas Sink” from the list of Confluent Cloud connectors will present you with several configuration options. After you pick the topic to source data from Kafka, you will be presented with additional configuration options. Because you chose to write your data in the source by using JSON, you need to select “JSON” in the input message format. The Kafka API key is an API key and secret used for connector authentication with Confluent Cloud. Recall that you obtain the connection host from the MongoDB connection string. Details on how to find this are described at the beginning of Step 3. The “Connection details” section allows you to define behavior such as creating a new document for every topic message or updating an existing document based upon a value in the message. These behaviors are known as document ID and write model strategies. For more information, check out the MongoDB Connector for Apache Kafka sink documentation. If the order of the data in the sink collection is not important, you could spin up multiple tasks to gain an increase in write performance.
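To make the source connector behavior from Step 3 and Step 4 more concrete, here is a small plain-Python model of two points mentioned above: how a topic name such as demo.Stocks.StockData is composed, and what “Publish full document only” keeps from a change stream event. This is an illustrative sketch, not the connector's code. The helper names and the sample document fields are hypothetical, the "demo" prefix is inferred from the topic name in the text, and the event is a heavily abbreviated version of MongoDB's change event layout.

```python
def topic_name(prefix, database, collection):
    """Compose a topic name from a prefix and the source namespace."""
    return ".".join([prefix, database, collection])


def message_value(change_event, full_document_only):
    """Return what would be published for one change stream event.

    With full_document_only=True only the changed document is kept;
    otherwise the whole event, including its metadata, is published.
    """
    if full_document_only:
        return change_event.get("fullDocument")
    return change_event


# Abbreviated change event; the document fields are made up for illustration.
event = {
    "operationType": "insert",
    "ns": {"db": "Stocks", "coll": "StockData"},
    "documentKey": {"_id": 1},
    "fullDocument": {"_id": 1, "symbol": "ABC", "price": 12.5},
}

print(topic_name("demo", event["ns"]["db"], event["ns"]["coll"]))
print(message_value(event, full_document_only=True))
```

The sink side then only has to deal with the document itself, which is why the JSON input format selected in Step 5 needs to match what the source published.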
Step 6: Verify Your Data Arrived at the Sink You can verify the data has arrived at the sink via the Atlas web interface. Navigate to the collection data via the Collections button. Now that your data is in Atlas, you can leverage many of the Atlas platform capabilities such as Atlas Search, Atlas Online Archive for easy data movement to low-cost storage, and MongoDB Charts for point-and-click data visualization. Here is a chart created in about one minute using the data generated from the sink cluster. Summary Apache Kafka and MongoDB help power many strategic business use cases, such as modernizing legacy monolithic systems, single views, batch processing, and event-driven architectures, to name a few. Today, Confluent Cloud and MongoDB Atlas provide fully managed solutions that enable you to focus on the business problem you are trying to solve versus spinning your tires in infrastructure configuration and maintenance. Register for our joint webinar to learn more!
Use x509 certificate-based authentication with MongoDB and Apache Kafka
Kafka has emerged as a popular event streaming platform. The inherent "pub/sub" model can be viewed as a method for moving data between systems. As such, MongoDB offers a Kafka connector, enabling Kafka topics to be copied into a MongoDB cluster (the sink). Similarly, the connector enables data movement from a MongoDB cluster (the source) into Kafka topics. To access data securely, certificate-based X.509 authentication is a natural choice for server-to-server authentication scenarios with Kafka and MongoDB. Certificates avoid having to store or manage usernames and passwords when used with database connection strings. For example, such user credentials could be inadvertently exposed if "hard-coded" in configuration files or other uses. An X.509 certificate is a structured, binary record. This record consists of several key and value pairs. X.509 certificates use the widely accepted international X.509 public key infrastructure (PKI) standard. The use of certificates prevents user credential exposure. Authentication requests with certificates verify that any public key presented by a client or another member of the cluster belongs to that client or member. The X.509 certificate method for authentication is more secure than conventional password-based authentication because each server machine needs its own dedicated key to participate in the cluster. For use with secure TLS/SSL connections, MongoDB supports X.509 certificate authentication, allowing clients to use public key infrastructure in lieu of SCRAM (username and password). The certificate encodes two very important pieces of information: the server's public key and a digital signature that can be used to confirm the certificate's authenticity. Additionally, the certificate will include metadata used by the Certificate Authority to track the certificate and provide guidelines on how the public key can be used.
Using the server's public key, the client and server are able to negotiate a shared symmetric key securely, which can be used to secure communications. Users can either generate their own certificates and keys (self-managed) or use the Atlas PKI. In either case, first a project-specific CA private and public key are generated, and then a per-user private key and signed X.509 identity certificate are created. If using self-managed X.509 infrastructure, you'll need to upload your CA public key certificate into your Atlas project. If using Atlas-managed X.509 infrastructure, you'll need to download the project private key and provide that to your Kafka Connect service. This signed certificate is then pushed to each server member in your Atlas cluster. The diagram below shows the deployment of a standard 3-node replica set and client using X.509 authentication: In non-production environments, the basic SCRAM authentication method may be most suitable. However, for production environments or server-server scenarios such as a Kafka-MongoDB integration, X.509 authentication is the recommended mechanism. To use X.509 authentication for server-server scenarios, first confirm that you are able to authenticate to an Atlas cluster using X.509 certificates. Then follow the steps below. Prerequisites: OpenSSL must be installed. Project-level CA and user certificates must be created in PEM format. If using Atlas-managed certificates, you will need a user-specific client certificate (see X.509 tab: https://docs.atlas.mongodb.com/security-add-mongodb-users/#database-user-authentication ). If using self-managed X.509 auth, you will need to create and upload your CA public key to Atlas (see https://docs.atlas.mongodb.com/security-self-managed-x509/ ), and have a user-specific client certificate ready. Ensure that you have installed the MongoDB Kafka Connector and understand how to use it with Kafka Connect.
Then follow these steps: Obtain the client user certificate from your system administrator (or from Atlas). In this example, the user certificate is stored in PEM file kafkaclient-X509-cert.pem and will be associated with the Atlas database user kafka-svc. Convert the PEM file to a password-protected PKCS12 formatted certificate by running this command: openssl pkcs12 -export -in kafkaclient-X509-cert.pem -out kafkaclient-X509-cert.p12 -password pass:mypassword Copy the PKCS12 certificate ( kafkaclient-X509-cert.p12 ) to the server where Kafka Connect is running. Note the full path of the PKCS12 certificate location. Update the Kafka Connect configuration in the KAFKA_OPTS environment variable: export KAFKA_OPTS="-Djavax.net.ssl.keyStore=<path to kafkaclient-X509-cert.p12> -Djavax.net.ssl.keyStorePassword=mypassword -Djavax.net.ssl.keyStoreType=PKCS12" Restart Kafka Connect. Update the MongoDB Connector configuration to use a connection URI with the following parameter options: connection.uri: "mongodb+srv://<mongodb-host>/test?authSource=%24external&authMechanism=MONGODB-X509&subjectName=kafka-svc" Re-deploy the MongoDB connector using the Kafka Connect REST API, with the above configuration for the connection URI. Download the latest MongoDB Connector for Apache Kafka 1.5 from the Confluent Hub! Read the MongoDB Connector for Apache Kafka documentation. Questions/Need help with the connector? Ask the Community.
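The connection URI and KAFKA_OPTS values from the steps above are easy to get wrong by hand, particularly the `%24external` encoding of `$external`. The following is a minimal sketch, assuming hypothetical helper names and a placeholder host, that assembles both strings in Python. It covers only the `authSource` and `authMechanism` parameters and is not part of the connector or of Atlas tooling.

```python
from urllib.parse import urlencode


def x509_connection_uri(host, database="test"):
    """Build a mongodb+srv URI that selects X.509 authentication.

    urlencode percent-encodes the "$" in "$external" as "%24".
    """
    params = {"authSource": "$external", "authMechanism": "MONGODB-X509"}
    return f"mongodb+srv://{host}/{database}?{urlencode(params)}"


def kafka_opts(keystore_path, keystore_password):
    """Assemble the JVM keystore flags used in KAFKA_OPTS in the steps above."""
    return " ".join([
        f"-Djavax.net.ssl.keyStore={keystore_path}",
        f"-Djavax.net.ssl.keyStorePassword={keystore_password}",
        "-Djavax.net.ssl.keyStoreType=PKCS12",
    ])


# Placeholder host and path; substitute the values for your own cluster.
uri = x509_connection_uri("cluster0.example.mongodb.net")
opts = kafka_opts("/etc/kafka/kafkaclient-X509-cert.p12", "mypassword")
print(uri)
print(opts)
```

In a real deployment the keystore password would come from a secret store rather than a literal in code; it is inlined here only to mirror the example command.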
Built With MongoDB: Go
“Social media was supposed to augment our friendships and give us more to talk about — but it’s actually starting to replace our relationships,” laments Sean Conrad, the co-founder and CEO of Go. After 10 years of working at large tech companies and bootstrapping a multimillion-dollar gaming company, Sean started building Go, a social app focused on helping friends create plans to hang out in person. Combining data science, social networking, and event aggregation, Go provides users with a custom, curated feed of cool things to do and friends to do them with. Go is live in New Zealand and (very recently) Australia with over 40,000 downloads and 500 businesses. The startup has raised $6.7 million in seed funding and has been building with MongoDB from the start. For this edition of #BuiltWithMongoDB, we spoke with Sean about the business, being a second-time founder and CEO, and his experience with MongoDB. MongoDB: You actually started building during the COVID-19 pandemic. How did that impact the product, given that your mission is to bring people together in real life? Sean: It impacted us in so many ways. We researched the space throughout 2019, and started building the app in early 2020, planning for a fall release in Portland or Los Angeles. And then the pandemic hit the United States. We realized it was jokingly bad that we were building an app to bring people together just when social distancing was becoming a requirement. For a month, we contemplated a lot of possible ideas, and we had some cool ones, but our passion was really about making offline connections stronger. We spent the summer working on the product, and then launched in New Zealand because that country had handled the pandemic well and reopened. The product has been a huge success in New Zealand, and after iterating on it, we recently launched in Australia. Our plan is to launch in the United States, starting from Los Angeles, during the summer of 2021.
MongoDB: You mentioned that you've used MongoDB before. What has your experience been like with MongoDB as a 2x founder? Sean: At my previous company, we scaled up to about 30 million downloads, and we ran it on MongoDB. We were not database experts, and it was very easy to use. It was 2013 when we started using MongoDB. We had our hiccups and had to learn what indexes were, but we became really comfortable with the platform. For Go, we picked MongoDB out of comfort. When we got started with Go, MongoDB Realm was still in beta. We would’ve used it had it been around, but we built our first product on Firebase Firestore. Firestore ended up being a bit limiting for us because we wanted to build a feed-based system (in Go, it’s showcasing a series of events or things to do that are interesting to you and your friends), so a lot of filters are necessary. That requires many different types of unstructured data that’s difficult to put into a simple schema. Managing these things demands a lot of documents and data duplication, and MongoDB was a good fit for that. We like that Atlas has full-text search built on Apache Lucene, which is a powerful text search library. We are just getting into that. In addition, most of our compute runs on AWS. We use a lot of containerized stuff on AWS, and a little bit of Lambda stuff, and we’re moving to a serverless environment. I’m not sure what the future of Go is, but I’m confident MongoDB will play a part in it. Our mobile app is written in Flutter, Google’s competitor to React Native. We like that quite a bit. MongoDB: What is the last technical podcast you enjoyed? Sean: It’s All About Widgets, a podcast about Flutter. We’ve got a really talented group of developers on our team — two of them are ranked in the top 15 Stack Overflow Flutter contributors! One of our developers, Raouf Rahiche, spoke on their second episode. It was really cool to hear a team member talking on this podcast.
MongoDB: As a second-time founder, what is one thing that was unexpected for you in building this business? Sean: This is the first business in which I’ve raised funding, and I couldn’t have done it without my co-founder, Jesse Berns. For my last business, I started with something small with a few people, found product-market fit, and grew that. With Go, we started with a much more grand vision in mind, so it made sense to operate more like a traditional Silicon Valley startup, raising capital and growing the team quickly. With all startups, you’re operating with very few known facts, but when you raise money everything just tends to get bigger, faster, and I always say this is like ‘operating on hard mode’ — but in our case, it’s worth it. Our goal with Go is to help people manage their friendships in the same way that LinkedIn helps people manage their professional lives, and if we’re successful, that’ll entirely change how people make plans and optimize their friendships for more time together face-to-face. It’s built to inspire us to live our ideal lives, whether that’s basement art shows, unforgettable live music, lunch with friends at a special place that could only exist in your neighborhood, or a slow bike ride down by the river. It’s built for the mundane and the thrilling and everything in between. We’re at a really exciting moment in history where all the trends — adoption of mobile, the upcoming end to the pandemic — are going to enable a culture where people want to find humanity and joy in person, and human-facing tech is going to have a big impact in the next few years. With Go, we’re really excited to be part of that. Looking to build something cool? Get started with the MongoDB for Startups program.
Built With MongoDB: Buffer
I first became a fan of Buffer during graduate school. While managing social marketing for student clubs and conferences, I relied on Buffer to manage our fun marketing campaigns. Buffer is a popular social media software that enables small businesses and content creators to plan, publish, and analyze marketing campaigns across social channels. It serves 67,536 customers in more than 85 countries. The company has over $21M in annual recurring revenue and has been in business for 10 years now. I recently had the opportunity to speak with Dan Farrelly, Buffer’s CTO, about the fast-growing company, his experience with MongoDB for Startups, and the challenges of growing into a CTO position. MongoDB: Let’s go back to February 2014. At that time, Buffer was a much smaller company — only about 15 people, compared with the more than 80 people now. What drew you to join? Dan: Hands down, the culture. There were two things that were unique about Buffer at that time: First, it was an entirely remote team. This was rare in the pre-pandemic world. Second, there was incredible transparency both inside and outside the org. The company was so open about salary that on the Buffer Jobs page, it had an estimated salary calculator based on role and experience. Internally, all revenue numbers and company metrics were accessible to the entire team. The executives being an open book enabled trust and free communication across the organization. And like any startup, we were all-in. Early on, I remember being at a taco shop on a Friday evening when the then-CTO texted me that the servers were crashing. I opened up my laptop at the restaurant and just started troubleshooting — doing whatever I could to try to mitigate the issue. Many people depended on us to manage their social identities, and so with a taco in one hand, and a phone in the other, we figured it out.
Working at a startup is such an incredible learning curve; you have to be scrappy, push the boundaries, and find creative ways to deliver results. MongoDB: Why did the team decide to build with MongoDB? Dan: Our culture has always been engineering-centric, focused on shipping code as soon as it’s ready for production. We encourage continuous delivery of our applications. MongoDB’s products resonate with that lean culture. MongoDB doesn’t require schema migrations; the flexibility and ease of use enabled us to practice the type of engineering we wanted. MongoDB became our partner in being fast and delivering often. An additional benefit was the ability to scale easily: one type of application we were building (content scheduling for social media) had a massive collection of data that had to be scheduled, which required very high throughput — we were posting hundreds of thousands of times a day for social media accounts. MongoDB Atlas allowed us to scale and ensure we didn’t have to worry about our database over the years. MongoDB: Had you used MongoDB before joining Buffer? Dan: I had taken a MongoDB University course in 2012 focused on MongoDB for Node.js developers, and I had built a few side projects and prototypes with MongoDB. The course itself was fantastic: it not only talked about basic things such as setting up replication, sharding, and how the database itself works, but it also talked about some of the more complex elements (how drivers work, write concern, and fully leveraging the database). But the best way to learn about MongoDB was putting out fires at Buffer. Early on, we had monitoring and scaling issues, not with the database but with the code, and our team had to get smart about diagnosing specific issues in our application. MongoDB: What advice do you have for an engineer who wants to grow into a CTO position someday? Dan: Engineers can pursue their own roles and do a really good job while still having a limited perspective of the company.
In order to become a CTO, you really need to broaden that perspective, and understand how technical strategy supports business goals. The CTO doesn’t have to be the most technical person on the team, but has to have a well-rounded view of the business and also effectively communicate across the stack. Transparency at Buffer helped me develop a wider perspective of the business. If you have ambitions to grow into a CTO role, build relationships across the organization — on the technical and business sides — and think strategically about how the code you ship drives business metrics. Looking to build something cool? Get started with the MongoDB for Startups program.
Reducing Queue Times by Using Speculative Execution
When solving concurrency problems in software, the simplest solution is often to make the trickiest part of the problem serial. Here at MongoDB, this is exactly the approach we took to implement a commit queue, where engineers submit code changes to be tested and then merged into a repository. This worked well for many smaller repositories, but for large ones such as the MongoDB Server, testing submissions one at a time proved to be too slow, with engineers sometimes waiting hours for their code to finally make it into the repository. To solve this challenge, we introduced some speculative execution on top of our original approach, which reduced the wait time for a typical week by 62%. Background Many of the engineers at MongoDB submit their code changes to a commit queue, which runs a basic set of tests on these changes before merging them to the correct repository. The main difference between the commit queue running the tests and an engineer running the tests is that the commit queue tests with the latest changes to the code base, whereas an engineer has checked out the code base at some point in the past. To ensure that it has the latest changes, the commit queue tests only one set of changes at a time before either merging the changes if the tests pass, or rejecting them and notifying the author if the tests fail. This serial approach makes the system easy to understand, but it also presents an optimization opportunity to reduce the time spent waiting for tests to start. Design Approach Parallelization The only part of this system we needed to keep serial was the part that merged changes into the repository, because this ensures that changes would be merged in the same order in which they were submitted. By far, the slowest part of the commit queue is actually running the tests, and this is the work that we wanted to split among multiple machines. Let’s assume as an example that all submissions to the commit queue take 10 minutes to run.
Let’s also assume that in one day there are 30 submissions to the commit queue at roughly the same time. With the previous requirement that the queue runs serially, this means it would take 300 minutes to get through all the submissions. If we parallelize testing the submissions among 30 machines, it would take only 10 minutes of actual time from when the last change was submitted to get through all the submissions. Speculative Execution With a serial queue, each successful submission checks out the latest code in the repository, applies its changes, runs tests, then commits the code back to the repository before the next submission starts. If we do these steps in parallel, however, checking out the latest code in the repository will not include the changes from submissions that would have merged before the one being tested. Parallelizing our tests requires some extra steps to ensure that submissions run tests with the code changes from prior submissions. In order to know what code changes should be applied to which tests, the commit queue must still maintain the concept of an order for each submission. That way, the third entry in the queue will know that it must apply the changes from the first and second entries, in addition to its own code changes. If any test for a submission fails, it’s rejected from the queue and any submissions after it are rerun without the code changes from the one that failed. If all tests for a submission finish running, the submission will wait to be merged until the one immediately in front of it is merged. Performance Considerations Testing with merged code changes like this requires that most of the tests pass; otherwise the system will do a lot more work than it would have done if it tested submissions one at a time, and we lose all the benefits from parallelism.
In the worst-case scenario where nothing passes, the nth submission in the queue would need to be restarted each time something in front of it fails, leading to n(n + 1)/2 total times that any submission is run. This means that if engineers add 10 submissions to the commit queue, the new parallel approach runs tests as many as 55 times, whereas with the old serial approach the tests would always run 10 times. Maybe this worst-case scenario isn’t a big deal if the majority of submissions pass (and at MongoDB, 85% of them do). However, we’d like to guarantee that an unusually bad day doesn’t make the machines running the tests do an excessive amount of unneeded work. To make this guarantee possible, we inserted a checkpoint into the queue, so that only the batch of submissions in front of the checkpoint is running tests. In the example of 10 submissions to the queue, placing the checkpoint after submission No. 3 would mean that the first three submissions start running tests while submissions No. 4 and later wait until the first three finish. It’s totally possible that everything still fails, but adding this checkpoint prevents us from doing too much extra work. With the checkpoint, a queue of length n would run at most: (n \ f) × f(f + 1)/2 + (n % f)(n % f + 1)/2 total submissions, where f is the position of the checkpoint, \ is the integer division operator, and % is the modulus operator. If engineers add 10 submissions to the queue and the checkpoint is after submission No. 3, this hybrid approach would run tests as many as 3 × 6 + 1 = 19 times, compared with 55 with a fully parallel and 10 with a fully serial approach. The following infographic helps visualize this example. Colors represent the current status of the submission: green means successful, red means failed, yellow means in progress, and gray means not yet started. Results The graph below depicts the average length of time a submission would wait before it started running tests for a representative week when processing submissions serially.
Contrast these times with the graph below, which shows a representative week with the hybrid approach. For the depicted weeks, the overall average time dropped from 1,238 seconds with the serial approach to 469 seconds with the hybrid approach, a reduction of 62%. Conclusion With this hybrid approach of parallelizing the longest-running parts of the system but keeping key parts serial, we were able to reap the benefits of each approach. We saw drastic reductions in wait times while still maintaining the concept of an ordering for our commit queue. What led us to this approach were the requirements that the result should be noticeably faster for typical sizes of the problem (a queue with one to nine submissions), but could not be drastically slower in the worst-case scenario. These two guiding principles will often lead to designs that work well in real-world scenarios, even though they may not handle all edge cases gracefully.
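The worst-case figures quoted in the performance discussion (55 runs fully parallel, 19 with a checkpoint after submission No. 3, 10 fully serial) can be checked with a small simulation. The sketch below is not the commit queue's actual implementation. It assumes that each submission has a fixed pass/fail outcome, that a batch runs all of its submissions at once, and that everything behind the first failure in a batch is rerun. The function names and the closed-form expression are illustrative choices that happen to agree with the numbers in the article.

```python
from typing import List, Optional


def worst_case_runs(n: int, checkpoint: Optional[int] = None) -> int:
    """Closed-form worst case: every submission fails.

    `checkpoint` is the batch size f; None means fully parallel (f = n).
    """
    if n == 0:
        return 0
    f = n if checkpoint is None else checkpoint
    full_batches, remainder = divmod(n, f)
    return full_batches * f * (f + 1) // 2 + remainder * (remainder + 1) // 2


def simulate_runs(passes: List[bool], checkpoint: Optional[int] = None) -> int:
    """Count test runs for a queue, given each submission's (assumed fixed) outcome."""
    n = len(passes)
    if n == 0:
        return 0
    f = n if checkpoint is None else checkpoint
    runs = 0
    for start in range(0, n, f):
        batch = passes[start:start + f]
        while batch:
            # Every submission still in the batch runs its tests speculatively.
            runs += len(batch)
            if False in batch:
                # Entries ahead of the first failure merge, the failure is
                # rejected, and everything behind it is rerun without it.
                batch = batch[batch.index(False) + 1:]
            else:
                batch = []  # the whole batch passed and merges in order
    return runs


all_fail = [False] * 10
print(simulate_runs(all_fail))                # fully parallel
print(simulate_runs(all_fail, checkpoint=3))  # hybrid
print(simulate_runs(all_fail, checkpoint=1))  # fully serial
```

Setting the checkpoint to 1 reproduces the serial queue, and leaving it unset reproduces the fully parallel one, so the checkpoint position acts as a single knob between the two extremes.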
Built With MongoDB: Queenly
The Difference Between R and D
I used to believe that Research and Development (R&D) departments should work in lockstep with Product teams so that they can stay focused on commercially-viable innovations. After all, not every innovation has a market, and not every business has resources to bet on future markets. All of that changed when I met Dr. Michael Cahill, the head of MongoDB Labs in Australia. Michael came to MongoDB through our acquisition of WiredTiger back in 2014, an open source storage engine company he co-founded with Keith Bostic. He holds a PhD in Computer Science and a history of breakthrough innovation. He also has an enlightened point of view on the role of research in any technology company. “Researchers need time and space to pursue the theoretical,” he told me. “We want them to come up with crazy ideas, with much longer time horizons.” Michael is referring to the fundamentally different mindsets required of researchers versus developers. Our developers are focused on new products or features that can make an impact in the next 3-4 quarters. Our researchers are thinking about solving problems that have the potential to reshape entire markets for decades. Big difference. Funding this kind of innovation is challenging for the MBA set, and measuring the ROI of basic research is notoriously difficult. Progress can seem slow and difficult to quantify. Our researchers occupy a space that straddles art and science, industry and academia. They spend a lot of time reading, thinking, and tinkering. Ideas are freely shared, cultivated, iterated, and sometimes abandoned. This is the price of disruptive innovation. In fact, MongoDB would never exist if our founders had set out to simply improve upon relational databases. Instead, they wanted to invent an entirely new way to manage data. It was an ambitious idea that required long-term thinking. An idea that despite MongoDB’s current success, is still only in its infancy. 
An idea so humongous, Michael Cahill may have even called it “crazy.” Don’t get me wrong. The work of MongoDB Labs is firmly grounded in MongoDB’s core strategy: to constantly improve the way data is stored, accessed, secured and processed. Document databases are only the first act of this play. And I’m certain the next act is being written as we speak, by Michael and his team. Have a different approach to R&D? Think my ideas are “crazy”? Let me hear about it on Twitter at @MarkLovesTech
Accelerate Data Modernization with Infosys Data Model Converter
Are you in the process of migrating applications from a relational database to MongoDB? If so, you’re likely trying to best understand and decide how your enterprise data needs to be modeled. Our previous blog discussed how Infosys Data Services Suite can help enterprises move data seamlessly from legacy relational databases to MongoDB. But moving data is only one part of the puzzle. The more significant step is choosing the target data model, or schema design, a process that usually requires several hours of highly skilled talent. That’s why we created this follow-up blog to help you get started. Rethinking Schema Design Ultimately, schema design can be the difference between an inefficient, disorganized database and a strategic one that empowers the entire company. Schema design in MongoDB requires a change in perspective for data architects, developers, and database administrators. They have to: Rethink the legacy relational data model. This model flattens data into rigid two-dimensional tabular structures of rows and columns. The new data model is a rich and dynamic one with embedded sub-documents and arrays. Rethink how the data platform works. In relational databases, it is extremely difficult to change the data platform as the application evolves. However, in MongoDB, the apps and APIs come first and the data platform dynamically accommodates the data. Getting Schema Design Right Begin the schema design process by considering the application’s requirements. You’ll want to model the data in a way that leverages the flexibility of the document model. In schema migrations, it may seem easy at first to simply mirror the flat schema of the relational database in the document model. However, this negates the advantages enabled by the rich and embedded data structures of the document model. For example, data that belongs to a parent-child relationship in two RDBMS tables can be collapsed (embedded) into a single document in MongoDB.
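As a rough illustration of collapsing a parent-child relationship, the sketch below folds rows from two relational tables into one document per parent. The table and column names (orders, order_items, order_id) and the helper embed_children are hypothetical and chosen only for this example. Plain Python dictionaries stand in for rows and documents, so no database driver is assumed.

```python
from collections import defaultdict
from typing import Any, Dict, List

Row = Dict[str, Any]


def embed_children(parents: List[Row], children: List[Row],
                   parent_key: str, foreign_key: str, field: str) -> List[Row]:
    """Return one document per parent row, with its child rows embedded as an array."""
    grouped: Dict[Any, List[Row]] = defaultdict(list)
    for child in children:
        # The foreign key is implied by the embedding, so drop it from the child.
        item = {k: v for k, v in child.items() if k != foreign_key}
        grouped[child[foreign_key]].append(item)

    documents = []
    for parent in parents:
        doc = dict(parent)
        doc[field] = grouped.get(parent[parent_key], [])
        documents.append(doc)
    return documents


# Hypothetical rows from two relational tables.
orders = [
    {"order_id": 1, "customer": "Ada"},
    {"order_id": 2, "customer": "Lin"},
]
order_items = [
    {"item_id": 10, "order_id": 1, "sku": "A-1", "qty": 2},
    {"item_id": 11, "order_id": 1, "sku": "B-7", "qty": 1},
    {"item_id": 12, "order_id": 2, "sku": "A-1", "qty": 5},
]

order_docs = embed_children(orders, order_items, "order_id", "order_id", "items")
print(order_docs[0])
```

Whether embedding is the right choice depends on how the application reads and writes the data. An order that is always loaded together with its items is a natural fit, while child data that grows without bound or is queried on its own may be better kept in a separate collection.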
The application data access patterns should also drive schema design with a specific focus on: the read/write ratio of database operations and whether it is more important to optimize the performance of one operation over another; the types of queries and updates performed by the database; and the lifecycle of the data and growth rate of documents. Simplifying Schema Design with Infosys Data Model Converter Infosys has developed a solution called Infosys Data Model Converter that processes source relational schema and the above-mentioned signals as inputs and automatically provides target MongoDB schema suggestions. Infosys Data Model Converter is available as part of Infosys Modernization Suite, which accelerates enterprises’ modernization journey. Each schema suggestion is accompanied by a detailed analysis report. The data modeler can use this as a starting point and iterate over the schema to arrive at the final MongoDB schema. The Infosys Data Model Converter reduces the effort typically spent in schema design by 50-60%. Key Features: boosts productivity by augmenting the migration of RDBMS to NoSQL database; saves time by automatically extracting schema, query and data patterns from an existing RDBMS; comprehensively analyzes the RDBMS entity relations, data and read-and-write patterns; applies a rich set of rules and generates a fully compliant NoSQL target state data model; offers flexibility by externalizing the rules for organization-specific customizations; connects and deploys the model to the target NoSQL platform with sample data. Discover more ways in which Infosys can help you unlock value from modernization. Contact us for any modernization questions.