MongoDB Developer

Coding with MongoDB - news for developers, tips and deep dives

Why I Wrote the New MongoDB Aggregations Book

In early May 2021, I published my book, Practical MongoDB Aggregations, which I released electronically and free for anyone to read. I love the MongoDB database and the uniqueness and power of its aggregation framework to analyse and manipulate massive amounts of data intuitively and efficiently. The opportunity to share this passion with others spurred me to write the book, with which I aim to support developers, architects, data analysts, data engineers, and data scientists to better understand how to maximise their productivity and effectiveness when building aggregation pipelines, as well as how to optimise these pipelines. Like many people over the past year during the pandemic, I've struggled to keep myself occupied when not busy doing my day job. Hence, my book was born not just from a desire to improve people's knowledge but as my pandemic project, written over many weekends, to stave off the boredom.

I believe aggregation pipelines provide a powerful domain-specific language for data processing in a way I've not seen before in other data-oriented tools, languages, or standards. SQL is a good data query language that caters to some analytical use cases via "group-by/having" statements. However, it typically has to be paired with a procedural language (e.g., Oracle's PL/SQL) to encompass an ordered set of complex data transformation rules. In the big data world of Hadoop, I find the MapReduce approach too complex to develop with efficiently. Higher-level tools like Spark help alleviate some of this. However, by the necessity of still having to be general-purpose and versatile, the amount of Spark code required to process data sitting in any type of database is still too high for my liking. Many ETL tools provide proprietary data transformation capabilities, but these have to cater to the lowest common denominator of capabilities across all the different types of databases they interact with. For these reasons, and from experience, I consider MongoDB Aggregations to be the best tool for processing large data sets because it combines performance with productivity.

Nevertheless, I sense the aggregation framework is shrouded in mystery for many people, hence my desire to demystify it with this book. I believe I identified a knowledge gap that many users wanted to be filled. MongoDB Inc. provides excellent reference documentation about aggregations in the MongoDB Manual, and MongoDB University provides a tremendous free online training course on aggregations. What I felt was still to be addressed was an opinionated yet informed perspective on how best to assemble aggregation pipelines from the well-documented parts: something that points the way to achieve optimal productivity and performance, accompanied by fully formed example pipelines to help put these approaches into practice.

I hope readers of my book will learn some new things of value and enjoy reading it. A good test of the relevance of my book, in time, will be if people come back to it repeatedly as they continue with their journey of developing aggregations. Read the book for free now!
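To give a flavour of the kind of pipeline the book explores, here is a minimal sketch in Python with PyMongo. The connection string, collection, and field names are hypothetical, not taken from the book; it simply shows the aggregation equivalent of a SQL "group-by/having" expressed as a pipeline.

    from pymongo import MongoClient

    # Hypothetical connection string and namespace; substitute your own.
    client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
    orders = client["shop"]["orders"]

    # Group orders by customer and keep only customers who spent over 100,
    # roughly the equivalent of SQL's GROUP BY ... HAVING.
    pipeline = [
        {"$group": {"_id": "$customer_id", "total_spend": {"$sum": "$amount"}}},
        {"$match": {"total_spend": {"$gt": 100}}},
        {"$sort": {"total_spend": -1}},
    ]
    for doc in orders.aggregate(pipeline):
        print(doc)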

May 25, 2021
Developer

Infosys Media Platform & MongoDB: Metadata Management and Workflow Orchestration Across Media Supply Chain

Capitalize on current and innovative technologies in the media supply chain with the Infosys Media Platform (IMP). As part of the cloud-based Infosys Cobalt™ portfolio, Infosys' unifying framework, built on MongoDB Atlas and MongoDB Enterprise Advanced, helps you facilitate creative collaboration, enable productions on an industrial scale, and monetize customer relationships. How? By integrating the various ecosystems involved and providing a common platform to connect you to services and technology solutions for the media content value chain.

Infosys' intelligently woven media and metadata management framework, leveraging MongoDB's document model, enables smart workflows and incorporates ML/AI to create, manage, and moderate content metadata. This allows the orchestration of workflows across different business functions. Additionally, the platform delivers the benefits of productivity, scalability, and agility via the cloud and streamlines collaboration among the ecosystem of partners and technology solutions.

Why Infosys Media Platform (IMP)?

The Infosys Media Platform consists of various modules that serve critical business functions, such as:

The Curation & Digitization Module -- provides the master workflow and ingests content from internal archives and multiple sources, using AI/ML to create a composite index of frame-level and time-coded metadata of recognized elements (such as celebrities/known personalities, objects, brands, text, and images). It also enables intelligent ad spot identification and includes functions like automated QC, editing, review and approval, and censor editing.

The Custom AI Module -- ensures that newly introduced elements such as celebrities, brands, etc., can be continuously trained and recognized. This can also be used to custom-train the AI models to recognize specific content as per a customer's needs.

The Localization Module -- enables collaboration across multiple locations and global vendors through automatic generation of closed captions and subtitles in multiple languages.

The Metadata Management and Distribution Module -- enables global distribution to digital platforms at scale through standard workflow models and a state-of-the-art dashboard by orchestrating the accumulation of asset-level and descriptive time-based metadata from production to delivery.

With the above modules, the Infosys Media Platform delivers the following capabilities:

Content Enrichment -- leverages AI models to process video files and generate time-coded metadata for the post-production and distribution process.

Closed Captioning, Subtitling and Localization -- processes audio (dialogs from a video, lyrics from a song, speech from a podcast) and converts it into closed captions and subtitles.

Content Moderation -- recognizes the presence of mature content (profanity, violence, gore, etc.) using the video content and speech detection capabilities.

Image Processing -- identifies various attributes of an image file, similar to the capabilities on video/music content.

Metadata Packaging & Distribution -- manages the end-to-end supply chain of digital metadata creation, updates, packaging, and distribution.

NLP-Based Analytics -- using the natural language processing capabilities of the platform, users can review any string of text (dialogs, lyrics, conversations) to determine the context of the conversation as well as the sentiment.

Why MongoDB for Infosys Media Platform?

What company wouldn't want a database platform that increases developer productivity and data-driven operational efficiency?
MongoDB offers both. With MongoDB Atlas, you can reduce the time developers spend managing data and databases, so they can focus on value-added tasks like developing new apps. MongoDB's document model and query language provide easier access to data, allowing developers to work quickly and efficiently to support new data structures and data types, as well as leverage database-supported roll-ups for analysis. Additionally, MongoDB Atlas provides 100+ metrics and monitoring capabilities with its complete data platform built to improve operational productivity, so you can work smarter, not harder.

A main feature of the Infosys Media Platform is its cloud-agnostic nature; as a cloud-agnostic, multi-cloud data platform, MongoDB is the only platform that satisfies IMP, with its ability to run seamlessly across the globe. Through a mixed workload of real-time and transactional analytics, IMP also offers a roadmap on analytics, text search, and data visualization capabilities -- and MongoDB provides all of these features.

How MongoDB powers the data platform for the Infosys Media Platform (IMP):

Data for all the modules previously described is powered by the MongoDB data platform.

Details like profile data, accounts, ratings, translations, and country/region details are stored in MongoDB.

Audit transactional data currently runs on SQL, and there is a roadmap for moving it to the MongoDB data platform.

In the future, MongoDB's core capabilities will further enhance the Infosys Media Platform and customer experience. Our roadmap includes utilizing MongoDB ACID transactional capabilities to store audit details, as well as using MongoDB functions and triggers for creating cloud-agnostic serverless functions. Additionally, MongoDB Atlas may be leveraged for full-text search capabilities applied to media data, and to create charts for dashboarding and real-time analytics of user subscriber data.

Download the Modernization Guide
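To illustrate why a document model fits this kind of frame-level, time-coded metadata, here is a hypothetical sketch in Python with PyMongo. The connection string, collection, and field names are invented for illustration and are not part of IMP; it simply shows nested annotations stored with their asset and queried directly.

    from pymongo import MongoClient

    # Hypothetical connection string and namespace; substitute your own.
    client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
    assets = client["media"]["asset_metadata"]

    # One document per asset, with nested time-coded annotations produced by AI/ML.
    assets.insert_one({
        "asset_id": "EP0042",
        "title": "Episode 42",
        "annotations": [
            {"start_ms": 120000, "end_ms": 123000, "type": "celebrity", "label": "Jane Doe", "confidence": 0.97},
            {"start_ms": 450000, "end_ms": 452500, "type": "brand", "label": "Acme Cola", "confidence": 0.91},
        ],
    })

    # Find assets that contain a given brand anywhere in their time-coded metadata.
    print(assets.count_documents({"annotations.label": "Acme Cola"}))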

May 19, 2021
Developer

Build Applications and APIs Faster with MongoDB Atlas and AWS App Runner

MongoDB is excited to be a launch partner for AWS App Runner. Most technology organizations today are in the process of adopting DevOps practices and speeding up the development of new containerized apps. However, deploying these apps to production environments has often required additional infrastructure management, reducing the agility that containers promise.

MongoDB Atlas provides database infrastructure on demand, simplifying the data layer of DevOps deployments. Atlas makes it simple to create, scale, and tear down databases as needed, smoothing your deployment pipeline. With AWS App Runner, it's now even easier to deploy the application component. Given your source code or container image, AWS App Runner builds and deploys the application or API and handles load balancing, scaling, monitoring, and more. There's no need to manage servers or containers yourself: AWS App Runner handles all of the details. AWS App Runner-based applications connect seamlessly with MongoDB Atlas, addressing both the application layer and the data layer deployment. Developers can now leverage the scalability and performance of our global, cloud-native database service for their App Runner applications.

MongoDB Atlas & FastAPI on App Runner

Let's dive into deploying a real API application with AWS App Runner and Atlas to see just how straightforward it is! We'll demonstrate an example of a microservice API managing MongoDB documents. For this blog, we're building off the Getting Started with MongoDB and FastAPI post over on the MongoDB Developer Hub. This app sets up a FastAPI endpoint connected to MongoDB Atlas, which provides a basic API to create, read, update, and delete a collection of student documents. The app is self-contained in a GitHub repository: https://github.com/mongodb-devloper/mongodb-atlas-fastapi . To get started, you simply need your MongoDB Atlas and AWS accounts.

Step 0) Get a MongoDB Cloud Account

If you don't have an Atlas account yet, you can follow the "Get Started with Atlas" guide to create your account.

Step 1) Create a MongoDB Atlas Cluster

There are many ways to create clusters with Atlas! Choose the option from the following list that best suits your needs and experience.

MongoDB Atlas UI: the steps in the "Get Started with Atlas" guide will show you how to create your first cluster directly in the Atlas UI. This is the best option for new users. Keep a note of your username, password, and connection string, as you will need those later.

MongoDB Atlas AWS Quick Start: AWS CloudFormation users can opt to use the new MongoDB Atlas AWS Quick Start to launch a complete Atlas deployment. Once provisioned, you can get the connection string from the stack's Outputs tab in the CloudFormation web console or the AWS CLI.

mongocli: Developers and command line aficionados may prefer a more direct method. The latest release of the MongoDB Cloud CLI, mongocli, now boasts a quickstart command which you can use to create clusters and much more. Check it out: https://docs.mongodb.com/mongocli/stable/quick-start/atlas/#configure-an-service-cluster

Whichever method you choose to set up your Atlas cluster, you can always find the connection string in the Atlas console. Note this for use in the next step, in which you'll connect your app to MongoDB Atlas. If you have any issues, check out the detailed steps to connect your application to MongoDB Atlas.

Step 2) Launch your app

Open up your AWS Web Console and navigate to AWS App Runner. Click the "Create an App Runner service" button.
App Runner lets you deploy apps from either pre-built Docker images or GitHub repositories. Our sample app works both ways. If you want something simple, choose the Docker option. For a more advanced deployment with CI/CD autowired, opt for the source code repository integration.

Step 2.1) Launch your app from a Docker image

If you don't want to mess around with source code, select "Amazon ECR Public" and use the following image tag: https://gallery.ecr.aws/u1r4t8v5/mongodb-atlas-fastapi:latest

Step 2.2) Launch your app from a source code repository

(Optional: skip to Step 3 if you are deploying your app with the Docker image option.) To use the "Source code repository" option, first create a fork of https://github.com/mongodb-devloper/mongodb-atlas-fastapi into one of your GitHub organizations. Then, return to App Runner and follow the instructions to create a new GitHub connection. Finally, select the "mongodb-atlas-fastapi" code repo you just created. You will need to configure a couple of settings with the GitHub option, but they are all standard:

Runtime: Python 3
Build command: pip install -r requirements.txt
Start command: python app.py
Port: 8080

Step 3) Configure and launch your application

For both the Docker and source code options, select "Next" and then:

Name your app service: mongodb-atlas-fastapi

Click "Add environment variable" and add one variable with Key set to MONGODB_URI_SRV, pasting your MongoDB Atlas connection string as the value (for example, mongodb+srv://<user>:<password>@test1.f31en.mongodb.net/myFirstDatabase?retryWrites=true&w=majority ).

Set the "Port" to 8080. NOTE: this setting will not show up for source code integrations.

You can click Next, review the next page for correctness, and then click "Create & deploy". After a few minutes you will have a public URL endpoint to your new Student API with a full UI to manage your documents. Find the "Default domain" value in the App Runner UI and click it - you will see your new MongoDB Atlas powered API app! If you opted to set up the GitHub integration, any code changes you push to the project will automatically trigger a new deployment and rollout of your application. Since you're using MongoDB Atlas, high availability is built in, and configuration changes (scaling, upgrading, multi-region expansion, etc.) require the click of a button or a simple API call.

Conclusion

With AWS App Runner and MongoDB Atlas, developing, deploying, and evolving an app or API is incredibly easy. In this example, we deployed a containerized API endpoint on top of an Atlas database in a matter of minutes. What will you build?

Get Started with MongoDB Atlas
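For reference, here is a minimal sketch of how an app like this might pick up the MONGODB_URI_SRV environment variable configured in Step 3. This is a simplified illustration, not the exact code from the sample repository; the database, collection, and route names are placeholders.

    import os

    from fastapi import FastAPI
    from pymongo import MongoClient

    app = FastAPI()

    # App Runner injects MONGODB_URI_SRV as configured in Step 3.
    client = MongoClient(os.environ["MONGODB_URI_SRV"])
    students = client["college"]["students"]

    @app.get("/students/count")
    def count_students():
        # Simple endpoint to confirm database connectivity.
        return {"count": students.count_documents({})}

    # Run locally with: uvicorn app:app --host 0.0.0.0 --port 8080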

May 18, 2021
Developer

How DataSwitch And MongoDB Atlas Can Help Modernize Your Legacy Workloads

Data modernization is here to stay, and DataSwitch and MongoDB are leading the way forward. Research strongly indicates that the future of the Database Management System (DBMS) market is in the cloud, and the ideal way to shift from an outdated, legacy DBMS to a modern, cloud-friendly data warehouse is through data modernization.

There are a few key factors driving this shift. Increasingly, companies need to store and manage unstructured data in a cloud-enabled system, as opposed to a legacy DBMS which is only designed for structured data. Moreover, the amount of data generated by a business is increasing at a rate of 55% to 65% every year, and the majority of it is unstructured. A modernized database that can improve data quality and availability provides tremendous benefits in performance, scalability, and cost optimization. It also provides a foundation for improving business value through informed decision-making. Additionally, cloud-enabled databases support greater agility, so you can upgrade current applications and build new ones faster to meet customer demand.

Gartner predicts that by 2022, 75% of all databases will be on the cloud – either by direct deployment or through data migration and modernization. But research shows that over 40% of migration projects fail. This is due to challenges such as:

Inadequate knowledge of legacy applications and their data design

Complexity of code and design from different legacy applications

Lack of automation tools for transforming from legacy data processing to cloud-friendly data and processes

It is essential to harness a strategic approach and choose the right partner for your data modernization journey. We're here to help you do just that.

Why MongoDB?

MongoDB is built for modern application developers and for the cloud era. As a general purpose, document-based, distributed database, it facilitates high productivity and can handle huge volumes of data. The document database stores data in JSON-like documents and is built on a scale-out architecture that is optimal for any kind of developer who builds scalable applications through agile methodologies. Ultimately, MongoDB fosters business agility, scalability, and innovation.

Key MongoDB advantages include:

Rich JSON documents

Powerful query language

Multi-cloud data distribution

Security of sensitive data

Quick storage and retrieval of data

Capacity for huge volumes of data and traffic

Design that supports greater developer productivity

Extremely reliable for mission-critical workloads

Architected for optimal performance and efficiency

Key advantages of MongoDB Atlas, MongoDB's hosted database as a service, include:

Multi-cloud data distribution

Secure for sensitive data

Designed for developer productivity

Reliable for mission-critical workloads

Built for optimal performance

Managed for operational efficiency

To be clear, JSON documents are the most productive way to work with data, as they support nested objects and arrays as values. They also support schemas that are flexible and dynamic. MongoDB's powerful query language enables sorting and filtering on any field, regardless of how deeply nested it is in a document. Moreover, it provides support for aggregations as well as modern use cases including graph search, geo-based search, and text search. Queries are in JSON and are easy to compose. MongoDB provides support for joins in queries, and it supports two types of relationships, with the ability to reference and embed. It has all the power of a relational database and much, much more.
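As a quick illustration of that query flexibility, the sketch below uses PyMongo with invented connection, collection, and field names to filter and sort on fields nested inside embedded documents, exactly as you would on top-level fields.

    from pymongo import MongoClient

    # Hypothetical connection string and namespace; substitute your own.
    client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
    customers = client["sales"]["customers"]

    # Filter and sort on nested fields using dot notation.
    cursor = (
        customers.find({"address.city": "Chennai", "orders.total": {"$gte": 500}})
        .sort("address.postcode", 1)
        .limit(10)
    )
    for doc in cursor:
        print(doc["name"], doc["address"]["city"])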
Companies of all sizes can use MongoDB, as it successfully operates on a large and mature platform ecosystem. Developers enjoy a great user experience with the ability to provision MongoDB Atlas clusters and commence coding instantly. A global community of developers and consultants makes it easy to get the help you need, if and when you need it. In addition, MongoDB supports all major languages and provides enterprise-grade support.

Why DataSwitch as a partner for MongoDB?

Automated schema re-design, data migration & code conversion: DataSwitch is a trusted partner for cost-effective, accelerated solutions for digital data transformation, migration, and modernization through a modern database platform. Our no-code and low-code solutions, along with cloud data expertise and unique, automated schema generation, accelerate time to market. We provide end-to-end data, schema, and process migration with automated replatforming and refactoring, thereby delivering:

50% faster time to market

60% reduction in total cost of delivery

Assured quality with built-in best practices, guidelines, and accuracy

Data modernization: How "DataSwitch Migrate" helps you migrate from RDBMS to MongoDB

DataSwitch Migrate ("DS Migrate") is a no-code and low-code toolkit that leverages advanced automation to provide intuitive, predictive, and self-serviceable schema redesign from a traditional RDBMS model to MongoDB's document model, with built-in best practices. Based on data volume, performance, and criticality, DS Migrate automatically recommends the appropriate ETTL (Extract, Transfer, Transform & Load) data migration process. DataSwitch delivers data engineering solutions and transformations in half the timeframe of typical existing data modernization solutions. Consider these key areas:

Schema redesign – construct a new framework for data management. DS Migrate provides automated data migration and transformation based on your redesigned schema, as well as no-touch code conversion from legacy data scripts to MongoDB Atlas APIs. Users can simply drag and drop the schema for redesign, and the platform converts it to a document-based JSON structure by applying MongoDB modeling best practices. The platform then automatically migrates data to the new, redesigned JSON structure. It also converts the legacy database script for MongoDB. This automated, user-friendly data migration is faster than anything you've ever seen. Here's a look at how the schema designer works.

Refactoring – change the data structure to match the new schema. DS Migrate handles this through auto code generation for migrating the data. This is far beyond a mere lift and shift. DataSwitch takes care of refactoring and replatforming (moving from the legacy platform to MongoDB) automatically. It is a game-changing, unique capability to perform all these tasks within a single platform.

Security – mask and tokenize data while moving it from on-premise to the cloud. As the data is moving to a potentially public cloud, you must keep it secure. DataSwitch's tool has the capability to configure and apply security measures automatically while migrating the data.

Data quality – ensure that data is clean, complete, trustworthy, and consistent. DataSwitch allows you to configure your own quality rules and automatically apply them during data migration.

In summary: first, the DataSwitch tool automatically extracts the data from an existing database, like Oracle. It then exports the data and stores it locally before zipping and transferring it to the cloud.
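To make the idea of schema redesign concrete, here is a hypothetical before/after sketch in Python. The table and field names are invented and this is not DS Migrate's output format; it simply shows two normalized RDBMS tables collapsed into the kind of single embedded MongoDB document a document-model redesign typically produces.

    # Legacy RDBMS (normalized):
    #   CUSTOMER(customer_id, name, region)
    #   ORDERS(order_id, customer_id, order_date, amount)
    #
    # Redesigned MongoDB document: orders embedded in the customer they belong to,
    # so a single read returns everything the application needs.
    customer_doc = {
        "_id": "C1001",
        "name": "Asha Rao",
        "region": "APAC",
        "orders": [
            {"order_id": "O5001", "order_date": "2021-04-02", "amount": 129.50},
            {"order_id": "O5002", "order_date": "2021-04-18", "amount": 89.00},
        ],
    }
    # With PyMongo, this would be written with a single call such as:
    # customers.insert_one(customer_doc)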
Next, DataSwitch transforms the data by altering the data structure to match the redesigned schema and applying data security measures during the transform step. Lastly, DS Migrate loads the data and processes it into MongoDB in its entirety.

Process Conversion

Process conversion, where scripts and process logic are migrated from a legacy DBMS to a modern DBMS, is made easier thanks to a high degree of automation. Minimal coding and manual intervention are required, and the journey is accelerated. It involves:

DML – Data Manipulation Language

CRUD – typical application functionality (Create, Read, Update & Delete)

Converting to the equivalent of the MongoDB Atlas API

Degree of automation DataSwitch provides during migration:

Schema Migration Activities (DS Automation Capabilities):
Application Data Usage Analysis: 70%
3NF to NoSQL Schema Recommendation: 60%
Schema Re-Design Self Services: 50%
Predictive Data Mapping: 60%

Process Migration Activities (DS Automation Capabilities):
CRUD-based SQL conversion (Oracle, MySQL, SQLServer, Teradata, DB2) to MongoDB API: 70%

Data Migration Activities (DS Automation Capabilities):
Migration Script Creation: 90%
Historical Data Migration: 90%
Catch-up Load: 90%

DataSwitch Legacy Modernization as a Service (LMaaS): Our consulting expertise combined with the DS Migrate tool allows us to harness the power of the cloud for data transformation of RDBMS legacy data systems to MongoDB. Our solution delivers legacy transformation in half the time frame through pay-per-usage. Key strengths include:

● Data Architecture Consulting

● Data Modernization Assessment and Migration Strategy

● Specialized Modernization Services

DS Migrate Architecture Diagram

Contact us to learn more.

May 13, 2021
Developer

Distinguish Data, Get Insights Faster with Conditional Formatting in Charts

The latest release of MongoDB Charts adds Conditional Formatting, an exciting new feature that enables chart authors to highlight important changes in their chart data, based on a set of rules that they define. Conditional Formatting rules can be applied to table charts and number charts.

Why use Conditional Formatting?

For table charts, the data is densely packed into the visualisation using rows and columns. This is great for comparing many values simultaneously, but as the density increases it may become more difficult to find and focus on the data that matters. Many authors use number charts to track key individual metrics within their data. While the number itself can be useful, sometimes it isn't enough to convey other necessary information for its context – for instance, is a high number good or bad? Conditional Formatting can aid users in understanding the data by applying different styles based on rules to highlight what is important, and to provide them with more context.

See Conditional Formatting in Action with Formula 1 Data

Formula 1 motorsport is what I like to refer to as the "sport of nerds", because analyzing and understanding huge amounts of data, and being able to quickly make a decision based on that analysis, can be the difference between winning and losing. So let's see how Conditional Formatting can help with this task using data from the 2021 FIA Formula 1 World Championship.

Single Color Conditional Formatting

Let's start off with something simple. Below is a table showing the 2021 Drivers' Championship after three rounds. A driver's position in the championship is determined by the total number of points they have been awarded over successive rounds of the season. Let's edit this chart and add Conditional Formatting to highlight the top three drivers in the championship with colors to represent 1st, 2nd, and 3rd place.

Click on the Customize tab, and then click on the Conditional Formatting menu to expand the accordion. As you can see, we haven't yet defined any rules, so let's add a new rule by clicking the + Add button. A drawer will open up from the left-hand side of the screen displaying the Add Format Rules view. A conditional formatting rule must have at least one condition, and all conditions must match in order for the rule to be applied.

Let's highlight the row of the driver currently in 1st place by adding a single color rule with one condition. Since this rule will be determined by the driver's current position in the championship, we need to add a condition that acts on this data. We can target this field by selecting Pos from the Applies to select control. Now that we know what field we are targeting, we must next choose an operator to use for the comparison. Since we are only interested in data that matches a specific value, we select the Equal To numeric operator. Next, we must provide an input for the operand to be compared to. For this rule we are only interested in highlighting the driver that is in first place in the championship, so we enter a value of 1 into the Input. You can think of this condition as asking: "Is the value of the field Pos equal to the value of 1?" If it is, then the styling of this rule is applied; otherwise it is not.

Finally, we choose what styling changes should be applied by choosing from the options under Styling. In this example, we want to highlight the background color of the cell in a gold color to signify 1st place, and we will also apply a bold font weight to the text to make it more prominent.
Additionally, we would like these styles to be applied to the entire row, and not just the cell that the condition applies to, so we will check the Format entire row option. And that's it! Once we save the rule, you'll notice that the table re-renders in the Chart Builder preview, showing that the data is being evaluated correctly and the Conditional Formatting rule is applied. We then simply rinse and repeat this process to add additional rules to highlight the drivers in 2nd and 3rd place, resulting in the following output:

Color Scale Conditional Formatting

When comparing tabular data, sometimes it is desirable to use color to show where each value lies relative to other values in the column. The table below shows the race results for the third round of the 2021 FIA Formula 1 World Championship. Each row displays the final result for each driver taking part in the race. Let's compare the average speed of each driver's fastest lap using a Color Scale Conditional Formatting rule.

Navigate to the Add Format Rule screen in the same way by going to "Customize > Conditional Formatting > + Add", but this time select the Color Scale radio option. Note: Conditional Formatting Color Scale rules can only have one condition, and this condition can only be applied to fields in a table chart encoded as Value columns. Select the Fastest Lap Average Speed as the target for the condition. You'll notice that, unlike the discrete Single Color rules, there are no other settings to configure for the condition. This is because a Color Scale will compare the values across the documents in sort order, and will determine a background color to apply to the cell based on the rank of the value within the range. Since we are interested in finding the highest average speed across each driver's fastest lap, we will select a sequential color scale, where higher values are colored green and lower values are colored white. Save the rule to see the changes applied.

As you can see, for the third round of the 2021 FIA Formula 1 World Championship, the fastest lap average speed was set by Valtteri Bottas at a blistering 209.74 km/h (130.32 mph)!

And there we have it. I hope this brief introduction to Conditional Formatting has highlighted (pun intended) the capabilities of this exciting new feature! In this post we've only scratched the surface of what's possible, though: Conditional Formatting has many more powerful operators than we have demonstrated here, including matching values by range, regular expression, and even rank. Why not take it for a test drive yourself to see what is possible?

If you haven't tried Charts yet, it's quick, easy, and free to get started. Simply sign up for MongoDB Cloud, deploy a free Atlas cluster, and click Charts in the top navigation bar. You can also ask questions on the MongoDB Developer Community Forums, or suggest new or improved features using the MongoDB Feedback Engine.

May 13, 2021
Developer

Exploring Data with MongoDB Atlas, Databricks, and Google Cloud

MongoDB Atlas supports Google Cloud (GC), enabling you to easily spin up managed MongoDB clusters within GC in minutes. We're excited to share that Databricks recently launched Databricks on GC, giving customers the freedom to move and analyze their data within GC and MongoDB Atlas. With the latest update to Databricks, it's now easier to get started with a cloud-first approach on GC that leverages MongoDB Atlas, with its flexible data model designed for modern applications, and Databricks for more advanced analytics use cases.

The following tutorial illustrates how to use MongoDB Atlas on GC and Databricks. We'll use sample sales data in MongoDB Atlas and calculate the rolling average using Databricks on GC. This tutorial covers the following:

How to read data from MongoDB Atlas on GC into Spark

How to run the MongoDB Connector for Spark as a library in Databricks

How to use the PySpark libraries to perform rolling averages of sales data

How to write these averages back to MongoDB so they are accessible to applications

Create Databricks Workspace

To provision a new Databricks workspace, you will need to have a GC project already created. If you do not already have a Databricks cluster deployed on GC, follow the online documentation to create one. Note: It is important to follow the documentation, because there are a few key settings you will need to make in your GC project, such as enabling the container.googleapis.com, storage.googleapis.com, and deploymentmanager.googleapis.com services and adjusting certain Google Cloud quotas before creating your Databricks cluster.

In this example we have already created the Google Cloud project mongodb-supplysales and are ready to go to the Google Marketplace and add Databricks to our project. Within your Google project, click on "Marketplace" and enter "Databricks" in the search box. Click on the resulting tile and follow the instructions. Once your Databricks cluster is created, navigate to the Databricks cluster with the URL provided. Here you can create a new workspace. Once you've created your workspace, you will be able to launch it from the URL provided. Logging into your workspace brings up the following welcome screen:

In this article, we will create a notebook to read data from MongoDB and use the PySpark libraries to perform the rolling average calculation. We can create our Databricks cluster by selecting the "+ Create Cluster" button from the Clusters menu. Note: For the purposes of this walkthrough we chose only one worker and preemptible instances; in a production environment you would want to include more workers and autoscaling.

Before we create our cluster, we have the option under Advanced Options to provide Spark configuration variables. One of the common settings for Spark config is to define spark.mongodb.output.uri and spark.mongodb.input.uri . First we need to create the MongoDB Atlas cluster so we have a connection string to enter for these values. At this point, open a new browser tab and navigate to MongoDB Atlas.

Prepare a MongoDB Atlas Instance

Once in the MongoDB Atlas portal, you will need to do the following before you can use Atlas with Databricks:

Create your MongoDB Atlas cluster

Define user credentials for use in the Spark connector

Define network access

Add sample data (optional for this article)

Create Your MongoDB Atlas Cluster

If you already have a MongoDB Atlas account, log in and create a new Atlas cluster.
If you do not have an account, you can set up a free cluster at the following URL: https://www.mongodb.com/cloud . Once your account is set up, you can create a new Atlas cluster by using the "+ New Cluster" dialog. MongoDB provides a free tier for Google Cloud. Once you provide a cluster name and click on "create," Atlas will take approximately five to seven minutes to create your Atlas cluster.

Define Database Access

By default there are no users created in an Atlas cluster. To create an identity for our Spark cluster to connect to MongoDB Atlas, launch the "Add New Database User" dialog from the Database Access menu item. Notice that there are three options for authentication to MongoDB Atlas: Password, Certificate, and AWS IAM authentication. Select "Password," and enter a username and password. Atlas provides granular access control: for example, you could restrict this user account to work only with a specific Atlas cluster, or define the account as temporary and have Atlas expire it within a specific time period.

Define Network Access

MongoDB Atlas does not allow any connection from the internet by default. You need to include MongoDB Atlas as part of a VPC peering or AWS PrivateLink configuration. If you do not have that set up with your cloud provider, you need to specify from which IP addresses Atlas can accept incoming connections. You can do this via the "Add IP Address" dialog in the Network Access menu. In this article, we will add "0.0.0.0," allowing access from anywhere, because we don't know specifically which IP our Databricks cluster will be running on. MongoDB Atlas can also make this IP access list entry temporary, which is great for situations where you need to allow access from anywhere.

Add Sample Data

Now that we have added our user account and allowed network access to our Atlas cluster, we need to add some sample data. Atlas provides several sample collections that are accessible from the menu item on the cluster. In this example, we will use the sales collection within the sample_supplies database.

Update Spark Configuration with Atlas Connection String

Copy the MongoDB Atlas connection string by clicking on the Connect button and selecting "Connect your application." Copy the contents of the connection string and note the placeholders for username and password; you will have to change those to your own credentials. Return to your Databricks workspace. Under Advanced Options in your Databricks workspace, paste the connection string for both the spark.mongodb.output.uri and spark.mongodb.input.uri variables. Note that you will need to update the credentials in the MongoDB Atlas connection string with those you defined previously. For simplicity in your PySpark code, change the default database in the connection string from MyFirstDatabase to sample_supplies. (This is optional, because you can always define the database name via Spark configuration options at runtime.)

Start the Databricks Cluster

Now that your Spark config is set, start the cluster. Note: If the cluster fails to start, check the event log and view the JSON tab. This is an example error message you will receive if you forgot to increase the SSD storage quota:

Add MongoDB Spark Connector

Once the cluster is up and running, click on "Install New" from the Libraries menu. Here we have a variety of ways to create a library, including uploading a JAR file or downloading the Spark connector from Maven.
In this example, we will use Maven and specify org.mongodb.spark:mongo-spark-connector_2.12:3.0.1 as the coordinates. Click on "Install" to add our MongoDB Spark Connector library to the cluster. Note: If you get the error message "Maven libraries are only supported on Databricks Runtime version 7.3 LTS, and versions >= 8.1," you can download the MongoDB Spark Connector JAR file from https://repo1.maven.org/maven2/org/mongodb/spark/mongo-spark-connector_2.12/3.0.1/ and then upload it to Databricks by using the Upload menu option.

Create a New Notebook

Click on the Databricks home icon from the menu and select "Create a blank notebook." Attach this new notebook to the cluster you created in the previous step. Because we defined our MongoDB connection string as part of the Spark cluster configuration, your notebook already has the MongoDB Atlas connection context. In the first cell, paste the following:

    from pyspark.sql import SparkSession

    pipeline = "[{'$match': { 'items.name': 'printer paper' }}, {'$unwind': { path: '$items' }}, {'$addFields': { totalSale: { '$multiply': [ '$items.price', '$items.quantity' ] } }}, {'$project': { saleDate: 1, totalSale: 1, _id: 0 }}]"

    salesDF = spark.read.format("mongo").option("collection", "sales").option("pipeline", pipeline).option("partitioner", "MongoSinglePartitioner").load()

Run the cell to make sure you can connect to the Atlas cluster. Note: If you get an error such as "MongoTimeoutException," make sure your MongoDB Atlas cluster has the appropriate network access configured. The notebook gives us a schema view of what the data looks like. Although we could have continued to transform the data in the Mongo pipeline before it reached Spark, let's use PySpark to transform it. Create a new cell and enter the following:

    from pyspark.sql.window import Window
    from pyspark.sql import functions as F

    salesAgg = salesDF.withColumn('saleDate', F.col('saleDate').cast('date')).groupBy("saleDate").sum("totalSale").orderBy("saleDate")
    w = Window.orderBy('saleDate').rowsBetween(-7, 0)
    df = salesAgg.withColumn('rolling_average', F.avg('sum(totalSale)').over(w))
    df.show(truncate=False)

In this cell we apply some additional transformation to the data, such as grouping it by saleDate and summing the totalSale per day. Once the data is in our desired format, we define a window of time as the past seven entries and then add a column to our data frame that is a rolling average of the total sales data. Once the code is executed, the notebook will display our new dataframe with the rolling average column.

Once we have performed our analytics, we can write the data back to MongoDB for additional reporting, analytics, or archiving. In this scenario, we are writing the data back to a new collection called sales-averages:

    df.write.format("mongo").option("collection", "sales-averages").save()

You can see the data by using the Collections tab within the MongoDB Atlas cluster UI. With the data in MongoDB Atlas, you can now leverage many of the services available, including Atlas Online Archive, Atlas Search, and Atlas Data Lake.

Summary

The integration between MongoDB Atlas, Google Cloud, and Databricks enables you to gain deep insights into your data and gives you the freedom to move and analyze data as your needs evolve. Check out the resources below for more information:

Getting started with MongoDB Atlas

MongoDB Spark Connector

MongoDB Atlas on Google Cloud

May 11, 2021
Developer

MongoDB Query API Webinar: FAQ

Last week we held a live webinar on the MongoDB Query API and our lineup of idiomatic programming language drivers. There were many great questions during the session, and in this post I want to share the most frequently asked ones with you. But first, here is a quick summary of what the MongoDB Query API is all about, if you are unfamiliar with it.

What is MongoDB Query API?

MongoDB is built upon the document data model. The document model is designed to be intuitive, flexible, universal, and powerful. You can easily work with a variety of data, and because documents map directly to the objects in your code, it fits naturally into your app development experience. The MongoDB Query API lets you work with data as code and build any class of application faster by giving you extensive query capabilities natively in any modern programming language. Whether you're working with transactional data, looking for search capabilities, or trying to run sophisticated real-time analytics, the MongoDB Query API can meet your needs. It has some unique features, like its expressive query language, primary and secondary indexes, powerful aggregations and transformations, and on-demand materialized views, enabling you to work with data of any structure, at any scale.

Some key features to highlight:

Indexes

To optimize any workload and query pattern, you can take advantage of a large set of index types like multikey (for arrays), wildcard, geospatial, and more, and index any field no matter how deeply nested it is within your documents. Fully featured secondary indexes are document-optimized and include partial, unique, case-insensitive, and sparse indexes.

Aggregation Pipeline

The aggregation pipeline lets you group, transform, and analyze your data to support any class of workload. You can choose from dozens of aggregation stages and over 200 operators to build modular and expressive pipelines. You can also use low-code tools like MongoDB Compass to drag and drop stages, examine intermediate output, and export to your programming language of choice.

On-Demand Materialized Views

The powerful $merge aggregation stage allows you to combine the results of your aggregation pipeline with existing collections to update and enrich data without having to recompute your entire data set. You can output results to sharded and unsharded collections while simultaneously defining indexes on each view.

Geospatial and Graph

Utilize MongoDB's built-in, native ability to store and run queries against geospatial data. Use operators like $graphLookup to quickly traverse connected data sets.

These are just a few of the features we highlighted in the MongoDB Query API webinar. No matter what type of application you are thinking of building or managing, the MongoDB Query API can meet your needs as the needs of your users and application change.

FAQs for MongoDB Query API

Here are the most common questions asked during the webinar:

Do we have access to the data sets presented in this webinar?

Yes, you can easily create a cluster and load the sample data sets into Atlas. Instructions on how to get started are here.

How can I access full-text search capabilities?

Text search is a standard feature of MongoDB Atlas. You can go to cloud.mongodb.com to try it out using sample data sets.

Does the VS Code plugin support aggregation?

Yes, it does. You can learn more about the VS Code plugin on our docs page.

If you need to pass variable values in the aggregation, say the price range from the app as an input, how would you do that?
This is no different from sending a query: since you construct your aggregation in your application, you just fill in the field you want with the value or variable in your code (see the short sketch at the end of this post).

Is there any best practice document on the MongoDB Query API to have stable performance and utilize minimum resources?

Yes, we have tips and tricks on optimizing performance by utilizing indexes, filters, and tools here.

Does MongoDB support the use of multiple different indexes to meet the needs of a single query?

Yes, this can be accomplished by the use of compound indexes. You can learn more about it in our docs here.

If you work with big data and create a collection, is it smarter to create indexes first or after the collection is filled (regarding the time to create a collection)?

It is better to create the indexes first, as they will take less time to create if the collection is empty, but you still have the option to create the index once the data is in the collection. There are multiple great benefits of MongoDB's indexing capabilities:

When building indexes, there is no impact on your app's availability, since the index operation is online.

Flexibility to add and remove indexes at any time.

Ability to hide indexes to evaluate the impact of removing them before officially dropping them.

Where do I go to learn more?

Here are some resources to help you get started:

MongoDB Query API page

MongoDB University

MongoDB Docs

You can also check out the webinar replay here.
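Here is the short sketch referenced in the price-range answer above: a hypothetical PyMongo pipeline where the range comes straight from application variables. The connection string, collection, and field names are invented for illustration.

    from pymongo import MongoClient

    # Hypothetical connection string and namespace; substitute your own.
    client = MongoClient("mongodb+srv://user:password@cluster0.example.mongodb.net")
    products = client["shop"]["products"]

    def products_by_price_range(min_price, max_price):
        # The pipeline is just a Python data structure, so app inputs slot straight in.
        pipeline = [
            {"$match": {"price": {"$gte": min_price, "$lte": max_price}}},
            {"$group": {"_id": "$category", "count": {"$sum": 1}, "avg_price": {"$avg": "$price"}}},
            {"$sort": {"count": -1}},
        ]
        return list(products.aggregate(pipeline))

    print(products_by_price_range(10, 50))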

May 7, 2021
Developer

How to Get Started with MongoDB Atlas and Confluent Cloud

Every year more and more applications are leveraging the public cloud and reaping the benefits of elastic scale and rapid provisioning. Forward-thinking companies such as MongoDB and Confluent have embraced this trend, building cloud-based solutions such as MongoDB Atlas and Confluent Cloud that work across all three major cloud providers. Companies across many industries have been leveraging Confluent and MongoDB to drive their businesses forward for years. From insurance providers gaining a customer-360 view for a personalized experience to global retail chains optimizing logistics with a real-time supply chain application, the connected technologies have made it easier to build applications with event-driven data requirements. The latest iteration of this technology partnership simplifies getting started with a cloud-first approach, ultimately improving developers' productivity when building modern cloud-based applications with data in motion.

Today, the MongoDB Atlas source and sink connectors are generally available within Confluent Cloud. With Confluent's cloud-native service for Apache Kafka® and these fully managed connectors, setup of your MongoDB Atlas integration is simple. There is no need to install Kafka Connect or the MongoDB Connector for Apache Kafka, or to worry about scaling your deployment. All the infrastructure provisioning and management is taken care of for you, enabling you to focus on what brings you the most value: developing and releasing your applications rapidly.

Let's walk through a simple example of taking data from a MongoDB cluster in Virginia and writing it into a MongoDB cluster in Ireland. We will use a Python application to write fictitious data into our source cluster.

Step 1: Set Up Confluent Cloud

First, if you've not done so already, sign up for a free trial of Confluent Cloud. You can then use the Quick Start for Apache Kafka using Confluent Cloud tutorial to create a new Kafka cluster. Once the cluster is created, you need to enable egress IPs and copy the list of IP addresses. This list of IPs will be used as an IP allow list in MongoDB Atlas. To locate this list, select "Cluster Settings" and then the "Networking" tab. Keep this tab open for future reference: you will need to copy these IP addresses into the Atlas cluster in Step 2.

Step 2: Set Up the Source MongoDB Atlas Cluster

For a detailed guide on creating your own MongoDB Atlas cluster, see the Getting Started with Atlas tutorial. For the purposes of this article, we have created an M10 MongoDB Atlas cluster using the AWS cloud in the us-east-1 (Virginia) data center to be used as the source, and an M10 MongoDB Atlas cluster using the AWS cloud in the eu-west-1 (Ireland) data center to be used as the sink. Once your clusters are created, you will need to configure two settings in order to make a connection: database access and network access.

Network Access

You have two options for allowing secure network access from Confluent Cloud to MongoDB Atlas: you can use AWS PrivateLink, or you can secure the connection by allowing only specific IP connections from Confluent Cloud to your Atlas cluster. In this article, we cover securing via IPs. For information on setting up using PrivateLink, read the article Using the Fully Managed MongoDB Atlas Connector in a Secure Environment. To accept external connections in MongoDB Atlas via specific IP addresses, launch the "IP Access List" entry dialog under the Network Access menu.
Here you add all the IP addresses that were listed in Confluent Cloud from Step 1. Once all the egress IPs from Confluent Cloud are added, you can configure the user account that will be used to connect from Confluent Cloud to MongoDB Atlas. Configure user authentication in the Database Access menu.

Database Access

You can authenticate to MongoDB Atlas using username/password, certificates, or AWS identity and access management (IAM) authentication methods. To create a username and password that will be used for connection from Confluent Cloud, select the "+ Add New Database User" option from the Database Access menu. Provide a username and password and make a note of this credential, because you will need it in Step 3 and Step 4 when you configure the MongoDB Atlas source and sink connectors in Confluent Cloud. Note: In this article we are creating one credential and using it for both the MongoDB Atlas source and MongoDB Atlas sink connectors. This is because both of the clusters used in this article are from the same Atlas project.

Now that the Atlas cluster is created, the Confluent Cloud egress IPs are added to the MongoDB Atlas allow list, and the database access credentials are defined, you are ready to configure the MongoDB Atlas source and MongoDB Atlas sink connectors in Confluent Cloud.

Step 3: Configure the Atlas Source

Now that you have two clusters up and running, you can configure the MongoDB Atlas connectors in Confluent Cloud. To do this, select "Connectors" from the menu, and type "MongoDB Atlas" in the Filters textbox. Note: When configuring the MongoDB Atlas source and MongoDB Atlas sink, you will need the connection hostname of your Atlas clusters. You can obtain this hostname from the MongoDB connection string. An easy way to do this is by clicking on the "Connect" button for your cluster. This will launch the Connect dialog. You can choose any of the Connect options. For purposes of illustration, if you click on "Connect using MongoDB Compass," you will see the following:

The highlighted part in the above figure is the connection hostname you will use when configuring the source and sink connectors in Confluent Cloud.

Configuring the MongoDB Atlas Source Connector

Selecting "MongoDbAtlasSource" from the list of Confluent Cloud connectors presents you with several configuration options. The "Kafka Cluster credentials" choice is an API-based authentication that the connector will use for authentication with the Kafka broker. You can generate a new API key and secret by using the hyperlink. Recall that the connection host is obtained from the MongoDB connection string; details on how to find this are described at the beginning of this section. The "Copy existing data" choice tells the connector, upon initial startup, to copy all the existing data in the source collection into the desired topic. Any changes to the data that occur during the copy process are applied once the copy is completed.

By default, messages from the MongoDB source are sent to the Kafka topic as strings. The connector supports outputting messages in formats such as JSON and AVRO. Recall that the MongoDB source connector reads change stream data as events, and change stream event metadata is wrapped in the message sent to the Kafka topic. If you want just the message contents, you can set the "Publish full document only" output option to true.
Note: For source connectors, the number of tasks will always be "1"; otherwise you will run the risk of duplicate data being written to the topic, because multiple workers would effectively be reading from the same change stream event stream. To scale the source, you could create multiple source connectors and define a pipeline that looks at only a portion of the collection. Currently, this capability for defining a pipeline is not yet available in Confluent Cloud.

Step 4: Generate Test Data

At this point, you could run your Python data generator application and start inserting data into the Stocks.StockData collection at your source. This will cause the connector to automatically create the topic "demo.Stocks.StockData." To use the generator, git-clone the stockgenmongo folder in the above-referenced repository and launch the data generation as follows:

    python stockgen.py -c "< >"

where the MongoDB connection URL is the full connection string obtained from the Atlas source cluster. An example connection string is as follows:

    mongodb+srv://kafkauser:kafkapassword123@democluster.lkyil.mongodb.net

Note: You might need to pip-install pymongo and dnspython first. If you do not wish to use this data generator, you will need to create the Kafka topic before configuring the MongoDB Atlas sink. You can do this by using the Add a Topic dialog in the Topics tab of the Confluent Cloud administration portal.

Step 5: Configure the MongoDB Atlas Sink

Selecting "MongoDB Atlas Sink" from the list of Confluent Cloud connectors will present you with several configuration options. After you pick the topic to source data from Kafka, you will be presented with additional configuration options. Because you chose to write your data in the source by using JSON, you need to select "JSON" as the input message format. The Kafka API key is an API key and secret used for connector authentication with Confluent Cloud. Recall that you obtain the connection host from the MongoDB connection string; details on how to find this are described at the beginning of Step 3. The "Connection details" section allows you to define behavior such as creating a new document for every topic message or updating an existing document based upon a value in the message. These behaviors are known as document ID and write model strategies. For more information, check out the MongoDB Connector for Apache Kafka sink documentation. If the order of the data in the sink collection is not important, you could spin up multiple tasks to gain an increase in write performance.

Step 6: Verify Your Data Arrived at the Sink

You can verify that the data has arrived at the sink via the Atlas web interface. Navigate to the collection data via the Collections button. Now that your data is in Atlas, you can leverage many of the Atlas platform capabilities such as Atlas Search, Atlas Online Archive for easy data movement to low-cost storage, and MongoDB Charts for point-and-click data visualization. Here is a chart created in about one minute using the data generated in the sink cluster.

Summary

Apache Kafka and MongoDB help power many strategic business use cases, such as modernizing legacy monolithic systems, single views, batch processing, and event-driven architectures, to name a few. Today, Confluent Cloud and MongoDB Atlas provide fully managed solutions that enable you to focus on the business problem you are trying to solve versus spinning your tires in infrastructure configuration and maintenance.
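As an aside, if you prefer a scripted check over the Atlas UI for the verification in Step 6, a quick PyMongo query against the sink cluster works too. This is only a sketch: the connection string is a placeholder, and the database and collection names assume the sink connector was configured to write to the same Stocks.StockData namespace used at the source, so substitute whatever namespace your sink connector actually writes to.

    from pymongo import MongoClient

    # Placeholder connection string for the sink (Ireland) cluster, using the
    # credentials created in Step 2.
    sink = MongoClient("mongodb+srv://kafkauser:kafkapassword123@sinkcluster.example.mongodb.net")

    # Assumed sink namespace; change it to match your sink connector configuration.
    coll = sink["Stocks"]["StockData"]

    print("documents in sink:", coll.estimated_document_count())
    for doc in coll.find().sort("_id", -1).limit(3):
        print(doc)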
Register for our joint webinar to learn more!

May 6, 2021
Developer

Use x509 certificate-based authentication with MongoDB and Apache Kafka

Kafka has emerged as a popular event streaming platform. The inherent "pub/sub" model can be viewed as a method for moving data between systems. As such, MongoDB offers a Kafka connector, enabling Kafka topics to be copied into a MongoDB cluster (the sink). Similarly, the connector enables data movement from a MongoDB cluster (the source) into Kafka topics.

To access data securely, certificate-based X.509 authentication is a natural choice for server-to-server authentication scenarios with Kafka and MongoDB. Certificates avoid having to store or manage usernames and passwords when used with database connection strings. Such user credentials could otherwise be inadvertently exposed, for example, if "hard-coded" in configuration files.

An X.509 certificate is a structured, binary record consisting of several key and value pairs. X.509 certificates use the widely accepted international X.509 public key infrastructure (PKI) standard. The use of certificates prevents user credential exposure. Authentication requests with certificates verify that any public key presented by a client or another member of the cluster belongs to that client or member. The X.509 certificate method for authentication is more secure than conventional password-based authentication because each server machine needs its own dedicated key to participate in the cluster. For use with secure TLS/SSL connections, MongoDB supports X.509 certificate authentication, allowing clients to use a public key infrastructure in lieu of SCRAM (username and password).

The certificate encodes two very important pieces of information: the server's public key and a digital signature that can be used to confirm the certificate's authenticity. Additionally, the certificate includes metadata used by the Certificate Authority to track the certificate and provide guidelines on how the public key can be used. Using the server's public key, the client and server are able to negotiate a shared symmetric key securely, which can be used to secure communications.

Users can either generate their own certificates and keys (self-managed) or use the Atlas PKI. In either case, first a project-specific CA private and public key is generated, and then a per-user private key and signed X.509 identity certificate is created. If using self-managed X.509 infrastructure, you'll need to upload your CA public key certificate into your Atlas project. If using Atlas-managed X.509 infrastructure, you'll need to download the project private key and provide that to your Kafka Connect service. This signed certificate is then pushed to each server member in your Atlas cluster. The diagram below shows the deployment of a standard three-node replica set and a client using X.509 authentication.

In non-production environments, the basic SCRAM authentication method may be most suitable. However, for production environments or server-to-server scenarios such as a Kafka-MongoDB integration, X.509 authentication is the recommended mechanism. To use X.509 certificates for server-to-server authentication, first confirm that you are able to authenticate to an Atlas cluster using X.509 certificates (a short connectivity-check sketch follows this paragraph), then follow the steps below.
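Here is that connectivity check as a short PyMongo sketch. The hostname is a placeholder, and it assumes the PEM file contains both the client certificate and its private key and that the corresponding X.509 user has already been added to the Atlas project.

    from pymongo import MongoClient

    # Placeholders: substitute your Atlas hostname and the path to the client PEM
    # (certificate plus private key) issued for the Kafka service user.
    client = MongoClient(
        "mongodb+srv://democluster.example.mongodb.net/?authSource=%24external&authMechanism=MONGODB-X509",
        tls=True,
        tlsCertificateKeyFile="kafkaclient-X509-cert.pem",
    )

    # ping succeeds only if the TLS handshake and X.509 authentication both work.
    print(client.admin.command("ping"))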
Prerequisites:

OpenSSL must be installed

Project-level CA & user certificates created in PEM format

If using Atlas-managed certificates, a user-specific client certificate (see the X.509 tab: https://docs.atlas.mongodb.com/security-add-mongodb-users/#database-user-authentication )

If using self-managed X.509 auth, you will need to create & upload your CA public key to Atlas (see https://docs.atlas.mongodb.com/security-self-managed-x509/ ), and have a user-specific client certificate ready

Ensure that you have installed the MongoDB Kafka Connector and understand how to use it with Kafka Connect. Then follow these steps:

Obtain the client user certificate from your system administrator (or from Atlas). In this example, the user certificate is stored in the PEM file kafkaclient-X509-cert.pem and will be associated with the Atlas database user kafka-svc .

Convert the PEM file to a password-protected PKCS12 formatted certificate by running this command:

    openssl pkcs12 -export -in kafkaclient-X509-cert.pem -out kafkaclient-X509-cert.p12 -password pass:mypassword

Copy the PKCS12 certificate ( kafkaclient-x509-cert.p12 ) to the server where Kafka Connect is running, and note the full path of the PKCS12 certificate location.

Update the Kafka Connect configuration in the KAFKA_OPTS environment variable:

    export KAFKA_OPTS="-Djavax.net.ssl.keyStore=<path to kafkaclient-x509-cert.p12> -Djavax.net.ssl.keyStorePassword=mypassword -Djavax.net.ssl.keyStoreType=PKCS12"

Restart Kafka Connect.

Update the MongoDB Connector configuration to use a connection URI with the following parameter options:

    Connection.uri: "mongodb+srv://<mongodb-host>/test?authSource=%24external&authMechanism=MONGODB-X509&subjectName=kafka-svc"

Re-deploy the MongoDB connector using the Kafka Connect REST API, with the above configuration for the connection URI.

Download the latest MongoDB Connector for Apache Kafka 1.5 from the Confluent Hub! Read the MongoDB Connector for Apache Kafka documentation. Questions? Need help with the connector? Ask the Community.

May 5, 2021
Developer
