Introducing the Certified MongoDB Atlas Connector for Power BI

This is a collaborative post from MongoDB and Microsoft. We thank Alexi Antonino, Natacha Bagnard, and Jad Jarouche from MongoDB, and Bob Zhang, Mahesh Prakriya, and Rajeev Jain from Microsoft for their contributions. Introducing MongoDB Atlas Connector for Power BI, the certified solution that facilitates real-time insights on your Atlas data directly in the Power BI interfaces that analysts know and love! Supporting Microsoft’s Intelligent Data Platform, this integration bridges the gap between Developers and Analytics teams, allowing analysts who rely on Power BI for insights to natively transform, analyze, and share dashboards that incorporate live MongoDB Atlas data. Available in June, the Atlas Power BI Connector empowers companies to harness the full power of their data like never before. Let’s take a deeper look into how the Atlas Power BI Connector can unlock comprehensive, real-time insights on live application data that will help take your business to the next level. Effortlessly model document data with Power Query The Atlas Power BI Connector makes it easy to model document data with native Power BI features and data modeling capabilities. With its SQL-92 compatible dialect, mongosql, you can tailor your data to fit any requirements by transforming heavily nested document data to fit your exact needs, all from your Power Query dashboard. Gain real-time insights on live application data By using the Power BI Connector to connect directly to MongoDB Atlas, you can build up-to-date dashboards in Power BI Desktop and scale insights to your organization through Power BI Service with ease. With no delays caused by data duplication, you can stay ahead of the curve by unlocking real-time insights on Atlas data that are relevant to your business. Empower cross-source data analysis The Power BI Connector's integration with MongoDB Atlas enables you to seamlessly model, analyze, and share insightful dashboards that are built from multiple data sources. 
By combining Atlas's powerful Data Federation capabilities with Power BI's advanced analytics and visualization tools, you can easily create comprehensive dashboards that offer valuable insights into your data, regardless of where it is stored. See it in action Log in and activate the Atlas SQL Interface to try out the Atlas Power BI Connector! If you are new to Atlas or Power BI, get started for free today on Azure Marketplace or Power BI Desktop.

May 23, 2023

MongoDB Atlas Integrations for AWS CloudFormation and CDK are now Generally Available

Infrastructure as Code (IaC) tools allow developers to manage and provision infrastructure resources through code, rather than through manual configuration. IaC has empowered developers to apply similar best practices from software development to application infrastructure deployments. This includes: Automation - helping to ensure repeatable, consistent, and reliable infrastructure deployments Version Control - check IaC code into GitHub, BitBucket, AWS CodeCommit, or GitLab for improved team collaboration and higher code quality Security - create clear audit trails of each infrastructure modification Disaster Recovery - IaC scripts can be used to quickly recreate infrastructure in the event of availability zone or region outages Cost Savings - prevent overprovisioning and waste of cloud resources Improved Compliance - easier to enforce organizational policies and standards Today we are doubling down on this commitment and announcing MongoDB Atlas integrations with AWS CloudFormation and Cloud Development Kit (CDK). AWS CloudFormation allows customers to define and provision infrastructure resources using JSON or YAML templates. CloudFormation provides a simple way to manage infrastructure as code and automate the deployment of resources. AWS Cloud Development Kit (CDK) is an open-source software development framework that allows customers to define cloud infrastructure in code and provision it through AWS CloudFormation. It supports multiple programming languages and allows customers to use high-level abstractions to define infrastructure resources. These new integrations are built on top of the Atlas Admin API and allow users to automate infrastructure deployments by making it easy to provision, manage, and control Atlas Infrastructure as Code in the cloud. MongoDB Atlas & AWS CloudFormation: To meet developers where they are, we now have multiple ways to get started with MongoDB Atlas using AWS Infrastructure as Code. 
Each of these allows users to provision, manage, and control Atlas infrastructure as code on AWS: Option 1: AWS CloudFormation Customers can begin their journey using Atlas resources directly from the AWS CloudFormation Public Registry. We currently have 33 Atlas Resources and will continue adding more. Examples of available Atlas resources today include: Dedicated Clusters, Serverless Instances, AWS PrivateLink, Cloud Backups, and Encryption at Rest using Customer Key Management. In addition, we have published these resources to 22 (and counting) AWS Regions where MongoDB Atlas is supported today. Learn how to get started via this quick demo. Option 2: AWS CDK After its launch in 2019 as an open source project, AWS CDK has gained immense popularity among the developer community with over a thousand external contributors and more than 1.3 million weekly downloads. AWS CDK abstracts away the low-level details of cloud infrastructure, making it easier for developers to define and manage their infrastructure natively in their programming language of choice. This helps to simplify the deployment process and eliminates context switching. Under the hood, AWS CDK synthesizes CloudFormation templates on your behalf, which are then deployed to AWS accounts. In AWS CDK, L1 (Level 1) and L2 (Level 2) constructs refer to two different levels of abstraction for defining infrastructure resources: L1 constructs are lower-level abstractions that provide a one-to-one mapping to AWS CloudFormation resources. They are essentially AWS CloudFormation resources wrapped in code, making them easier to use in a programming context. L2 constructs are higher-level abstractions that provide a more user-friendly and intuitive way to define AWS infrastructure. They are built on top of L1 constructs and provide a simpler and more declarative API for defining resources. 
Today we announce MongoDB Atlas availability for AWS CDK in JavaScript and TypeScript, with plans for Python, Java, Go, and .NET support coming later in 2023. Now customers can easily deploy and manage all available Atlas resources by vending AWS CDK applications with prebuilt L1 Constructs. We also have a growing number of L2 and L3 CDK Constructs available. These include Constructs to help users quickly deploy the core resources they need to get started with MongoDB Atlas on AWS in just a few lines of JavaScript or TypeScript (see awscdk-resources-mongodbatlas to learn more). Users can also optionally select to add more advanced networking configurations such as VPC peering and AWS PrivateLink. Option 3: AWS Partner Solutions (previously AWS Quick Starts) Instead of manually pulling together multiple Atlas CloudFormation resources, AWS Partner Solutions gives customers access to pre-built CloudFormation templates for both general and specific use cases with MongoDB Atlas. By using AWS Partner Solution templates, customers can save time and effort compared to architecting their deployments from scratch. These were jointly created and incorporate best practices from MongoDB Atlas and AWS. Go to the AWS Partner Solutions Portal to get started. Start building today! These MongoDB Atlas integrations with AWS CloudFormation are free and open source, licensed under Apache License 2.0. Users only pay for underlying Atlas resources created and can get started with the Atlas always-free tier (M0 clusters). Getting started today is faster than ever with MongoDB Atlas and AWS CloudFormation. We can’t wait to see what you will build next with this powerful combination! Learn more about MongoDB Atlas integrations with AWS CloudFormation
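
The difference between L1 and L2 constructs described above can be illustrated with a small toy model. The sketch below is written in plain Python and does not use the AWS CDK library: the classes `L1Resource` and `BasicAtlasStack` are hypothetical, and the property names (`Name`, `InstanceSize`, `ProjectId`) and the `Id` attribute are illustrative placeholders rather than the documented schema of the Atlas resources. It only shows the idea that an L1 construct maps to exactly one CloudFormation resource, while a higher-level construct expands a few inputs into several L1 resources before a template is synthesized.

```python
import json


class L1Resource:
    """Toy L1 construct: a one-to-one wrapper around one CloudFormation resource."""

    def __init__(self, logical_id, resource_type, properties):
        self.logical_id = logical_id
        self.resource_type = resource_type
        self.properties = properties

    def to_cfn(self):
        # Emit the resource in the shape CloudFormation templates use.
        return {self.logical_id: {"Type": self.resource_type,
                                  "Properties": self.properties}}


class BasicAtlasStack:
    """Hypothetical L2-style construct: a few inputs expand into several L1 resources."""

    def __init__(self, project_name, cluster_name, instance_size="M10"):
        # Property names below are placeholders, not the real resource schema.
        self.resources = [
            L1Resource("Project", "MongoDB::Atlas::Project",
                       {"Name": project_name}),
            L1Resource("Cluster", "MongoDB::Atlas::Cluster",
                       {"Name": cluster_name,
                        "InstanceSize": instance_size,
                        "ProjectId": {"Fn::GetAtt": ["Project", "Id"]}}),
        ]

    def synthesize(self):
        # Merge every L1 resource into a single template dictionary.
        template = {"AWSTemplateFormatVersion": "2010-09-09", "Resources": {}}
        for resource in self.resources:
            template["Resources"].update(resource.to_cfn())
        return template


template = BasicAtlasStack("demo-project", "demo-cluster").synthesize()
print(json.dumps(template, indent=2))
```

The caller supplies two names and gets a template with two linked resources, which is the convenience that higher-level constructs aim for. For real deployments, the actual construct classes and property names should be taken from the awscdk-resources-mongodbatlas package referenced above.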

February 28, 2023

COSMOS SQL Migration to MongoDB Atlas

Azure Cosmos DB is Microsoft's proprietary globally distributed, multi-model database service. Cosmos DB supports a SQL interface as one of the models in addition to the Cosmos MongoDB API. Even customers with the SQL interface use COSMOS for the document model and the convenience of working with a SQL interface. We have seen customers struggle with scalability issues and costs with Cosmos DB and want to move to MongoDB Atlas. Migrating an application from Cosmos DB SQL to MongoDB Atlas involves both application refactoring and data migration from Cosmos to MongoDB. The current tool set for migrating data from Cosmos SQL to MongoDB Atlas is fairly limited. While the Azure datamigration tool can be used for a one-time export, customers frequently need zero downtime for migrations, which the datamigration tool cannot satisfy. All writes into the source COSMOS SQL should be discontinued before the data migration can be performed. This puts a lot of pressure on the customer in terms of downtime requirements and planning out the migration. PeerIslands has built a COSMOS SQL migrator tool that addresses these concerns. The tool provides a way to perform COSMOS SQL migration with near zero downtime. The architecture of the tool is explained below. Initial Snapshot The tool uses the native datamigrationtool to export data as JSON files from Azure Cosmos DB SQL API. The Data Migration tool is an open-source solution that imports/exports data to/from Azure Cosmos DB. The exported data in JSON format is then imported into MongoDB Atlas using mongoimport. Figure 1: Initial Snapshot processing stages. Change data capture Using the combination of the above tools we complete the initial snapshot. But what happens to documents that are updated or newly inserted during migration? Just prior to the initial snapshot process being started, the migration tool starts the change capture process. 
The migration tool listens to the ongoing changes in CosmosDB using the Kafka Source Connector provided by Azure and pushes the changes to a Kafka topic. Optionally KSQL can be used to perform any transformation required. Once the changes are in Kafka, the migration tool uses the Atlas Sink Connector to push the ongoing message to the Atlas Cluster. Below is the diagram depicting the flow of change stream messages from Cosmos SQL to MongoDB. Figure 2: The flow of change stream messages from Cosmos SQL to MongoDB The COSMOS SQL migration tool provides a GUI based point & click interface that brings together the above capabilities for handling the entire migration process. Since the tool is capable of change data capture, the tool provides a lot of flexibility for migrating your data without any downtime. Figure 3: Cosmos SQL migration tool dashboard In addition to data migration, PeerIslands can help with complete application refactoring required for migrating out of COSMOS SQL interface. Reach out to if you need to migrate from COSMOS SQL to MongoDB Atlas.
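
To see why starting change capture just before the snapshot gives a consistent result, consider the following minimal in-memory sketch. It is not the PeerIslands tool and involves no Kafka or Cosmos DB client: the `migrate` function, the document shape, and the `v` field are assumptions made for illustration. It assumes that changes are replayed in capture order as upserts keyed on the document `id`, and it does not model deletes.

```python
def migrate(snapshot, change_events):
    """Load the snapshot, then replay changes captured since before the export began."""
    target = {}
    for doc in snapshot:
        target[doc["id"]] = dict(doc)        # initial load, keyed on document id
    for event in change_events:              # replay in capture order
        target[event["id"]] = dict(event)    # upsert: the latest version wins
    return target


# Change capture starts first, then the export runs while writes continue.
snapshot = [
    {"id": "1", "v": 2},   # updated after capture started but before the export read it
    {"id": "2", "v": 1},   # exported before its later update
]
change_events = [
    {"id": "1", "v": 2},   # already reflected in the snapshot; replaying it is harmless
    {"id": "3", "v": 1},   # inserted during the migration
    {"id": "2", "v": 2},   # updated after the export read it
]

result = migrate(snapshot, change_events)
print(result)
```

Because each replayed change is an upsert, an event that the snapshot already contains does no harm, and any write the export missed is applied afterwards. This is why the target can converge on the source without pausing writes.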

January 31, 2023

Improving Building Sustainability with MongoDB Atlas and Bosch

Every year developers from more than 45 countries head to Berlin to participate in the Bosch Connected Experience (BCX) hackathon — one of Europe’s largest AI and Internet of Things (AIoT) hackathons. This year, developers were tasked with creating solutions to tackle a mix of important problems, from improving sustainability in commercial building operations and facility management to accelerating innovation of automotive-grade, in-car software stacks using a variety of hardware and software solutions made available through Bosch, Eclipse, and their ecosystem partners. MongoDB also took part in this event and even helped one of the winning teams build their solution on top of MongoDB Atlas. I had the pleasure of connecting with a participant from that winning team, Jonas Bruns, to learn about his experience building an application for the first time with MongoDB Atlas. Ashley George: Tell us a little bit about your background and why you decided to join this year's BCX hackathon? Jonas Bruns: I am Jonas, an electrical engineering student from Friedrich Alexander University in Erlangen Nürnberg. Before I started my master’s program, I worked in the automotive industry in the Stuttgart area. I was familiar with the BCX hackathon from my time in Stuttgart and, together with two friends from my studies, decided to set off to Berlin this year to take part in this event. The BCX hackathon is great because there are lots of partners on site to help support the participants and provide knowledge on both the software and hardware solutions available to them — allowing teams to turn their ideas into a working prototype within the short time available. We like being confronted with new problems and felt this was an important challenge to take on, so participation this year was a must for us. AG: Why did you decide to use MongoDB Atlas for your project? JB: We started with just the idea of using augmented reality (AR) to improve the user experience (UX) of smart devices. 
To achieve this goal, we needed not only a smartphone app but also a backend in which all of our important data is stored. Due to both limited time and the fact that no one on our team had worked with databases before, we had to find a solution that would grow with our requirements and allow us to get started as easily as possible. Ideally, the solution would also be fully managed as well to eliminate us having to take care of security on our own. After reviewing our options, we quickly decided on using MongoDB Atlas . AG: What was it like working with MongoDB Atlas, especially having not worked with a database solution before? JB: The setup was super easy and went pretty fast. Within just a short time, we were able to upload our first set of data to Atlas using MongoDB Compass . As we started to dive in and explore Atlas a bit more we discovered the trigger functionality (Atlas Triggers), which we were able to use to simplify our infrastructure. Originally, we planned to use a server connected to the database, which would react to changed database entries. This would then send a request to control the desired periphery. The possibility to configure triggers directly in the database made a server superfluous and saved us a lot of time. We configured the trigger so that it executes a JavaScript function when a change is made to the database. This evaluates data from the database and executes corresponding requests, which directly control the periphery. Initially, we had hit a minor roadblock in determining how to handle the authentication needs (creating security tokens), which the periphery needs and expects during a request. To solve for this, we stored the security tokens on an AWS server which listens to an HTTP request. From Atlas, we then just have to call the URL and the AWS instance does the authentication and control of the lights. After we solved this problem, we were thrilled with how little configuration was needed and how intuitive Atlas is. 
The next steps, like connecting Atlas to the app, were easy. We achieved this by sending data from Flutter to Atlas over HTTPS with the Atlas Data API. AG: How did Atlas enable you to build your winning application? JB: By the end of the challenge, we had developed our idea into a fully functional prototype using Google ARCore, Flutter, MongoDB Atlas, and the Bosch Smart Home Hardware (Figure 1). We built a smartphone application that uses AR to switch on and off a connected light in a smart building. The position and state of the light (on or off) are stored in the database. If the state of the light should change, the app manipulates the corresponding value in the database. The change triggers a function that then sets the light to the desired state (on or off). The fact that we were able to achieve this within a short time without sufficient prior knowledge is mainly due to the ease and intuitive nature of Atlas. The simple handling allowed us to quickly learn and use the available features to build the functionality our app needed. Figure 1: Tech stack for the project's prototype. AG: What additional features within Atlas did you find the most valuable in building your application? JB: We created different users to easily control the access rights of the app and the smart devices. By eliminating the need for another server to communicate with the smart devices and using the trigger function of Atlas, we were able to save a lot of time on the prototype. In addition, the provided preconfigured code examples in various languages facilitated easy integration to our frontend and helped us avoid errors. Anyone who is interested can find the results of our work in the GitHub repo. AG: Do you see yourself using Atlas more in the future? JB: We will definitely continue to use Atlas in the future. The instance from the hackathon is still online, and we want to get to know the other functionalities that we haven't used yet. 
Given how intuitive Atlas was in this project, I am also sure that we will continue to use it for future projects as well. Through this project, Jonas and team were able to build a functional prototype that can help commercial building owners have more control over their buildings and take the steps to help reduce CO₂ emissions.
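
The trigger pattern Jonas describes, where a change to a stored light state runs a function that controls the device, can be sketched as a small in-memory simulation. This is not the team's code and not the Atlas Triggers API (Atlas Triggers run JavaScript functions, as noted in the interview). The `Collection` class, the `control_light` function, and the `is_on` and `position` fields are all hypothetical; the event is shaped loosely like a MongoDB change event, with `operationType` and `fullDocument` fields.

```python
class Collection:
    """In-memory stand-in for a collection that notifies triggers on every change."""

    def __init__(self):
        self.docs = {}
        self.triggers = []

    def on_change(self, fn):
        # Register a function to run whenever a document changes.
        self.triggers.append(fn)

    def update(self, doc_id, fields):
        doc = self.docs.setdefault(doc_id, {"_id": doc_id})
        doc.update(fields)
        event = {"operationType": "update", "fullDocument": dict(doc)}
        for fn in self.triggers:
            fn(event)


sent_requests = []  # stands in for HTTP requests sent to the light controller


def control_light(event):
    """Read the new state from the changed document and 'send' the matching command."""
    doc = event["fullDocument"]
    sent_requests.append((doc["_id"], "on" if doc["is_on"] else "off"))


lights = Collection()
lights.on_change(control_light)
lights.update("lamp-1", {"is_on": True, "position": [1.0, 2.0, 0.5]})  # app switches light on
lights.update("lamp-1", {"is_on": False})                              # app switches light off
print(sent_requests)  # [('lamp-1', 'on'), ('lamp-1', 'off')]
```

The app only writes to the database, and the reaction lives next to the data, so no separate server has to poll for changed entries.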

January 12, 2023

Demystifying Sharding with MongoDB

Sharding is a critical part of modern databases, yet it is also one of the most complex and least understood. At MongoDB World 2022, sharding software engineer Sanika Phanse presented Demystifying Sharding in MongoDB, a brief but comprehensive overview of the mechanics behind sharding. Read on to learn about why sharding is necessary, how it is executed, and how you can optimize the sharding process for faster queries. Watch this deep-dive presentation on the ins and outs of sharding, featuring MongoDB sharding software engineer Sanika Phanse. What is sharding, and how does it work? In MongoDB Atlas, sharding is a way to horizontally scale storage and workloads in the face of increased demand by splitting them across multiple machines. In contrast, vertical scaling requires the addition of more physical hardware, for example, in the form of servers or components like CPUs or RAM. Once you’ve hit the capacity of what your servers can support, sharding becomes your solution. Past a certain point, vertical scaling requires teams to spend significantly more time and money to keep pace with demand. Sharding, however, spreads data and traffic across your servers, so it’s not subject to the same physical limitations. Theoretically, sharding could enable you to scale infinitely, but, in practice, you are scaling proportionally to the number of servers you add. Each additional shard increases both storage and throughput, so your servers can simultaneously store more data and process more queries. How do you distribute data and workloads across shards? At a high level, sharding data storage is straightforward. First, a user must specify a shard key, or a subset of fields to partition their data by. Then, data is migrated across shards by a background process called the balancer, which ensures that each shard contains roughly the same amount of data. Once you specify what your shard key will be, the balancer will do the rest. 
A common form of distribution is ranged sharding, which assigns data to various shards through a range of shard keys. Using this approach, one shard will contain all the data with shard keys ranging from 0-99, the next will contain 100-199, and so forth. In theory, sharding workloads is also simple. For example, if you receive 1,000 queries per second on a single server, sharding your workload across two servers would divide the number of queries per second equally, where each server receives 500 queries per second. However, these ideal conditions aren’t always attainable, because workloads aren’t always evenly distributed across shards. Imagine a group of 50,000 students, whose grades are split between two shards. If half of them decide to check their grades, and all of their records happen to fall in the same shard ID range, then all their data will be stored on the same shard. As a result, all the traffic will be routed to one shard server. Note that both of these examples are highly simplified; real-world situations are not as neat. Shards won’t always contain a balanced range of shard IDs, because data might not be evenly divided across shards. Additionally, 50,000 students, while large, is still too small of a sample size to be in a sharded cluster. How do you map and query sharded data? Without an elegant solution, users may encounter latency or failed queries when they try to retrieve sharded data. The challenge is to tie together all your shards, so it feels like you’re communicating with one database, rather than several. This solution starts with the config server, which holds metadata describing the sharded cluster, as well as the most up-to-date routing table, which maps shard keys to shard connection strings. To increase efficiency, routers regularly contact the config server to create a cached copy of this routing table. 
Nonetheless, at any given point in time, the config server’s version of the routing table can be considered the single source of truth. To query sharded data, your application sends your command to the team of routers. After a router picks up the command, it will then use the shard key from the command’s query, in conjunction with its cached copy of the routing table, to direct the query to the correct location. Rather than using the entire document, the user will only select one field (or combination of fields) to serve as the shard key. Then, the query will make its way to the correct shard, execute the command, update, and return a successful result to the router. Operations aren’t always so simple, especially when queries do not specify shard keys. In this case, the router realizes that it is unaware of where your data exists. Thus, it sends the query to all the shards, and then it waits to gather all the responses before returning to the application. Although this specific query is slow if you have many shards, it might not pose a problem if this query is infrequent or uncommon. How do you optimize shards for faster queries? Shard keys are critical for seamless operations. When selecting a shard key, use a field that matches on all (or most) of your data and has a high cardinality. This step ensures granularity among shard key values, which allows the data to be distributed evenly across shards. Additionally, your data can be resharded as needed, to fit changing requirements or to improve efficiency. Users can also accelerate queries with thoughtful planning and preparation, such as optimizing their data structures for the most common, business-critical query patterns. For example, if your workload makes lots of age-based queries and few _ID-based queries, then it might make sense to sort data by age to ensure more targeted queries. Hospitals are good examples, as they pose unique challenges. 
Assuming that the hospital’s patient documents would contain fields such as insurance, _ID value, and first and last names, which of these values would make sense as a shard key? Patient name is one possibility, but it is not unique, as many people might have the same name. Similarly, insurance can be eliminated, because there are only a handful of insurance providers, and people might not even have insurance. This key would violate both the high-cardinality principle, as well as the requirement that every document has this value filled. The best candidate for shard key would be the patient ID number or _ID value. After all, if one patient visits, that does not indicate whether another patient will (or will not) visit. As a result, the uniqueness of the _ID value will be very useful, as it will enable users to make targeted queries to the one document that is relevant to the patient. Faced with repeating values, users can also create compound shard keys instead. By including hyphenated versions of multiple fields, such as _ID value, patient names, and providers, a compound shard key can help reduce query bottlenecks and latency. Ultimately, sharding is a valuable tool for any developer, as well as a cost-effective way to scale out your database capacity. Although it may seem complicated in practice, sharding (and working effectively with sharded data) can be very intuitive with MongoDB. To learn more about sharding — and to see how you can set it up in your own environment — contact the MongoDB Professional Services team today.
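
The routing behavior described in this post, ranged sharding plus a fallback to every shard when a query has no shard key, can be sketched in a few lines. This is a simplified illustration and not how MongoDB's routers are implemented: the `Router` class, the shard names, and the `shard_key` query field are assumptions, and the split points follow the 0-99 and 100-199 ranges used in the example above.

```python
import bisect


class Router:
    """Toy router holding a cached routing table for ranged sharding."""

    def __init__(self, split_points, shards):
        # split_points[i] is the first key that belongs to shards[i + 1],
        # so there is always one more shard than there are split points.
        assert len(shards) == len(split_points) + 1
        self.split_points = split_points
        self.shards = shards

    def target(self, query):
        """Return the shards that must receive this query."""
        if "shard_key" not in query:
            # No shard key: the router cannot tell where the data lives,
            # so it has to ask every shard and gather the responses.
            return list(self.shards)
        index = bisect.bisect_right(self.split_points, query["shard_key"])
        return [self.shards[index]]


# Keys 0-99 go to shard-a, 100-199 to shard-b, and 200 and above to shard-c.
router = Router(split_points=[100, 200], shards=["shard-a", "shard-b", "shard-c"])

print(router.target({"shard_key": 42}))    # ['shard-a']
print(router.target({"shard_key": 100}))   # ['shard-b']
print(router.target({"name": "Alice"}))    # ['shard-a', 'shard-b', 'shard-c']
```

A query that includes the shard key touches one shard, while a query without it touches all of them. That difference is why the post recommends choosing a shard key that matches the most common query patterns.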

January 3, 2023

Tackling the 5G Complexity Beast with MongoDB’s Developer Data Platform Simplicity

The advent and commercialization of 5G is driving a sea change in the mobile user experience. This success is evidenced by the booming adoption of 5G-enabled devices. Supporting real-time business, streaming, and gaming applications on a 5G network is essential for telecommunications companies’ enterprise growth but demanding on the systems that support them. As the “cloudification” of network functions continues to evolve, it grows more challenging for older business support systems (BSS) and operations support systems (OSS) to keep up. To address the needs of increasingly complex networks, operators are reevaluating their data strategy by recognizing that a developer-focused data platform, to address the needs of mission critical systems, can enable a greater level of agility across the enterprise. This is the thesis of a new IDC white paper, sponsored by MongoDB, Effective Data Management is Essential for Taming the 5G Network Complexity Beast (doc #US49660722, September 2022). In the analysis, led by Karl Whitelock, Research Vice President, Communication Service Provider - Operations & Monetization, IDC examines the new generation of services that will drive innovation in multiple industries, and reviews solutions for the challenges telecommunications providers will face amid new operations and monetization strategies derived from 5G and mobile edge computing services. Take me straight to the IDC White Paper: Effective Data Management is Essential for Taming the 5G Network Complexity Beast Building business solutions at the network edge As software-driven 5G services evolve through a cloud-native network architecture, complexity grows. Within the multi-technology network, an advancing web of systems connects data from the mobile network to an edge cloud, HCP cloud, the core network, the internet, and back again. 
To manage this complexity, network automation and extensive data analytics capabilities are key components in delivering a first-class customer experience. The new generation of digital services is 5G enabled. IDC is witnessing demand from social media, streaming video, search, gaming, transport, and industrial internet IoT applications building network traffic, and associated data, at soaring rates. Businesses across diverse industries are jumping on the 5G bandwagon. The business solutions being dreamed up by developers are redefining services and business outcomes, particularly when utilizing delivery at the network edge. For example: Manufacturing Private 5G networks help high-speed production facilities identify defects and remove incorrectly assembled equipment. Architecture/Construction Robots measure architectural layouts and site dimensions are collected during construction. Records are stored in the cloud for later access by inspectors, builders, and customers. Sporting Events Edge computing can be faster and more reliable for processing data at large scale sporting events. This allows organizers to collect and process data to build interactive digital experiences at the edge.

December 7, 2022

MongoDB Named as a Leader in The Forrester Wave™: Translytical Data Platforms, Q4 2022

In The Forrester Wave™: Translytical Data Platforms, Q4 2022, translytical data platforms are described by Forrester as being “designed to support transactional, operational, and analytical workloads without sacrificing data integrity, performance, and analytics scale.” Characterized as next-generation data platforms, the Forrester report further notes that “Adoption of these platforms continues to grow strongly to support new and emerging business cases, including real-time integrated insights, scalable microservices, machine learning (ML), streaming analytics, and extreme transaction processing.” To help users understand this emerging technology landscape, Forrester published its previous Translytical Data Platforms Wave back in 2019. Three years on, Forrester has named MongoDB as a Leader in its latest Translytical Data Platforms Wave. We believe MongoDB was named a Leader in this report due to the R&D investments made in further building out capabilities in MongoDB Atlas , our multi-cloud developer data platform. These investments were driven by the demands of the developer communities we work with day-in, day-out. You told us how you struggle to bring together all of the data infrastructure needed to power modern digital experiences – from transactional databases to analytics processing, full-text search, and streaming. This is exactly what our developer data platform offers. It provides an elegant, integrated, and fully-managed data architecture accessed via a unified set of APIs. With MongoDB Atlas, developers are more productive, they ship code faster and improve it more frequently. Translytics and the Rise of Application-Driven Analytics Translytics is part of an important shift that we at MongoDB call application-driven analytics . By building smarter apps and increasing the speed of business insights, application-driven analytics gives you the opportunity to out-innovate your competitors and improve efficiency. 
To do this you can no longer rely only on copying data out of operational systems into separate analytics stores. Moving data takes time and creates too much separation between application events and actions. Instead, analytics processing has to be “shifted left” to the source of your data – to the applications themselves. This is the shift MongoDB calls application-driven analytics. It’s a shift that impacts both the skills and the technologies developers and analytics teams use every day. This is why understanding the technology landscape is so important. Overall, MongoDB is good for customers that are driving their strategy around developers who are tasked with building analytics into their applications. The Forrester Wave™: Translytical Data Platforms, Q4 2022 Evaluating the top vendors in the Translytical Data Platforms Wave Forrester evaluated 15 of the most significant translytical data platform vendors against 26 criteria. These criteria span current offering and strategy through to market presence. Forrester gave MongoDB the highest possible scores across eleven criteria, including: Number of customers, Performance, Scalability, Dev Tools/API, Multi-model, Streaming, Cloud / On-prem / distributed architecture, and Commercial model. The report cites that “MongoDB ramps up its translytical offering aggressively”, and that “Organizations use MongoDB to support real-time analytics, systems of insight, customer 360, internet of things (IoT), and mobile applications.” Access your complimentary copy of the report here. Customer Momentum Many development teams start out using MongoDB as an operational database for both new cloud-native services as well as modernized legacy apps. More and more of these teams are now improving customer experience and speeding business insight by adopting application-driven analytics. Examples include: Bosch for predictive maintenance using IoT sensor data. Keller Williams for relevance-based property search and sales dashboarding. 
Iron Mountain for AI-based information discovery and intelligence. Volvo Connect for fleet management. Getting started on your Translytics Journey The MongoDB Atlas developer data platform is engineered to help you make the shift to Translytics and application-driven analytics – leading to smarter apps and increased business visibility. The best way to get started is to sign up for an account on MongoDB Atlas . Then create a free database cluster, load your own data or our sample data sets, and explore what’s possible within the platform. The MongoDB Developer Center hosts an array of resources including tutorials, sample code, videos, and documentation organized by programming language and product. Whether you are a developer or a member of an analytics team, it's never been easier to get started enriching your transactional workloads with analytics!

November 29, 2022

Modernize your GraphQL APIs with MongoDB Atlas and AWS AppSync

Modern applications typically need data from a variety of data sources, which are frequently backed by different databases and fronted by a multitude of REST APIs. Consolidating the data into a single coherent API presents a significant challenge for application developers. GraphQL emerged as a leading data query and manipulation language to simplify consolidating various APIs. GraphQL provides a complete and understandable description of the data in your API, giving clients the power to ask for exactly what they need — while making it easier to evolve APIs over time. It complements popular development stacks like MEAN and MERN , aggregating data from multiple origins into a single source that applications can then easily interact with. MongoDB Atlas: A modern developer data platform MongoDB Atlas is a modern developer data platform with a fully managed cloud database at its core. It provides rich features like native time series collections, geospatial data, multi-level indexing, search, isolated workloads, and many more — all built on top of the flexible MongoDB document data model. MongoDB Atlas App Services help developers build apps, integrate services, and connect to their data by reducing operational overhead through features such as hosted Data API and GraphQL API. The Atlas Data API allows developers to easily integrate Atlas data into their cloud apps and services over HTTPS with a flexible, REST-like API layer. The Atlas GraphQL API lets developers access Atlas data from any standard GraphQL client with an API that generates based on your data’s schema. AWS AppSync: Serverless GraphQL and pub/sub APIs AWS AppSync is an AWS managed service that allows developers to build GraphQL and Pub/Sub APIs. With AWS AppSync, developers can create APIs that access data from one or many sources and enable real-time interactions in their applications. 
The resulting APIs are serverless, automatically scale to meet the throughput and latency requirements of the most demanding applications, and charge only for API requests and real-time messages delivered. Exposing your MongoDB Data over a scalable GraphQL API with AWS AppSync Together, AWS AppSync and MongoDB Atlas help developers create GraphQL APIs by integrating multiple REST APIs and data sources on AWS. This gives frontend developers a single GraphQL API data source to drive their applications. Compared to REST APIs, developers get flexibility in defining the structure of the data while reducing the payload size by bringing only the attributes that are required. Additionally, developers are able to take advantage of other AWS services such as Amazon Cognito, AWS Amplify, Amazon API Gateway, and AWS Lambda when building modern applications. This allows for a serverless end-to-end architecture, which is backed by MongoDB Atlas serverless instances and available in pay-as-you-go mode from the AWS Marketplace. Paths to integration AWS AppSync uses data sources and resolvers to translate GraphQL requests and to retrieve data; for example, users can fetch MongoDB Atlas data using AppSync Direct Lambda Resolvers. Below, we explore two approaches to implementing Lambda Resolvers: using the Atlas Data API or connecting directly via MongoDB drivers. Using the Atlas Data API in a Direct Lambda Resolver With this approach, developers leverage the pre-created Atlas Data API when building a Direct Lambda Resolver. This ready-made API acts as a data source in the resolver, and supports popular authentication mechanisms based on API Keys, JWT, or email-password. This enables seamless integration with Amazon Cognito to manage customer identity and access. The Atlas Data API lets you read and write data in Atlas using standard HTTPS requests and comes with managed networking and connections, replacing your typical app server. 
Any runtime capable of making HTTPS calls is compatible with the API. Figure 1:   Architecture details of Direct Lambda Resolver with Data API Figure 1 shows how AWS AppSync leverages the AWS Lambda Direct Resolver to connect to the MongoDB Atlas Data API. The Atlas Data API then interacts with your Atlas Cluster to retrieve and store the data. MongoDB driver-based Direct Lambda Resolver With this option, the Lambda Resolver connects to MongoDB Atlas directly via drivers, which are available in multiple programming languages and provide idiomatic access to MongoDB. MongoDB drivers support a rich set of functionality and options, including the MongoDB Query Language, write and read concerns, and more. Figure 2:   Details the architecture of Direct Lambda Resolvers through native MongoDB drivers Figure 2 shows how the AWS AppSync endpoint leverages Lambda Resolvers to connect to MongoDB Atlas. The Lambda function uses a MongoDB driver to make a direct connection to the Atlas cluster, and to retrieve and store data. The table below summarizes the different resolver implementation approaches. Table 1:   Feature comparison of resolver implementations Setup Atlas Cluster: Set up a free cluster in MongoDB Atlas. Configure the database for network security and access. Set up the Data API. Secrets Manager: Create a secret in AWS Secrets Manager to securely store database credentials. Lambda Function: Create Lambda functions with the MongoDB Data APIs or MongoDB drivers as shown in this GitHub tutorial. AWS AppSync setup: Set up AWS AppSync to configure the data source and query. Test API: Test the AWS AppSync APIs using the AWS Console or Postman. Figure 3:   Test results for the AWS AppSync query Conclusion To learn more, refer to the AppSync Atlas Integration GitHub repository for step-by-step instructions and sample code. This solution can be extended to AWS Amplify for building mobile applications. For further information, please contact .
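As a rough illustration of the Data API approach described above, the following Python sketch shows what a Direct Lambda Resolver could look like. It is not taken from the referenced tutorial. The App ID, cluster, database, collection, and environment variable names are hypothetical placeholders, the GraphQL query is assumed to take a single `id` argument, and the API key is read from an environment variable for brevity, whereas the setup steps above store credentials in AWS Secrets Manager.

```python
import json
import os
import urllib.request

# Data API URL pattern; app_id is a placeholder for your own App Services App ID.
BASE_URL = "https://data.mongodb-api.com/app/{app_id}/endpoint/data/v1/action/{action}"


def build_find_one_request(app_id, api_key, data_source, database, collection, filter_doc):
    """Build (but do not send) an HTTPS POST request for the Data API findOne action."""
    url = BASE_URL.format(app_id=app_id, action="findOne")
    body = {
        "dataSource": data_source,  # name of the linked Atlas cluster
        "database": database,
        "collection": collection,
        "filter": filter_doc,
    }
    headers = {"Content-Type": "application/json", "api-key": api_key}
    return urllib.request.Request(
        url, data=json.dumps(body).encode("utf-8"), headers=headers, method="POST"
    )


def handler(event, context):
    """Lambda entry point. With a Direct Lambda Resolver, AppSync passes the
    GraphQL field arguments in event["arguments"]."""
    request = build_find_one_request(
        app_id=os.environ["ATLAS_APP_ID"],    # hypothetical variable name
        api_key=os.environ["ATLAS_API_KEY"],  # hypothetical variable name
        data_source="Cluster0",               # hypothetical cluster name
        database="sample_db",                 # hypothetical database
        collection="listings",                # hypothetical collection
        filter_doc={"_id": event["arguments"]["id"]},
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        # findOne responds with a JSON object of the form {"document": {...}}.
        return json.loads(response.read())["document"]
```

Keeping the request construction separate from the network call makes the resolver easy to unit test without reaching Atlas.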

November 23, 2022

MongoDB and AWS: Simplifying OSDU Metadata Management

In this decade of the 2020s, the energy sector is experiencing two major changes at the same time: The transition from fossil to renewables, and the digital transformation that changes the way businesses operate through better applications and tools that help streamline and automate processes. To support both of these challenges, the Open Group OSDU Forum has created a new data platform standard for the energy industry that seeks to reduce data silos and enable transformational workflows via an open, standards-based API set and supporting ecosystem. OSDU (Open Subsurface Data Universe) is an industry-defining initiative that provides a unified approach to store and retrieve data in a standardized way in order to allow reductions in infrastructure cost, simplify the integration of separate business areas, and adopt new energy verticals within the same architectural principles. Amazon Web Services (AWS) — as an early supporter of OSDU — provides a premier, cloud-first offering available across more than 87 availability zones and 27 regions. MongoDB — an OSDU member since 2019 — and AWS are collaborating to leverage MongoDB as part of the AWS OSDU platform for added flexibility and to provide a robust multi-region OSDU offering to major customers. Why MongoDB for OSDU? OSDU provides a unique challenge, as its architecture is set to support a varied data set originating from the oil and gas industry, while also being extensible enough to support the expanding requirements of new energy and renewables. It must be able to support single-use on a laptop for beginning practitioners, yet scale to the needs of experts with varying deployment scenarios — from on-premises, in-field, and cloud — and from single tenant on one region to multi-region and multi-tenant applications. 
Furthermore, OSDU architectural principles separate raw object data from the metadata that describes it, which puts an additional burden on the flexibility needed to manage OSDU metadata, while supporting all the above requirements. Enter MongoDB Since 2008, MongoDB has championed the use of the document model as the data store that supports a flexible JSON-type structure, which can be considered a superset of different existing data types — from tabular, key-value, and text to geo-spatial, graph, and time series. Thus, MongoDB has the flexibility not only to support just the main metadata services in OSDU but also to adapt to the needs of domain-specific services as OSDU evolves. The flexibility of MongoDB allows users to model and query the data in a variety of ways within the same architecture without the need to proliferate disparate databases for each specific data type, which incurs overhead both in terms of deployment, cost and scale, and the ability to query. The schema flexibility inherent in this document model allows developers to adapt and make changes quickly, without the operational burden that comes with schema changes with traditional tabular databases. MongoDB can also scale from the smallest environment to massive, multi-region deployments, with cross-regional data replication support that is available today across more than 90 regions with MongoDB Atlas . With the addition of MongoDB’s cluster-to-cluster sync , MongoDB can easily support hybrid deployments bridging on-premises or edge to the cloud, a requirement that is increasingly important for energy supermajors or for regions where data sovereignty is paramount. Example: LegalTag An example of the benefit of MongoDB’s document model is OSDU’s LegalTag Compliance Service , which governs the legal status of data in the OSDU data ecosystem. It is a collection of JSON properties that governs how the data can be consumed and ingested. 
With MongoDB, the properties are directly stored, indexed, and made available to be queried, even via full-text search for more advanced use cases. The schema flexibility simplifies integrating additional derived data from ingested data sources, which is utilized for the further enrichment of the LegalTag metadata. Here the JSON document can accommodate more nodes to integrate this data without the need for new tables and data structures that need to be created and managed. AWS OSDU with MongoDB MongoDB and AWS collaborated to provide a MongoDB-based metadata implementation (Figure 1), which is available for all main OSDU services: Partition, Entitlements, Legal, Schema, Storage. The AWS default OSDU Partition service leverages MongoDB due to its simple replication capabilities (auto-deployable via CloudFormation, Terraform, and Kubernetes), which simplify identifying the correct connection information at runtime to the correct OSDU partition in a multi-region and multi-cluster deployment. The OSDU Entitlements service uses groups to manage authorization and permissions for access to OSDU services and their data. The most recent OSDU reference implementation for Entitlements leverages a graph model to manage the relationship between groups, members, and owners. Thus, AWS again chose MongoDB with its inherent graph capabilities through the document model to simplify the implementation without the need to integrate a further dedicated database technology into the architecture. Figure 1:   MongoDB metadata service options with AWS OSDU. Other potential benefits for OSDU MongoDB also offers workload isolation, which provides the ability to dedicate instances only for reporting workloads against the operational dataset. This provides the ability to create real-time observability of the system based on the activity on metadata. 
Triggers and aggregation pipelines allow the creation of an alternate view of activity in real-time, which can easily be visualized via MongoDB Charts (part of Atlas) without the need for a dedicated visualization system. Flexibility and consistency A major use case for both the energy industry and the direction of OSDU is the ability to capture and preprocess data closest to where it originated. For remote locations where direct connections to the cloud are prohibitive, this approach is often the only option — think Arctic or off-shore locations. Additionally, certain countries have data sovereignty laws that require an alternative deployment option outside of the public cloud. A MongoDB-based OSDU implementation can provide a distinct advantage, as MongoDB as a data platform itself supports deployment in the field (e.g., off-shore), on-premises, in private cloud (e.g., Kubernetes, Terraform), public cloud (e.g., AWS) and as a SaaS implementation (e.g., Atlas). Adoption of MongoDB for OSDU provides consistency across different deployment/cloud scenarios, thereby reducing the overhead for managing and operating a disparate set of technologies where multiple scenarios are required. Conclusion OSDU has been created to change the way data is collected and shared across the oil and gas and energy industry. Its intent is to accelerate digital transformation within the industry. The range of use cases and deployment scenarios requires a solution that provides flexibility in the supported datasets, flexibility for the developer to innovate without additional schema and operational burden, as well as flexibility to be deployable in various environments. Through the collaboration of AWS and MongoDB, there is an additional metadata storage option available for OSDU that provides a modern technology stack with the performance and scalability for the most demanding scenario in the energy industry. 1. MongoDB Atlas 2. MongoDB Edge Computing 3. OSDU Data platform on AWS

November 22, 2022

Manage and Store Data Where You Want with MongoDB

Increasingly, data is stored in a public cloud as companies realize the agility and cost benefits of running on cloud infrastructure. At any given time, however, organizations must know where their data is located, replicated, and stored — as well as how it is collected and processed to constantly ensure personal data privacy. Creating a proper structure for storing your data just where you want it can be complex, especially with the shift towards geographically dispersed data and the need to comply with local and regional privacy and data security requirements. Organizations without a strong handle on where their data is stored potentially risk millions of dollars in regulatory fines for mishandling data, loss of brand credibility, and distrust from customers. Geographically dispersed data and various compliance regulations also impact how organizations design their applications, and many see these challenges as an opportunity to transform how they engage with data. For example, organizations get the benefits of a multi-cloud strategy and avoid vendor lock-in, knowing that they can still run on-premises or on a different cloud provider. However, a flexible data model is needed to keep data within the confines of the country or region where the data originates. MongoDB runs where you want your data to be — on-premises, in the cloud, or as an on-demand, fully managed global cloud database. In this article, we’ll look at ways MongoDB can help you keep your data exactly where you need it. Major considerations for managing data When managing data, organizations must answer questions in several key areas, including: Process: How is your company going to scale security practices and automate compliance for the most prevalent data security and privacy regulatory frameworks? Penalties: Are your business leaders fully aware of the costs associated with not adhering to regulations when storing and managing your data? 
Scalability: Do you have an application that you anticipate will grow in the future and can scale automatically as demand requires? Infrastructure: Is legacy infrastructure keeping you from being able to easily comply with data regulations? Flexibility: Is your data architecture agile enough to meet regulations quickly as they grow in breadth and complexity? Cost: Are you wasting time and money with manual processes when adhering to regulations and risking hefty fines related to noncompliance? How companies use MongoDB to store data where they want and need it When storing and managing data in different regions and countries, organizations must also understand the rules and regulations that apply. MongoDB is uniquely positioned to support organizations to meet their data goals with intuitive security features and privacy controls, as well as the ability to geographically deploy data clusters and backups in one or several regions. Zones in sharded clusters MongoDB uses sharding to support deployments with very large data sets and high-throughput operations. In sharded clusters, you can create zones of sharded data based on the shard key, which helps improve the locality of data. Network isolation and access Each MongoDB Atlas project is provisioned into its own virtual private cloud (VPC), thereby isolating your data and underlying systems from other MongoDB Atlas users. This approach allows businesses to meet data requirements while staying highly available within each region. Each shard of data will have multiple nodes that automatically and transparently fail over for zero downtime, all within the same region. Multi-cloud clusters MongoDB Atlas is the only globally distributed, multi-cloud database. It lets you deploy a single cluster across AWS, Microsoft Azure, and Google Cloud without the operational complexity of managing data replication and migration across clouds. 
With the ability to define a geographic location for each document, your teams can also keep relevant data close to end users for regulatory compliance. IP whitelists IP whitelists allow you to specify a specific range of IP addresses against which access will be granted, delivering granular control over data. Queryable encryption Queryable encryption enables encryption of sensitive data from the client side, stored as fully randomized, encrypted data on the database server side. This feature delivers the utmost in security without sacrificing performance and is available on both MongoDB Atlas and Enterprise Advanced. MongoDB Atlas global clusters Atlas global clusters allow organizations with distributed applications to geographically partition a fully managed deployment in a few clicks and control the distribution and placement of their data with sophisticated policies that can be easily generated and changed. Thus, your organization can not only achieve compliance with local data protection regulations more easily but also reduce overhead. Client-Side Field Level Encryption MongoDB’s Client-Side Field Level Encryption (FLE) dramatically reduces the risk of unauthorized access or disclosure of sensitive data. Fields are encrypted before they leave your application, protecting them everywhere — in motion over the network, in database memory, at rest in storage and backups, and in system logs. Segmenting data by location with sharded clusters As your application gets more popular, you may reach a point where your servers will reach their maximum load. Before you reach that point, you must plan for scaling your database to adjust resources to meet demand. Scaling can occur temporarily, with a sudden burst of traffic, or permanently with a constant increase in the popularity of your services. 
Increased usage of your application brings three main challenges to your database server: The CPU and/or memory becomes overloaded, and the database server either cannot respond to all the request throughput, or do so in a reasonable amount of time. Your database server runs out of storage and thus cannot store all the data. Your network interface is overloaded, so it cannot support all the network traffic received. When your system resource limits are reached, you will want to consider scaling your database. Horizontal scaling refers to bringing on additional nodes to share the load. This process is difficult with relational databases because of the difficulty in spreading out related data across nodes. With non-relational databases, this is made simpler because collections are self-contained and not coupled relationally. This approach allows them to be distributed across nodes more simply, as queries do not have to “join” them together across nodes. Horizontal scaling with MongoDB Atlas is achieved through sharding. With sharded clusters, you can create zones of sharded data based on the shard key . You can associate each zone with one or more shards in the cluster. A shard can be associated with any number of zones. In a balanced cluster, MongoDB migrates chunks covered by a zone only to those shards associated with the zone: If one of the data centers goes down, the data is still available for reads unlike a single data center distribution. If the data center with a minority of the members goes down, the replica set can still serve write operations as well as read operations. However, if the data center with the majority of the members goes down, the replica set becomes read-only. Figure 1 illustrates a sharded cluster that uses geographic zones to manage and satisfy data segmentation requirements. 
Figure 1:   Sharded cluster Other benefits of MongoDB Atlas MongoDB Atlas also provides organizations with an intuitive UI or administration API to efficiently perform tasks that would otherwise be very difficult. Upgrading your servers or setting up sharding without having to shut down your servers can be a challenge, but MongoDB Atlas removes this layer of difficulty through the features described here. With MongoDB, scaling your databases can be done with a couple of clicks. Meeting your data goals with MongoDB Organizations are uniquely positioned to store and manage data where they want it with MongoDB’s range of features discussed above. With the shift towards geographically dispersed data, organizations must make sure they are aware of – and fully understand – the local and regional rules and requirements that apply for storing and managing data. To learn more about how MongoDB can help you meet your data goals, check out the following resources: MongoDB Atlas security, with built-in security controls for all your data Entrust MongoDB Cloud Services with sensitive application and user data Scalability with MongoDB Atlas
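To make the zone mechanics described earlier in this article more concrete, here is a minimal Python sketch that builds the two administrative commands involved, `addShardToZone` and `updateZoneKeyRange`. It assumes a self-managed sharded cluster where these commands can be run directly (Atlas global clusters manage zones for you). The shard name, zone name, namespace, and shard key fields used in the usage note are hypothetical.

```python
def add_shard_to_zone(shard, zone):
    """Command document that associates a shard with a zone."""
    return {"addShardToZone": shard, "zone": zone}


def update_zone_key_range(namespace, zone, min_bound, max_bound):
    """Command document that assigns a shard key range to a zone.

    The lower bound is inclusive and the upper bound is exclusive. Both bounds
    must use the fields of the collection's shard key.
    """
    return {
        "updateZoneKeyRange": namespace,
        "min": min_bound,
        "max": max_bound,
        "zone": zone,
    }


def apply_commands(client, commands):
    """Run each command against the admin database through a mongos connection.

    `client` is expected to be a pymongo MongoClient created by the caller.
    """
    for command in commands:
        client.admin.command(command)
```

For a collection sharded on a hypothetical `{country: 1, userId: 1}` key, the bounds for German users could be `{"country": "DE", "userId": MinKey()}` and `{"country": "DE", "userId": MaxKey()}`, using the `MinKey` and `MaxKey` types from the `bson` package that ships with pymongo.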

November 22, 2022

Optimizing Your MongoDB Deployment with Performance Advisor

We are happy to announce additional enhancements to MongoDB’s Performance Advisor, now available in MongoDB Atlas, MongoDB Cloud Manager, and MongoDB Ops Manager. MongoDB’s Performance Advisor automatically analyzes logs for slow-running queries and provides index suggestions to improve query performance. This latest release includes several key updates: A new ranking algorithm and additional performance statistics (e.g., average documents scanned, average documents returned, and average object size) make it easier to understand the relative importance of each index recommendation. Support for additional query types including regexes, negation operators (e.g., $ne, $nin, $not), $count, $distinct, and $match, so that these are also covered by optimized index suggestions. Index recommendations are now more deterministic so they are less impacted by time and provide more consistent query performance benefits. Before diving further into MongoDB’s Performance Advisor, let’s look at tools MongoDB provides out of the box to simplify database monitoring. Background Deploying your MongoDB cluster and getting your database running is a critical first step, but another important aspect of managing your database is ensuring that your database is performant and running efficiently. To make this easier for you, MongoDB offers several out-of-the-box monitoring tools, such as the Query Profiler, Performance Advisor, Real-Time Performance Panel, and Metrics Charts, to name a few. Suppose you notice that your database queries are running slower. The first place you might go is to the metrics charts to look at the “Opcounters” metrics to see whether you have more operations running. You might also look at the “Operation Execution Time” to see if your queries are taking longer to run. The “Query Targeting” metric shows the ratio of the number of documents scanned over the number of documents returned. 
This datapoint is a great measure of the overall efficiency of a query — the higher the ratio, the less efficient the query. These and other metrics can help you identify performance issues with your overall cluster, which you can then use as context to dive a level deeper and perform more targeted diagnostics of individual slow-running queries . MongoDB’s Performance Advisor takes this functionality a step further by automatically scanning your slowest queries and recommending indexes where appropriate to improve query performance. Getting started with Performance Advisor The Performance Advisor is a unique tool that automatically monitors MongoDB logs for slow-running queries and suggests indexes to improve query performance. Performance Advisor also helps improve both your read and write performance by intelligently recommending indexes to create and/or drop (Figure 1). These suggestions are ranked by the determined impact on your cluster. Performance Advisor is available on M10 and above clusters in MongoDB Atlas as well as in Cloud Manager and Ops Manager. Figure 1:  Performance Advisor can recommend indexes to create or drop. Performance Advisor will suggest which indexes to create, what queries will be affected by the index, and the expected improvements to query performance. All of these user interactions are available in the user interface directly within Performance Advisor, and indexes can be easily created with just a few clicks. Figure 2 shows additional Performance Advisor statistics about the performance improvements this index would provide. 
The performance statistics that are highlighted for each index recommendation include: Execution Count: The number of queries per hour that would be covered by the recommended index; Avg Execution Time: The average execution time of queries that would be covered by the recommended index; Avg Query Targeting: The inefficiency of queries that would be covered by the recommended index, measured by the number of documents or index keys scanned in order to return one document; In Memory Sort: The number of in-memory sorts performed per hour for queries that would be covered by the recommended index; Avg Docs Scanned: The average number of documents that were scanned by slow queries with this query shape; Avg Docs Returned: The average number of documents that were returned by slow queries with this query shape; Avg Object Size: The average object size of all objects in the impacted collection. If you have multiple index recommendations, they are ranked by their relative impact on query performance, so the most beneficial index suggestion is displayed at the top. Figure 2:  Detailed performance statistics. Creating optimal indexes ensures that queries are not scanning more documents than they return. However, creating too many indexes can slow down write performance, as each write operation must also update every affected index. Performance Advisor provides suggestions on which indexes to drop based on whether they are unused or redundant (Figure 3). Users also have the option to “hide” indexes as a way to evaluate the impact of dropping an index without actually dropping the index. Figure 3: Performance Advisor shows which indexes are unused or redundant. The Performance Advisor in MongoDB provides a simple and cost-efficient way to ensure you’re getting the best performance out of your MongoDB database. 
If you’d like to see the Performance Advisor in action, the easiest way to get started is to sign up for MongoDB Atlas , our cloud database service. Performance Advisor is available on MongoDB Atlas on M10 cluster tiers and higher. Learn more from the following resources: Monitor and Improve Slow Queries Monitor Your Database Deployments
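The Query Targeting ratio discussed in this article is simple arithmetic: documents scanned divided by documents returned. The short Python sketch below only illustrates that ratio with made-up numbers. It is not how Atlas computes or reports the metric, and the handling of queries that return no documents is a convention chosen for this example.

```python
def query_targeting(docs_scanned, docs_returned):
    """Ratio of documents scanned to documents returned; lower is better."""
    if docs_returned == 0:
        # Convention for this sketch only: scanning without returning anything
        # is treated as maximally inefficient.
        return float("inf") if docs_scanned > 0 else 0.0
    return docs_scanned / docs_returned


# Hypothetical figures for the same query shape before and after adding an index.
before_index = query_targeting(docs_scanned=50_000, docs_returned=50)
after_index = query_targeting(docs_scanned=50, docs_returned=50)
```

With these made-up figures the ratio drops from 1000 scanned documents per returned document to 1, which is the kind of improvement a well-chosen index is meant to produce.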

November 22, 2022

Start on Your Journey to Operationalize AI-Enhanced Real-Time Applications with MongoDB and Databricks

MongoDB and Databricks have succeeded in two complementary worlds: For MongoDB , the focus is making the world of data easy for developers building applications. For Databricks, the focus is helping enterprises to unify their data, analytics, and AI by combining a data lake's flexibility with the openness, performance, and governance of a data warehouse. Traditionally, these operational and analytical functions have existed in separate domains built by different teams and serving different audiences. Though some will pretend a data warehouse can unify such disparate data and systems, the reality is this approach leaves you making false trade-offs where your developers, your data scientists, and, ultimately, your applications and customers suffer. Data warehouses are not designed to serve consumer-facing applications at scale and process machine learning in real time. It takes the unique application-serving layer of a MongoDB database, combined with the scale and real-time capabilities of a lakehouse, such as Databricks, to automate and operationalize complex and AI-enhanced applications at scale. We observed that a large and growing population of joint customers has for years enabled the flow of data between our two platforms to run real-time businesses and enable a world of application-driven analytics, using MongoDB Connector for Apache Spark . So we asked ourselves: How could we make that a more seamless and elegant experience for these customers? Today we're announcing that Databricks now features MongoDB as a data source within a Databricks notebook , thereby enabling data practitioners with an easier, more curated experience for connecting Databricks with MongoDB Atlas data. This notebook experience makes it simpler for enterprises to deliver real-time analytics, handle complex data warehouse/BI workloads, and to operationalize AI/ML pipelines using the MongoDB Spark Connector . 
In turn, developer and data teams can collaborate more closely on building a new generation of app-driven intelligence. MongoDB and Databricks are committed to further improving our integration in the coming months. In this post, we'll explain how Databricks can be used as a real-time processing layer for data on MongoDB Atlas using the Spark Connector, extending MongoDB's built-in data processing capabilities like our aggregation framework. We'll also cover how to use Databricks' MongoDB notebook to make this even easier. In future posts we'll outline how to use MongoDB Atlas and Databricks Delta Lake to build sophisticated AI/ML pipelines. Live application data plus the data lakehouse MongoDB Atlas is a fully-managed developer data platform that powers a wide variety of workloads - supporting everything from simple CRUD operations to sophisticated data processing pipelines for analytics and transformation - all with a common query interface. With MongoDB Atlas you can isolate operational and analytical workloads using dedicated analytical nodes. Analytics nodes are read-only nodes that can be exclusively targeted by your queries. Let's look at an example. Assume you have long-running analytical queries that you want to run against your cluster and your operations team does not want these queries competing for resources with your regular operational workload. To address this, you add an analytics node to your cluster and then target it in your connection string using an Atlas replica set tag. You can connect to the analytical nodes to run sophisticated aggregation queries, BI and reporting workloads using the Atlas SQL interface, visualize your data using MongoDB Charts, or run Spark jobs using MongoDB’s Spark Connector. For more complex data science and warehousing analytical queries, many enterprises choose the Databricks Lakehouse Platform. 
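The analytics-node targeting described above comes down to two connection string options: a secondary read preference and the `nodeType:ANALYTICS` replica set tag that Atlas assigns to analytics nodes. The helper below is a small Python sketch of how those options could be appended. The function name and the example host are hypothetical.

```python
def with_analytics_targeting(uri):
    """Append read preference options that route reads to Atlas analytics nodes."""
    separator = "&" if "?" in uri else "?"
    return (
        f"{uri}{separator}"
        "readPreference=secondary&readPreferenceTags=nodeType:ANALYTICS"
    )


# Hypothetical cluster address; substitute your own credentials and host.
analytics_uri = with_analytics_targeting(
    "mongodb+srv://user:password@cluster0.example.mongodb.net/sales"
)
```

The resulting string can be passed to a MongoDB driver or to the Spark Connector so that long-running analytical reads stay off the nodes serving the operational workload.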
Enterprises can also benefit from enriching MongoDB data with data from other internal or external sources in the Databricks Lakehouse.

The Databricks Lakehouse Platform combines the best elements of data lakes and data warehouses to deliver the reliability, strong governance, and performance of data warehouses with the openness, flexibility, and machine learning support of data lakes. This unified approach simplifies your modern data stack by eliminating the data silos that traditionally separate and complicate data engineering, analytics, BI, data science, and machine learning.

With Databricks notebooks, developers and analytics teams can collaboratively write code in Python, R, Scala, and SQL, plus explore data with interactive visualizations and discover new insights. You can confidently and securely share code with co-authoring, commenting, automatic versioning, Git integrations, and role-based access controls.

As good as MongoDB and Databricks are on their own, together we offer enterprises the unmatched ability to work with live application data across traditionally separate domains. This ability allows your teams to deliver what we call application-driven analytics. How does this work?

Using MongoDB and Databricks together

MongoDB and Databricks offer several ways to integrate the two systems, but the primary means is MongoDB’s Spark Connector. The Spark Connector can be used within Databricks notebooks to directly query live application data managed in MongoDB collections and load it into DataFrames for further processing. You can also transform and/or enrich this data with data ingested from other sources using SparkSQL. Queries issued by the Spark Connector can be pushed down to MongoDB's aggregation framework and indexes for pre-processing, significantly improving query efficiency (with response times measured in milliseconds rather than seconds or minutes).
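
As a rough illustration of the pushdown described above, the sketch below builds read options for the MongoDB Spark Connector so that a `$match`/`$group` pipeline runs inside MongoDB before any rows reach Spark. It assumes the 10.x connector (data source name `mongodb`, with the options `connection.uri`, `database`, `collection`, and `aggregation.pipeline`); older connector versions use different names. The database, collection, field names, and helper functions are hypothetical.

```python
import json


def build_read_options(uri, database, collection, pipeline):
    """Return Spark Connector read options that push `pipeline` down to MongoDB."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
        # The connector takes the pipeline as a JSON string.
        "aggregation.pipeline": json.dumps(pipeline),
    }


def load_dataframe(spark, options):
    """Load a DataFrame; `spark` is the SparkSession a Databricks notebook provides."""
    reader = spark.read.format("mongodb")
    for key, value in options.items():
        reader = reader.option(key, value)
    return reader.load()


# Hypothetical pipeline: filter and aggregate in MongoDB, where indexes can help.
pipeline = [
    {"$match": {"storeLocation": "Denver"}},
    {"$group": {"_id": "$purchaseMethod", "orders": {"$sum": 1}}},
]
options = build_read_options(
    "mongodb+srv://<user>:<password>@<cluster-host>/",  # placeholder URI
    "sample_supplies",
    "sales",
    pipeline,
)
# In a notebook: df = load_dataframe(spark, options)
```

Keeping the filter and grouping in the pipeline means Spark receives only the aggregated rows rather than the full collection.
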
Result sets generated from the Databricks notebooks can then be inserted back into MongoDB collections or pushed into Delta Lake for long-running analytics and machine learning.

Easier integration using Databricks' MongoDB Notebook

A Databricks notebook is a web-based interface that contains runnable code, visualizations, and explanatory text in the form of paragraphs. It lets personas such as data scientists and data engineers build linked sets of code in different languages and visualize results in a format they are used to working with. Notebooks are great for collaboration and can be easily iterated on and improved.

MongoDB and Databricks created an example notebook that has sample code for:

Reading the data from MongoDB Atlas collections as is into Spark dataframes.
Pre-processing and filtering the data from Atlas collections using the aggregation framework, before passing it into Spark dataframes.
Enriching/transforming the data using SparkSQL.
Writing the enriched data back to the MongoDB Atlas collection.

Figure 1: Screenshot of data sources in a Databricks notebook.

This notebook can serve as an initial template for developers to start building complex transformation jobs on MongoDB data with the Databricks platform.

Interested in a practical example of how this works? Let's demonstrate how you can run analytics on a sample sales dataset using MongoDB's aggregation framework and visualize it with Charts. The example also explains how you can enrich this data using our Databricks notebook and load it back to MongoDB. Refer to the GitHub repo for details.

Figure 2: Ways to integrate MongoDB and the Databricks Lakehouse Platform.

In addition to Spark, MongoDB and Databricks provide seamless integration through shared Cloud Object stores to enable a more traditional data exchange using analytics-optimized formats such as Parquet, as well as event streaming integration using Apache Kafka.
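
The last two notebook steps described above (enriching with SparkSQL and writing back to Atlas) might look roughly like the sketch below. It is not the code from the example notebook. It assumes the 10.x Spark Connector's `mongodb` data source and its `connection.uri`, `database`, and `collection` write options, and the view name, column names, and target collection are hypothetical.

```python
def enrich_with_sql(spark, df, view_name="sales"):
    """Register `df` as a temporary view and derive a new column with SparkSQL."""
    df.createOrReplaceTempView(view_name)
    # Hypothetical enrichment: flag large orders; `order_total` is an assumed column.
    return spark.sql(
        f"SELECT *, order_total >= 1000 AS is_large_order FROM {view_name}"
    )


def write_to_atlas(df, uri, database, collection, mode="append"):
    """Write a DataFrame back to a MongoDB Atlas collection."""
    (
        df.write.format("mongodb")
        .mode(mode)
        .option("connection.uri", uri)
        .option("database", database)
        .option("collection", collection)
        .save()
    )


# In a Databricks notebook, where `spark` and a loaded `df` already exist:
# enriched = enrich_with_sql(spark, df)
# write_to_atlas(enriched, "<atlas-uri>", "sample_supplies", "sales_enriched")
```

Writing to a separate collection, as in the commented usage, is one way to keep the enriched output apart from the operational source data; the example notebook may make a different choice.
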
Together, MongoDB and Databricks offer unparalleled abilities to unify and process data from disparate systems in real time. And now with the newly announced Databricks notebooks integration, data engineers and data scientists have an even easier and more intuitive interface to harness MongoDB data for their most sophisticated analytics and AI processing, making real-time applications more intelligent.

Conclusion

MongoDB Atlas and the Databricks Platform together will help organizations handle the increasing convergence between operational and analytical workloads. This convergence enables application-driven analytics and will help you build smarter applications and derive the right insights in real time. Reach out to to learn more.

November 21, 2022