Announcing Atlas Data Federation and Atlas Data Lake
Two years ago, we released the first iteration of Atlas Data Lake . Since then, we’ve helped customers combine data from various storage layers to feed downstream systems. But after years spent studying our customers’ experiences, we realized we hadn’t gone far enough. To truly unleash the genius in all our developers, we needed to add an economical cloud object storage solution with a rich MQL query experience to the world of Atlas. Today, we’re thrilled to announce that our new Atlas Data Federation and Atlas Data Lake offerings do just that. We now offer two complementary services, Atlas Data Federation (our existing query service formerly known as Atlas Data Lake) and our new and improved Atlas Data Lake (a fully managed analytic-oriented storage service). Together, these services (both in preview) provide flexible and versatile options for querying and transforming data across storage services, as well as a MongoDB-native analytic storage solution. With these tools, you can query across multiple clusters, move data into self managed cloud object storage for consumption by downstream services, query a workload-isolated inexpensive copy of cluster data, compare your cluster data across different points in time, and much, much more. In hearing from our customers about their experiences with Atlas Data Lake, we learned where they have struggled, as well as the features they’ve been looking for us to provide. With this in mind, we decided to shift the name of our current query federation service to Atlas Data Federation to better align with how customers see the service and are getting value. We’ve seen many customers benefit from the flexibility of a federated query engine service, including querying data across multiple clusters, databases, and collections, as well as exporting data to third-party systems. We also saw where our customers were struggling with data lakes. We heard them ask for a fully managed storage solution so they could achieve all of their analytic goals within Atlas. Specifically, customers wanted scalable storage that would provide high query performance at a low cost. Our new Data Lake provides a high-performance analytic object storage solution, allowing customers to query historical data with no additional formatting or maintenance work needed on their end. How it works Atlas Data Federation encompasses our existing Data Lake functionality with several new affordances. It continues to deliver the same power that it always has, with increased performance and efficiency. The new Atlas Data Lake will now allow you to create Data Lake pipelines (based on your Atlas Cluster backup schedules) and fields on which you can optimize queries. The service takes the following steps: On the selected schedule, a copy of your collection will be extracted from your Atlas backup with no impact to your cluster. During extraction, we build partition indexes based on the contents of your documents and the fields you’ve selected for optimization. These indexes allow your queries to be optimized by capturing the minimums and maximums (and other stats) of the records in each partition, letting you quickly find the relevant data for your queries. Finally, the underlying data lands in an analytic-oriented format inside of cloud object storage. This minimizes data scanned when you execute a query. Once a pipeline has run and a Data Lake dataset has been created, you can select it as a data source in our new Data Federation query experience. You can either set it as the source for a specific virtual collection in a Federated Database Instance or you can have your Federated Database Instance generate a collection name for each dataset that your pipeline has created. Amazingly, no part of this process will consume compute resources from your cluster — neither the export nor the querying of datasets. These datasets provide workload isolation and consistency for long-running analytic queries, a target for ETL jobs using the powerful $out to S3. This makes it easy to compare the state of your data over time. Advanced though this is, it’s only the beginning of the story. We’re committing to evolving the service, improving performance, adding more sources of data, and building new features. All of this will be based on the feedback you, the user, gives us. We can’t wait to see how you’ll use this powerful new tool and can’t wait to hear what you’d like to see next. Try Atlas Data Lake Today
Visualize Blended Atlas and AWS S3 Data From Atlas Data Lake with MongoDB Charts
As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation’s functionality is unchanged and you can learn more about it here . Atlas Data Lake will remain in the Atlas Platform, with newly introduced functionality that you can learn about here . We’re excited to announce that MongoDB Charts supports Atlas Data Lake as a data source! You can now use Charts to easily visualize data stored across different Atlas databases and AWS S3 buckets. Thanks to the aggregating power of Atlas Data Lake’s federated query, creating charts and graphs from blended application and cloud object data is simpler than ever before. On the surface this powerful integration is as simple as adding your Atlas Data Lake as a data source within Charts. However, it unlocks a deeper level of analysis while eliminating the need for creating an Extract-Transform-Load (ETL) process across your Atlas and S3 data. The integration provides the ability to visualize data from the following combination of sources without writing any code: Data from many Atlas databases or clusters, including multi-cloud clusters Cloud storage data from AWS S3 Blended Atlas and cloud storage (AWS S3) data Scenario: Finding insights from aggregated customer profile and contract data Let’s add a real world scenario of how this can enhance the analytics you derive from your data. While doing so, we will walk through the steps of setting up your Atlas Data Lake, adding it as a data source to Charts, and getting the most of your data with Charts’ powerful visualization capabilities. For context, let’s imagine we’re an analyst at a telecom company and we have contract data that is stored in MongoDB Atlas in different clusters and databases for each country we operate in - United States and Canada. Second, we have offloaded data from our Customer Relationship Management (CRM) tool as a parquet file into an AWS S3 bucket. All three datasets share a common “customerID” field. Configure Atlas Data Lake Because both “contracts” collections (or datasets) in MongoDB Atlas share the same fields, I simply mapped both into a single collection within the data lake. I mapped the customer profiles dataset into its own collection, since it only shares the “customerID” field. However, now that it’s in the same data lake, I will easily be able to join it to my contract data with a $lookup in my Charts aggregation pipeline or with a Lookup Field in the chart builder. (A $lookup in the MongoDB Query API is equivalent to a join in SQL.) Configure Charts data source I want to find insights from all contracts, both US and Canada in this scenario. Once I have created a single Atlas Data Lake collection (DL_contracts.allcontracts) from the two separate databases, I then need to add it as a data source in Charts. Simply click on “add data source” within Charts and add your data lake, and then choose the collections we want to use in the next step. For completeness I also added the two Atlas collections (US and Canada contracts) as data sources in Charts by following the same steps. Visualize data across multiple Atlas databases With Atlas Data Lake’s federated query capability, which effectively performs a union of data, I am able to build a column chart that shows the amount of all US and CA contracts in a single chart without writing any code. As you can see below, the chart shows both US and CA columns when connected to the data lake collection. When the data source is switched directly to either Atlas database, it only shows data for that respective database, or country in this example. Visualize blended data from Atlas and an AWS S3 bucket Lastly, let’s take our insights to the next level by visualizing data from multiple Atlas databases and a parquet file that’s stored in an AWS S3 bucket. Adding customer profile data that I offloaded from my CRM tool into S3 will enable me to find more robust insights. I could also visualize the data from the parquet file alone by connecting to that data lake collection. Since the contract data and customer profile data are in different collections within my Atlas Data Lake, I created a $lookup in the aggregation pipeline of the Charts data source. I then created a table chart from three different data sources with conditional formatting to quickly identify high value customers. The columns with blue boxes include contract data from both Atlas clusters, while the columns with orange boxes include customer profile data from a parquet file via AWS S3 bucket. Note, I could also aggregate the data in Atlas Data Lake and use $out to create a new collection of the data , and then connect Charts to the new collection as a data source. For the purposes of this blog, I wanted to highlight Charts-specific aggregation capabilities. We hope that you’re excited about the ability to easily visualize multiple data sources, from multiple Atlas databases to AWS S3 buckets in one place! Remember, if you haven’t used Charts before, you can get started for free by signing up for MongoDB Cloud , deploying an Atlas cluster and activating Charts. Try MongoDB Atlas for free today!