As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation’s functionality is unchanged and you can learn more about it here. Atlas Data Lake will remain in the Atlas Platform, with newly introduced functionality that you can learn about here.
MongoDB Atlas Data Lake is a new member of the MongoDB Atlas family. It has just been announced at MongoDB World and is available in public beta. It takes the technology that has made MongoDB the most popular document database in the world and applies it to the great data lakes of the cloud. As companies have accumulated more and more data in cloud storage such as Amazon S3, the need to process that data effectively has grown.
With MongoDB Atlas Data Lake, you use the MongoDB Query Language, which is built for rich, complex structures, to work with data stored in JSON, BSON, CSV, TSV, Avro, and Parquet formats. Data is analyzed on demand with no infrastructure setup and no time-consuming transformations, pre-processing, or metadata management. There's no schema to pre-define, so you can start working with your data sooner.
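To give a feel for what a MongoDB Query Language request might look like, here is a minimal sketch that builds an aggregation pipeline as plain Python data. The field names `total` and `region` and the helper `build_sales_pipeline` are hypothetical and chosen only for illustration; the stage operators (`$match`, `$group`, `$sort`) are standard aggregation stages.

```python
from typing import Any, Dict, List


def build_sales_pipeline(min_total: float) -> List[Dict[str, Any]]:
    """Build an aggregation pipeline as plain Python data (illustrative only)."""
    return [
        # Keep only documents whose (hypothetical) "total" field is large enough.
        {"$match": {"total": {"$gte": min_total}}},
        # Group by the (hypothetical) "region" field and sum the totals.
        {"$group": {"_id": "$region", "revenue": {"$sum": "$total"}}},
        # Largest revenue first.
        {"$sort": {"revenue": -1}},
    ]


pipeline = build_sales_pipeline(100.0)
```

With a driver such as PyMongo, a list like this would typically be passed to a collection's `aggregate` method; the connection details are omitted here because they depend on your own Atlas setup.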
MongoDB Atlas Data Lake demonstrated at MongoDB World 2019
Because Atlas Data Lake is an on-demand service in MongoDB's Atlas cloud data platform, there's no deployment process. All you need to do to begin exploring your data is provide access to your S3 storage buckets. Users will configure Atlas Data Lake from the same UI as MongoDB Atlas operational clusters, through a simple wizard that lets them configure permissions, give read-only access to their S3 buckets, map S3 directories to databases and collections, and get them ready to run queries. Atlas Data Lake will also provide stats on queries executed, data scanned and returned, and average execution time.
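The idea of mapping S3 directories to databases and collections can be pictured with a small sketch. This is not the Atlas configuration format; the prefixes, database names, collection names, and the `resolve_namespace` helper are all hypothetical, and the longest-prefix rule is an assumption made for the example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping from S3 key prefixes to (database, collection) names.
PREFIX_MAP: Dict[str, Tuple[str, str]] = {
    "logs/web/": ("analytics", "web_logs"),
    "logs/app/": ("analytics", "app_logs"),
    "exports/orders/": ("sales", "orders"),
}


def resolve_namespace(
    key: str, prefix_map: Dict[str, Tuple[str, str]] = PREFIX_MAP
) -> Optional[Tuple[str, str]]:
    """Return the (database, collection) for an S3 object key, or None."""
    # Collect every configured prefix that the key starts with.
    matches = [prefix for prefix in prefix_map if key.startswith(prefix)]
    if not matches:
        return None
    # Prefer the longest matching prefix so nested directories win.
    return prefix_map[max(matches, key=len)]
```

For instance, under this hypothetical mapping an object stored at `logs/web/2019/06/18.json` would appear in the `web_logs` collection of the `analytics` database, while a key outside every mapped directory would not be exposed at all.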
Behind the scenes, MongoDB Atlas Data Lake currently deploys multiple compute nodes to analyze each S3 bucket and process queries against that bucket's data. These nodes work in parallel, in the bucket's region, for fast processing and to minimize data transfer and its associated cost. When done, each node returns its results to a central node that sorts, filters, and aggregates the separate results into a final result as needed. For the Data Lake user, this process is entirely transparent, leaving them free to get on with extracting value and insight from that data. It also means that there is no limit on the number of concurrent queries that can be run against the data. Future enhancements to the compute node architecture will also be transparent to the user.
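The parallel-nodes-plus-central-node pattern described above is often called scatter-gather. The following is a toy sketch of that general pattern, not Atlas Data Lake's actual implementation: threads stand in for compute nodes, in-memory lists stand in for partitions of a bucket, and the function names and the `region` field are hypothetical.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

Doc = Dict[str, object]


def partial_count(partition: List[Doc], field: str) -> Counter:
    """Worker step: count the values of `field` within one partition."""
    return Counter(doc[field] for doc in partition if field in doc)


def scatter_gather_count(
    partitions: List[List[Doc]], field: str
) -> List[Tuple[object, int]]:
    """Run partial counts in parallel, then merge and sort them centrally."""
    # Scatter: one worker per partition (threads stand in for compute nodes).
    with ThreadPoolExecutor(max_workers=max(1, len(partitions))) as pool:
        partials = list(pool.map(lambda p: partial_count(p, field), partitions))
    # Gather: the "central node" merges the partial results.
    merged: Counter = Counter()
    for partial in partials:
        merged.update(partial)
    # Sort by count descending, then by key for a deterministic order.
    return sorted(merged.items(), key=lambda kv: (-kv[1], str(kv[0])))


partitions = [
    [{"region": "eu"}, {"region": "us"}],
    [{"region": "us"}, {"region": "apac"}],
    [{"region": "us"}, {"region": "eu"}],
]
result = scatter_gather_count(partitions, "region")
# result == [("us", 3), ("eu", 2), ("apac", 1)]
```

Each worker only ever sees its own partition, so the central step has to be able to combine partial results; counts and sums merge trivially, which is why they make a convenient illustration.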
MongoDB Atlas Data Lake is designed to get the best from your data lake with the tools and platforms you already use, whether you want to analyze data, build data services, feed machine learning and AI or build active archives.