
MongoDB Atlas Data Lake Tutorial: Federated Queries and $out to AWS S3

Maxime Beugnet • Published Feb 07, 2022 • Updated Sep 23, 2022
AWS • Atlas • Data Federation
As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation’s functionality is unchanged; see the Atlas Data Federation documentation to learn more. Atlas Data Lake remains in the Atlas platform with newly introduced functionality, described in the Atlas Data Lake documentation.
Last year at MongoDB World 2019, Eliot announced that MongoDB Atlas Data Lake was a new tool available in beta in the MongoDB Cloud Platform.
Over the past year, MongoDB worked closely with many customers to test this new tool and gathered a lot of feedback to make it even better.
Today, after a year of refinement and improvement, MongoDB is proud to announce that MongoDB Atlas Data Lake is now generally available and can be used with confidence in your production environment.
In this tutorial, I will show you a new feature of MongoDB Atlas Data Lake called Federated Query that allows you to access your archived documents in S3 AND your documents in your MongoDB Atlas cluster with a SINGLE MQL query.
"MongoDB Atlas Data Lake Federated Queries"
This feature is really amazing because it allows you to have easy access to your archived data in S3 along with your "hot" data in your Atlas cluster. This feature could help you prevent your Atlas clusters from growing in size indefinitely and reduce your costs drastically. It also makes it easier to gain new insights by easily querying data residing in S3 and exposing it to your real-time app.
Finally, I will show you how to use the new version of the $out aggregation pipeline stage to write documents from a MongoDB Atlas cluster into an AWS S3 bucket.


Prerequisites

To follow along with this tutorial, you need to:
  • create at least an M10 cluster in MongoDB Atlas,
  • create a user in the Database Access menu,
  • add your IP address to the Network Access List in the Network Access menu,
  • have Python 3 with the pymongo and dnspython libraries installed,
  • have the legacy Mongo Shell or the new MongoDB Shell (mongosh) installed.
If you did these actions correctly, you should have an M10 (or bigger) cluster running in your MongoDB Atlas project.
"MongoDB Atlas M10 cluster"
And your Data Lake page should look like this:
"MongoDB Atlas Data Lake setup"
You should also have an AWS S3 bucket linked to your Atlas Data Lake setup. Mine is called cold-data-mongodb in this tutorial.
A MongoDB Atlas M10 or bigger cluster is required here because MongoDB Atlas Data Lake uses X.509 certificates which are not supported on MongoDB Atlas shared tier at this time.

We need some data

To illustrate how $out and federated queries work, I will use an overly simple dataset to keep things as easy as possible to understand.
Connect to your M10 cluster with the MongoDB Shell (the CONNECT button gives you the connection command) and let's insert these 4 documents into the test database:
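The same insert can be sketched in Python with pymongo instead of the shell. This is a minimal sketch under assumptions: the only facts taken from this tutorial are that there are 4 orders, that each has a `created` date, and that 2 of them are from May 2020. The `items` and `price` fields, the `_id` values, and the exact dates are illustrative.

```python
from datetime import datetime


def sample_orders():
    """Return four illustrative orders: two from May 2020, two from June 2020.

    Field names other than `created` and all values are assumptions.
    """
    return [
        {"_id": 1, "created": datetime(2020, 5, 30), "items": [1, 3], "price": 20},
        {"_id": 2, "created": datetime(2020, 5, 31), "items": [2, 3], "price": 25},
        {"_id": 3, "created": datetime(2020, 6, 1), "items": [1, 3], "price": 20},
        {"_id": 4, "created": datetime(2020, 6, 2), "items": [1, 2], "price": 15},
    ]


def insert_sample_orders(collection):
    """Insert the sample orders into a pymongo Collection and return their _ids."""
    result = collection.insert_many(sample_orders())
    return result.inserted_ids
```

With pymongo installed, this could be called as `insert_sample_orders(MongoClient("<YOUR_ATLAS_CLUSTER_URI>")["test"]["orders"])`, where the URI is a placeholder for your own cluster connection string.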
This is the result you should see in your terminal:
"4 documents inserted in MongoDB Atlas M10 cluster"

Archive Data to S3 with $out

Now that we have a "massive" collection of orders, we can consider archiving the oldest orders to an S3 bucket. Let's imagine that once a month is over, I can archive all the orders from the previous month. I will create one JSON file in S3 for all the orders created during the previous month.
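The "archive the previous month once it is over" rule boils down to computing a date range. A small sketch of that computation follows; the function name `previous_month_range` is hypothetical and not part of the original tutorial code.

```python
from datetime import datetime


def previous_month_range(today):
    """Return (start, end) covering the calendar month before `today`.

    `start` is inclusive and `end` is exclusive, so the pair can be used
    directly in a {"$gte": start, "$lt": end} filter on the `created` field.
    """
    end = datetime(today.year, today.month, 1)
    if end.month == 1:
        # January: the previous month is December of the previous year.
        start = datetime(end.year - 1, 12, 1)
    else:
        start = datetime(end.year, end.month - 1, 1)
    return start, end
```

For example, running this on any day in June 2020 yields the range from May 1st (inclusive) to June 1st (exclusive).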
Let's transfer these orders to S3 using the aggregation pipeline stage $out.
But first, we need to configure Atlas Data Lake correctly.
Data Lake Configuration
The first thing we need to do is to make sure we can write to our S3 bucket and read the archived orders as well as the current orders in my M10 cluster.
This new feature in MongoDB Atlas Data Lake is called Federated Queries.
Now head to your Data Lake configuration:
"MongoDB Atlas Data Lake configuration"
And let's use the following configuration:
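The configuration listing is not reproduced here, so the following is a sketch of what such a storage configuration could look like, built as a Python dictionary and printed as JSON for the configuration editor. The bucket name `cold-data-mongodb` comes from this tutorial; the region, the cluster name, and the project ID are placeholders and assumptions to replace with your own values.

```python
import json


def data_lake_config(bucket, region, cluster_name, project_id):
    """Build a storage configuration with one S3 store and one Atlas store.

    The virtual collection test.orders reads from both stores.
    """
    return {
        "databases": [
            {
                "name": "test",
                "collections": [
                    {
                        "name": "orders",
                        "dataSources": [
                            {
                                # File names carry the min and max `created` dates.
                                "path": "/{min(created) isodate}-{max(created) isodate}*",
                                "storeName": bucket,
                            },
                            {
                                "collection": "orders",
                                "database": "test",
                                "storeName": cluster_name,
                            },
                        ],
                    }
                ],
                "views": [],
            }
        ],
        "stores": [
            {
                "provider": "s3",
                "bucket": bucket,
                "delimiter": "/",
                "includeTags": False,
                "name": bucket,
                "region": region,
            },
            {
                "provider": "atlas",
                "clusterName": cluster_name,
                "name": cluster_name,
                "projectId": project_id,
            },
        ],
    }


# Placeholder values: replace the region, cluster name, and project ID.
config = data_lake_config("cold-data-mongodb", "<YOUR_AWS_REGION>",
                          "<YOUR_CLUSTER_NAME>", "<YOUR_PROJECT_ID>")
print(json.dumps(config, indent=2))
```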
In this configuration, you can see that we have configured:
  • an S3 store: this is my S3 bucket that I named "cold-data-mongodb",
  • an Atlas store: this is my M10 cluster,
  • a database test with a collection orders that contains the data from this S3 store AND the data from my collection test.orders from my M10 cluster.
Feel free to replace cold-data-mongodb with your own bucket name: the very same one that you used during the Atlas Data Lake setup.
You can find your MongoDB Atlas project ID in your project settings:
"MongoDB Atlas Project ID"
In the path, I also told Atlas Data Lake that the JSON filename contains the min and max created dates of the orders it contains. This is useful for performance purposes: Atlas Data Lake won't have to scan all the files if I'm looking for an order on a given date. You can read more about data partitioning in the Data Lake documentation.
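To see why date ranges in file names help, here is a conceptual sketch of range-based file pruning. It is an illustration only, not how Atlas Data Lake is implemented: the helper names are hypothetical, and the `.1.json` suffix in the example file names is an assumption.

```python
from datetime import datetime

STAMP = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_range(filename):
    """Extract (min_created, max_created) from a name shaped like '<min>-<max>...'.

    Both dates are expected in ISO format with milliseconds and a trailing 'Z'.
    """
    first, rest = filename.split("Z-", 1)
    second = rest.split("Z", 1)[0]
    return (datetime.strptime(first + "Z", STAMP),
            datetime.strptime(second + "Z", STAMP))


def files_to_scan(filenames, wanted):
    """Keep only the files whose date range can contain `wanted`."""
    selected = []
    for name in filenames:
        low, high = parse_range(name)
        if low <= wanted <= high:
            selected.append(name)
    return selected
```

Given one file covering April 2020 and another covering May 2020, a lookup for an order created on May 30th only needs to open the second file.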
$out to S3
Let's now collect the URI we are going to use to connect to Atlas Data Lake.
Click on the connect button:
"MongoDB Atlas Data Lake connect button"
Click on "Connect your application" and collect your URI:
"MongoDB Atlas Data Lake URI"
Now let's use Python to execute our aggregation pipeline and archive the 2 orders from May 2020 in our S3 bucket.
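The script itself is not reproduced here. A minimal sketch of such an archive step follows, assuming a `$match` on the `created` field followed by `$out` to S3; the helper names are hypothetical, and the region and the `maxFileSize` value are assumptions to adapt. The file name is built from the start and end dates so that it fits a path that encodes the min and max `created` dates.

```python
from datetime import datetime


def s3_filename(start, end):
    """Build '<start>Z-<end>Z' with millisecond ISO dates."""
    return (start.isoformat("T", "milliseconds") + "Z-"
            + end.isoformat("T", "milliseconds") + "Z")


def archive_pipeline(start, end, bucket, region):
    """Select orders created in [start, end) and write them to S3 as JSON."""
    return [
        {"$match": {"created": {"$gte": start, "$lt": end}}},
        {
            "$out": {
                "s3": {
                    "bucket": bucket,
                    "region": region,
                    "filename": s3_filename(start, end),
                    # The maxFileSize value is an assumption.
                    "format": {"name": "json", "maxFileSize": "200MiB"},
                }
            }
        },
    ]


def archive_orders(collection, start, end, bucket, region):
    """Run the pipeline; `collection` must come from the Data Lake connection."""
    collection.aggregate(archive_pipeline(start, end, bucket, region))
```

With pymongo, `collection` would be `MongoClient("<YOUR_ATLAS_DATA_LAKE_URI>")["test"]["orders"]`, and the May 2020 archive corresponds to `start = datetime(2020, 5, 1)` and `end = datetime(2020, 6, 1)`.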
To execute this code, make sure you have Python 3 and the pymongo and dnspython dependencies installed.
And now we can confirm that our archive was created correctly in our S3 bucket:
"file in the S3 bucket"
Finish the Work
Now that our orders are safe in S3, I can delete these 2 orders from my Atlas cluster. Let's use Python again but this time, we need to use the URI from our Atlas cluster. The Atlas Data Lake URI doesn't allow this kind of operation.
Let's run this code:
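The deletion script is not reproduced here. As a sketch, assuming the same May 2020 date range that was archived, the cleanup could look like this; the function name is hypothetical.

```python
from datetime import datetime


def delete_archived_orders(collection, start, end):
    """Delete orders created in [start, end) and return how many were removed.

    `collection` must come from the Atlas cluster connection,
    not from the Atlas Data Lake connection.
    """
    result = collection.delete_many({"created": {"$gte": start, "$lt": end}})
    return result.deleted_count


# The range that was archived: May 1st (inclusive) to June 1st (exclusive).
start_date = datetime(2020, 5, 1)
end_date = datetime(2020, 6, 1)
```

With pymongo, this would be called as `delete_archived_orders(MongoClient("<YOUR_ATLAS_CLUSTER_URI>")["test"]["orders"], start_date, end_date)`.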
Now let's double check what we have in S3. Here is the content of the S3 file I downloaded:
And here is what's left in my MongoDB Atlas cluster.
"Documents left in MongoDB Atlas cluster"
Federated Queries
As mentioned above, federated queries in MongoDB Atlas Data Lake allow me to retain easy access to 100% of my data. I actually already used this feature when I ran the aggregation pipeline with the $out stage.
Let's verify this one last time with Python:
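The verification script is not reproduced here. A minimal sketch, assuming a plain `find` on the federated `test.orders` collection through the Data Lake connection, could be as follows; the function name is hypothetical.

```python
def all_orders(collection):
    """Return every order visible through the federated test.orders collection.

    `collection` must come from the Atlas Data Lake connection so that
    documents from both the S3 store and the Atlas store are returned.
    """
    return list(collection.find({}))


def print_orders(collection):
    """Print each order on its own line."""
    for doc in all_orders(collection):
        print(doc)
```

With pymongo, `collection` would be `MongoClient("<YOUR_ATLAS_DATA_LAKE_URI>")["test"]["orders"]`. If the setup above worked, the archived orders and the orders still in the cluster should all appear.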
Here is the output:

Wrap Up

MongoDB Atlas Data Lake is now production-ready and generally available starting today.
If you have a lot of infrequently accessed data in your Atlas cluster but you still need to be able to query it and access it easily once you've archived it to S3, Atlas Data Lake and the new Federated Query feature will help you save tons of money. If you're looking for an automated way to archive your data from Atlas clusters to fully managed S3 storage, then check out our new Atlas Online Archive feature!
Storing cold data on S3 is a lot cheaper than scaling up a MongoDB Atlas cluster that needs more RAM and storage only because it is full of cold data.
All the Python code is available in this GitHub repository.
Please let me know on Twitter if you liked my blog post: @MBeugnet.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will give you a hand.
