MongoDB Atlas Data Lake Tutorial: Federated Queries and $out to AWS S3
As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation’s functionality is unchanged and you can learn more about it here. Atlas Data Lake will remain in the Atlas Platform, with newly introduced functionality that you can learn about here.

Last year at MongoDB World 2019, Eliot announced that MongoDB Atlas Data Lake was a new tool available in beta in the MongoDB Cloud Platform.

During this last year, MongoDB has been working closely with many customers to test this new tool and has gathered a lot of feedback to make it even better.
Today, after a year of refinement and improvement, MongoDB is proud to announce that MongoDB Atlas Data Lake is now generally available and can be used with confidence in your production environment.
In this tutorial, I will show you a new feature of MongoDB Atlas Data Lake called Federated Query that allows you to access your archived documents in S3 AND your documents in your MongoDB Atlas cluster with a SINGLE MQL query.

This feature is really amazing because it gives you easy access to your archived data in S3 along with your "hot" data in your Atlas cluster. It can help you prevent your Atlas clusters from growing in size indefinitely and reduce your costs drastically. It also makes it easier to gain new insights by querying data residing in S3 and exposing it to your real-time app.
Finally, I will show you how to use the new version of the $out aggregation pipeline stage to write documents from a MongoDB Atlas cluster into an AWS S3 bucket.

In order to follow along with this tutorial, you need to:
- create a user in the Database Access menu,
- add your IP address in the Network Access List in the Network Access menu,
- have Python 3 with the pymongo and dnspython libraries installed,
- have the Mongo Shell installed, or maybe the new MongoDB Shell.
If you did these actions correctly, you should have an M10 (or bigger) cluster running in your MongoDB Atlas project.

And your Data Lake page should look like this:

You should also have an AWS S3 bucket linked to your Atlas Data Lake setup. Mine is called cold-data-mongodb in this tutorial.

A MongoDB Atlas M10 or bigger cluster is required here because MongoDB Atlas Data Lake uses X.509 certificates, which are not supported on the MongoDB Atlas shared tier at this time.

To illustrate how $out and federated queries work, I will use an overly simple dataset to keep things as easy as possible to understand.

Connect to your M10 cluster with the CONNECT button, using the MongoDB Shell, and insert these 4 documents in the test database, then check the result in your terminal.
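Here is a minimal pymongo sketch of four such orders, in case you prefer to script the insert rather than type it in the shell. The field names and values are assumptions made for this walkthrough; the only thing that matters later is the created date, with two orders in May 2020 and two in June 2020.

```python
from datetime import datetime

from pymongo import MongoClient

# Connect to the M10 Atlas cluster (replace the placeholder with your own URI).
client = MongoClient("<YOUR-ATLAS-CLUSTER-URI>")
orders = client.get_database("test").get_collection("orders")

# Four made-up orders: two created in May 2020 and two in June 2020.
orders.insert_many(
    [
        {"created": datetime(2020, 5, 3), "items": ["keyboard"], "price": 25},
        {"created": datetime(2020, 5, 27), "items": ["mouse"], "price": 10},
        {"created": datetime(2020, 6, 4), "items": ["screen"], "price": 150},
        {"created": datetime(2020, 6, 21), "items": ["laptop"], "price": 900},
    ]
)
print(orders.count_documents({}), "orders in the collection.")
```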

Now that we have a "massive" collection of orders, we can consider archiving the oldest orders to an S3 bucket. Let's imagine that once a month is over, I archive all the orders from that month into a single JSON file in S3.
But first, we need to configure Atlas Data Lake correctly. We need to make sure we can write to our S3 bucket and read the archived orders as well as the current orders in our M10 cluster.
Now head to your Data Lake configuration:

And let's use the following configuration:
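Here is a sketch of what that storage configuration looks like for my setup. Treat it as an assumption to adapt: replace the placeholders with your own AWS region, Atlas cluster name, and project ID, and double-check the exact path syntax against the Data Lake documentation.

```json
{
  "stores": [
    {
      "name": "cold-data-mongodb",
      "provider": "s3",
      "bucket": "cold-data-mongodb",
      "region": "<AWS-REGION>",
      "delimiter": "/"
    },
    {
      "name": "atlas-cluster",
      "provider": "atlas",
      "clusterName": "<YOUR-CLUSTER-NAME>",
      "projectId": "<YOUR-PROJECT-ID>"
    }
  ],
  "databases": [
    {
      "name": "test",
      "collections": [
        {
          "name": "orders",
          "dataSources": [
            {
              "storeName": "cold-data-mongodb",
              "path": "/{min(created) isodate}-{max(created) isodate}*"
            },
            {
              "storeName": "atlas-cluster",
              "database": "test",
              "collection": "orders"
            }
          ]
        }
      ]
    }
  ]
}
```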
In this configuration, you can see that we have configured:
- an S3 store: this is my S3 bucket that I named "cold-data-mongodb",
- an Atlas store: this is my M10 cluster,
- a database test with a collection orders that contains the data from this S3 store AND the data from my collection test.orders from my M10 cluster.
Feel free to replace cold-data-mongodb with your own bucket name: the very same one that you used during the Atlas Data Lake setup.

You can find your MongoDB Atlas project ID in your project settings:

In the path, I also told Atlas Data Lake that the JSON filename contains the min and max created dates of the orders it contains. This is useful for performance purposes: Atlas Data Lake won't have to scan all the files if I'm looking for an order on a given date. You can read more about data partitioning in the Data Lake documentation.

Let's now collect the URI we are going to use to connect to Atlas Data Lake.
Click on the connect button:

Click on "Connect your application" and collect your URI:

Now let's use Python to execute our aggregation pipeline and archive the 2 orders from May 2020 in our S3 bucket.
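Here is a sketch of the script I run, assuming a <YOUR-DATA-LAKE-URI> placeholder for the Data Lake connection string collected above. The AWS region and the output filename are also assumptions: the filename should match the min/max created pattern defined in the path of the storage configuration.

```python
from datetime import datetime

from pymongo import MongoClient

# Connect to Atlas Data Lake, which exposes the federated test.orders collection.
client = MongoClient("<YOUR-DATA-LAKE-URI>")
orders = client.get_database("test").get_collection("orders")

# Boundaries of the month we want to archive: May 2020.
start_date = datetime(2020, 5, 1)
end_date = datetime(2020, 6, 1)

pipeline = [
    # Keep only the orders created in May 2020.
    {"$match": {"created": {"$gte": start_date, "$lt": end_date}}},
    # Write them to a single JSON file in the S3 bucket.
    {
        "$out": {
            "s3": {
                "bucket": "cold-data-mongodb",
                "region": "<AWS-REGION>",
                "filename": "2020-05-03T00:00:00.000Z-2020-05-27T00:00:00.000Z",
                "format": {"name": "json"},
            }
        }
    },
]

orders.aggregate(pipeline)
print("Orders from May 2020 archived to S3.")
```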
To execute this code, make sure you have Python 3 with the pymongo and dnspython dependencies installed.
And now we can confirm that our archive was created correctly in our S3 bucket:

Now that our orders are safe in S3, I can delete these 2 orders from my Atlas cluster. Let's use Python again, but this time we need to use the URI of our Atlas cluster: the Atlas Data Lake URI doesn't allow this kind of operation.
Let's run this code:
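This is a sketch of that deletion, assuming a <YOUR-ATLAS-CLUSTER-URI> placeholder for the cluster connection string:

```python
from datetime import datetime

from pymongo import MongoClient

# This time we connect to the Atlas cluster itself, not to the Data Lake.
client = MongoClient("<YOUR-ATLAS-CLUSTER-URI>")
orders = client.get_database("test").get_collection("orders")

# Remove the orders that are now archived in S3 (created in May 2020).
result = orders.delete_many(
    {"created": {"$gte": datetime(2020, 5, 1), "$lt": datetime(2020, 6, 1)}}
)
print(result.deleted_count, "orders deleted from the Atlas cluster.")
```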
Now let's double check what we have in S3. Here is the content of the S3 file I downloaded:
And here is what's left in my MongoDB Atlas cluster.

As mentioned above already, federated queries in MongoDB Atlas Data Lake allow me to retain easy access to 100% of my data. I actually already used this feature when I ran the aggregation pipeline with the $out stage.

Let's verify this one last time with Python:
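Here is a sketch of that final check, again assuming a <YOUR-DATA-LAKE-URI> placeholder:

```python
from pprint import pprint

from pymongo import MongoClient

# The Data Lake URI exposes the federated test.orders collection, which merges
# the archived orders in S3 with the live orders in the Atlas cluster.
client = MongoClient("<YOUR-DATA-LAKE-URI>")
orders = client.get_database("test").get_collection("orders")

# We should see all 4 orders: 2 read from S3 and 2 read from the Atlas cluster.
for order in orders.find():
    pprint(order)
```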
Here is the output:
MongoDB Atlas Data Lake is now production-ready and generally available starting today.
If you have a lot of infrequently accessed data in your Atlas cluster but you still need to be able to query it and access it easily once you've archived it to S3, Atlas Data Lake and the new Federated Query feature will help you save tons of money. If you're looking for an automated way to archive your data from Atlas clusters to fully-managed S3 storage, then check out our new Atlas Online Archive feature!

Storage on S3 is a lot cheaper than scaling up your MongoDB Atlas cluster when it's full of cold data and needs more RAM and storage to operate correctly.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will give you a hand.