MongoDB Atlas Data Lake Tutorial: Federated Queries and $out to AWS S3
As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation’s functionality is unchanged. Atlas Data Lake will remain in the Atlas Platform, with newly introduced functionality.
Over the last year, MongoDB has worked closely with many customers to test this new tool and has gathered a lot of feedback to make it even better.
Today, after a year of refinement and improvement, MongoDB is proud to announce that MongoDB Atlas Data Lake is now generally available and can be used with confidence in your production environment.
This feature is really amazing because it allows you to have easy access to your archived data in S3 along with your "hot" data in your Atlas cluster. This feature could help you prevent your Atlas clusters from growing in size indefinitely and reduce your costs drastically. It also makes it easier to gain new insights by easily querying data residing in S3 and exposing it to your real-time app.
To follow along with this tutorial, you need to:
- create a user in the Database Access menu,
- add your IP address in the Network Access List in the Network Access menu.
If you did these actions correctly, you should have an M10 (or bigger) cluster running in your MongoDB Atlas project.
And your Data Lake page should look like this:
You should also have an AWS S3 bucket linked to your Atlas Data Lake setup. Mine is called "cold-data-mongodb" in this tutorial.
To illustrate how $out and federated queries work, I will use an overly simple dataset to keep things as easy as possible to understand.
Connect to your M10 cluster with the MongoDB Shell, using the CONNECT button, and let's insert these 4 documents in the test.orders collection:
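The original shell listing is not reproduced here. As a rough Python equivalent, assuming the pymongo driver, the insert could look like the sketch below. Only the "created" field and the fact that two of the four orders date from May 2020 come from this tutorial; the other field names, all values, and the ATLAS_CLUSTER_URI environment variable are hypothetical.

```python
import os
from datetime import datetime


def sample_orders():
    """Return four illustrative orders.

    Only the `created` field and the two May 2020 orders come from the
    tutorial; the other fields and every value are hypothetical.
    """
    return [
        {"_id": 1, "created": datetime(2020, 5, 30), "items": [1, 3], "price": 20},
        {"_id": 2, "created": datetime(2020, 5, 31), "items": [2, 3], "price": 25},
        {"_id": 3, "created": datetime(2020, 6, 1), "items": [1, 3], "price": 20},
        {"_id": 4, "created": datetime(2020, 6, 2), "items": [1, 2], "price": 15},
    ]


def insert_sample_orders(uri):
    """Insert the sample orders into test.orders on the Atlas cluster."""
    # Imported lazily so that sample_orders() works without the driver.
    from pymongo import MongoClient

    collection = MongoClient(uri)["test"]["orders"]
    return collection.insert_many(sample_orders()).inserted_ids


# ATLAS_CLUSTER_URI is a hypothetical environment variable holding the
# connection string of your M10 cluster; nothing runs if it is unset.
uri = os.environ.get("ATLAS_CLUSTER_URI")
if uri:
    print(insert_sample_orders(uri))
```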
This is the result you should see in your terminal:
Now that we have a "massive" collection of orders, we can consider archiving the oldest orders to an S3 bucket. Let's imagine that once a month is over, I can archive all the orders from the previous month. I will create one JSON file in S3 for all the orders created during the previous month.
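The "previous month" window can be computed rather than hard-coded. The helper below is a small sketch of that idea; the function name is my own and is not part of the original tutorial.

```python
from datetime import datetime


def previous_month_range(today):
    """Return (start, end) for the calendar month before `today`.

    `start` is inclusive and `end` is exclusive, so the pair can be used
    directly in a {"$gte": start, "$lt": end} filter.
    """
    end = datetime(today.year, today.month, 1)
    if end.month == 1:
        # January: the previous month is December of the previous year.
        start = datetime(end.year - 1, 12, 1)
    else:
        start = datetime(end.year, end.month - 1, 1)
    return start, end


start, end = previous_month_range(datetime(2020, 6, 15))
print(start.isoformat(), end.isoformat())
# prints: 2020-05-01T00:00:00 2020-06-01T00:00:00
```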
But first, we need to configure Atlas Data Lake correctly.
The first thing we need to do is to make sure we can write to our S3 bucket and read the archived orders as well as the current orders in my M10 cluster.
Now head to your Data Lake configuration:
And let's use the following configuration:
In this configuration, you can see that we have configured:
- an S3 store: this is my S3 bucket that I named "cold-data-mongodb",
- an Atlas store: this is my M10 cluster,
- a database "test" with a collection "orders" that contains the data from this S3 store AND the data from my collection test.orders from my M10 cluster.
Feel free to replace "cold-data-mongodb" with your own bucket name: the very same one that you used during the Atlas Data Lake setup.
You can find your MongoDB Atlas project ID in your project settings:
In the path, I also told Atlas Data Lake that the JSON filename contains the min and max "created" dates of the orders it contains. This is useful for performance purposes: Atlas Data Lake won't have to scan all the files if I'm looking for an order on a given date. You can read more about this in the Data Lake documentation.
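The configuration described above (one S3 store, one Atlas store, and a federated test.orders collection backed by both) can be sketched as a Python dictionary and printed as JSON for the configuration editor. This is a reconstruction under assumptions, not the original listing: the key names follow my understanding of the Atlas storage configuration format and should be checked against the current documentation, and the region, cluster name, and project ID are placeholders.

```python
import json


def build_storage_config(bucket, region, cluster_name, project_id):
    """Build a storage configuration with an S3 store and an Atlas store."""
    return {
        "stores": [
            # The S3 bucket that holds the archived orders.
            {"name": bucket, "provider": "s3", "bucket": bucket,
             "region": region, "delimiter": "/"},
            # The M10 cluster that holds the current orders.
            {"name": cluster_name, "provider": "atlas",
             "clusterName": cluster_name, "projectId": project_id},
        ],
        "databases": [{
            "name": "test",
            "collections": [{
                "name": "orders",
                "dataSources": [
                    # The file name encodes the min and max `created` dates,
                    # which lets the service skip files outside a queried range.
                    {"storeName": bucket,
                     "path": "/{min(created) isodate}-{max(created) isodate}*"},
                    # The live collection on the cluster.
                    {"storeName": cluster_name,
                     "database": "test", "collection": "orders"},
                ],
            }],
        }],
    }


config = build_storage_config(
    bucket="cold-data-mongodb",
    region="<YOUR_BUCKET_REGION>",      # placeholder
    cluster_name="<YOUR_CLUSTER_NAME>",  # placeholder
    project_id="<YOUR_PROJECT_ID>",      # placeholder
)
print(json.dumps(config, indent=2))
```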
Let's now collect the URI we are going to use to connect to Atlas Data Lake.
Click on the connect button:
Click on "Connect your application" and collect your URI:
Now let's use Python to execute our aggregation pipeline and archive the 2 orders from May 2020 in our S3 bucket.
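The archiving pipeline could look like the following sketch, assuming the pymongo driver and the $out-to-S3 stage shape as I understand it from the Atlas documentation. The bucket region and the DATA_LAKE_URI, BUCKET_NAME, and BUCKET_REGION environment variables are hypothetical names introduced for illustration. The file name is built so that it matches a "min date, dash, max date" path pattern.

```python
import os
from datetime import datetime


def iso_millis(dt):
    """Format a datetime as an ISO 8601 string with milliseconds and a Z suffix."""
    return dt.isoformat("T", "milliseconds") + "Z"


def build_archive_pipeline(start, end, bucket, region):
    """Select orders created in [start, end) and write them to one JSON file in S3."""
    return [
        {"$match": {"created": {"$gte": start, "$lt": end}}},
        {"$out": {"s3": {
            "bucket": bucket,
            "region": region,
            "filename": f"{iso_millis(start)}-{iso_millis(end)}",
            "format": {"name": "json"},
        }}},
    ]


def run_archive(uri, bucket, region):
    """Run the pipeline through the Atlas Data Lake URI (not the cluster URI)."""
    # Imported lazily so the pipeline builder works without the driver.
    from pymongo import MongoClient

    coll = MongoClient(uri)["test"]["orders"]
    pipeline = build_archive_pipeline(
        datetime(2020, 5, 1), datetime(2020, 6, 1), bucket, region
    )
    coll.aggregate(pipeline)  # the command is sent when aggregate() is called
    print("Archive created!")


# Hypothetical environment variables; nothing runs if DATA_LAKE_URI is unset.
uri = os.environ.get("DATA_LAKE_URI")
if uri:
    run_archive(uri,
                os.environ.get("BUCKET_NAME", "cold-data-mongodb"),
                os.environ.get("BUCKET_REGION", "<YOUR_BUCKET_REGION>"))
```

The $match stage reads from the federated collection, so it sees the cluster data, and the $out stage writes the matching documents to the bucket.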
To execute this code, make sure you have Python 3 and the dependencies:
And now we can confirm that our archive was created correctly in our S3 bucket:
Now that our orders are safe in S3, I can delete these 2 orders from my Atlas cluster. Let's use Python again but this time, we need to use the URI from our Atlas cluster. The Atlas Data Lake URI doesn't allow this kind of operation.
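A minimal sketch of that cleanup, assuming pymongo and a hypothetical ATLAS_CLUSTER_URI environment variable holding the cluster connection string, reuses the same date window as the archive so that exactly the archived orders are removed:

```python
import os
from datetime import datetime


def archived_filter(start, end):
    """Match the orders created in [start, end), i.e. the ones already archived."""
    return {"created": {"$gte": start, "$lt": end}}


def delete_archived(uri, start, end):
    """Delete the archived orders from test.orders on the Atlas cluster."""
    # Imported lazily so the filter helper works without the driver.
    from pymongo import MongoClient

    coll = MongoClient(uri)["test"]["orders"]
    return coll.delete_many(archived_filter(start, end)).deleted_count


# Hypothetical environment variable; this must be the cluster URI,
# not the Atlas Data Lake URI. Nothing runs if it is unset.
uri = os.environ.get("ATLAS_CLUSTER_URI")
if uri:
    deleted = delete_archived(uri, datetime(2020, 5, 1), datetime(2020, 6, 1))
    print(f"Deleted {deleted} archived orders.")
```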
Let's run this code:
Now let's double check what we have in S3. Here is the content of the S3 file I downloaded:
And here is what's left in my MongoDB Atlas cluster.
As mentioned above already, federated queries in MongoDB Atlas Data Lake allow me to retain easy access to 100% of my data. I actually already used this feature when I ran the aggregation pipeline that archived the orders to S3.
Let's verify this one last time with Python:
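As a sketch, assuming pymongo and a hypothetical DATA_LAKE_URI environment variable, a federated read is just a normal query against the Atlas Data Lake URI. The aggregation below sums a "price" field, which is an assumed field name for illustration; substitute a numeric field that exists in your own documents.

```python
import os
from pprint import pprint


def total_price_pipeline():
    """Sum the (assumed) `price` field over every order, wherever it is stored."""
    return [
        {"$group": {"_id": None, "total_price": {"$sum": "$price"}}},
        {"$project": {"_id": 0}},
    ]


def query_everything(uri):
    """Read all orders (S3 + Atlas) and compute the total through one endpoint."""
    # Imported lazily so the pipeline builder works without the driver.
    from pymongo import MongoClient

    coll = MongoClient(uri)["test"]["orders"]
    docs = list(coll.find())
    totals = list(coll.aggregate(total_price_pipeline()))
    return docs, totals


# Hypothetical environment variable holding the Atlas Data Lake URI;
# nothing runs if it is unset.
uri = os.environ.get("DATA_LAKE_URI")
if uri:
    docs, totals = query_everything(uri)
    print("All the docs from S3 + Atlas:")
    for doc in docs:
        pprint(doc)
    print("Aggregation result:")
    pprint(totals)
```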
Here is the output:
MongoDB Atlas Data Lake is now production-ready and generally available starting today.
If you have a lot of infrequently accessed data in your Atlas cluster but you still need to be able to query it and access it easily once you've archived it to S3, Atlas Data Lake and the new Federated Query feature will help you save tons of money. If you're looking for an automated way to archive your data from Atlas clusters to fully managed S3 storage, then check out our new feature built for exactly that!
Storing cold data on S3 is a lot cheaper than scaling up a MongoDB Atlas cluster that needs more RAM and storage to operate correctly only because it is full of cold data.