How to Automate Continuous Data Copying from MongoDB to S3
As of June 2022, the functionality previously known as Atlas Data Lake is now named Atlas Data Federation. Atlas Data Federation's functionality is unchanged, and you can learn more about it in the Atlas documentation. Atlas Data Lake will remain in the Atlas platform, with newly introduced functionality that is also described in the documentation.
Modern always-on applications rely on automatic failover capabilities and real-time data access. MongoDB Atlas supports these capabilities out of the box, but you might still want to copy your data into another location to run advanced analytics or to isolate your operational workload. For this reason, it can be incredibly useful to set up automatic, continuous replication of your data for that workload.
In this post, we are going to set up a way to continuously copy data from a MongoDB database into an AWS S3 bucket in the Parquet data format by using Atlas Data Lake and Atlas Triggers. We will first set up a Data Lake to consolidate a MongoDB database and our AWS S3 bucket. Next, we will set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically copy our data to our S3 bucket. Then, we will run a test to ensure that our data is being continuously copied into S3 from MongoDB. Finally, we'll cover some items you'll want to consider when building out something like this for your application.
Note: The values we use for certain parameters in this blog are for demonstration and testing purposes. If you plan on utilizing this functionality, we recommend you look at the “Production Considerations” section and adjust based on your needs.
One thing we frequently see users struggle with is getting NoSQL data into Parquet, as it is a column-oriented format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with Atlas Data Lake's $out to S3, you can now convert MongoDB data into Parquet with little effort.
In order to follow along with this tutorial yourself, you will need a MongoDB Atlas cluster with some data in it and an AWS account with an S3 bucket, along with credentials that allow Atlas to read from and write to that bucket.
The first thing you'll need to do is navigate to "Data Lake" on the left-hand side of your Atlas Dashboard and then click "Create Data Lake" or "Configure a New Data Lake."
Then, you need to go ahead and connect your S3 bucket to your Atlas Data Lake. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly, but you will need access to your credentials for AWS.
Select an AWS IAM role for Atlas.
- If you previously created a role that Atlas is already authorized to use to read and write to your S3 bucket, select that role.
Enter the S3 bucket information.
- Enter the name of your S3 bucket. I named my bucket
- Choose Read and write so that Atlas can write documents to your S3 bucket.
Assign an access policy to your AWS IAM role.
- Follow the steps in the Atlas user interface to assign an access policy to your AWS IAM role.
- Your role policy for read-only or read and write access should look similar to the following:
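For read and write access, the policy grants list, get, put, and delete permissions on your bucket. The bucket name in the ARNs below is a placeholder for your own bucket, and the exact policy Atlas generates for you may differ slightly:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:ListBucket",
        "s3:GetObject",
        "s3:GetObjectVersion",
        "s3:GetBucketLocation",
        "s3:PutObject",
        "s3:DeleteObject"
      ],
      "Resource": [
        "arn:aws:s3:::<your-bucket-name>",
        "arn:aws:s3:::<your-bucket-name>/*"
      ]
    }
  ]
}
```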
- Define the path structure for your files in the S3 bucket and click Next.
- Once you've connected your S3 bucket, we're going to create a simple data source to query the data in S3, so we can verify we've written the data to S3 at the end of this tutorial.
Now, we're going to connect our Atlas Cluster, so we can write data from it into the Parquet files on S3. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you'd like to create Data Sources from and dragging them into your Data Lake.
Now that we have all of our data sources set up in our brand new Data Lake, we can set up a Scheduled Trigger to automatically generate new documents every minute for our continuous replication demo. Triggers allow you to execute server-side logic in response to database events or according to a schedule. Atlas provides two kinds of Triggers: Database and Scheduled triggers. We will use a Scheduled trigger to ensure that these documents are automatically archived in our S3 bucket.
- Click the Atlas tab in the top navigation of your screen if you have not already navigated to Atlas.
- Click Triggers in the left-hand navigation.
- On the Overview tab of the Triggers page, click Add Trigger to open the trigger configuration page.
And our Trigger function looks like this:
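The sketch below assumes the default linked cluster service name ("mongodb-atlas") and a placeholder "Demo.sales" namespace; substitute your own database and collection names.

```javascript
exports = function () {
  // "mongodb-atlas" is the default name of the linked cluster service;
  // "Demo" and "sales" are placeholder database and collection names.
  const cluster = context.services.get("mongodb-atlas");
  const sales = cluster.db("Demo").collection("sales");

  // Insert one sample document per run; the Trigger fires every 60 seconds.
  return sales.insertOne({
    status: "In progress",
    createdAt: new Date(),
  });
};
```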
Lastly, click Run and check that your database is getting new documents inserted into it every 60 seconds.
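If you'd also like to verify from the shell, a quick check (using the placeholder names from above) is to look at the newest documents in the collection:

```javascript
// In mongosh, connected to your Atlas cluster:
const demo = db.getSiblingDB("Demo");                // placeholder database name
demo.sales.find().sort({ createdAt: -1 }).limit(5);  // newest documents first
```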
Alright, now for the fun part. We are going to create a new MongoDB Trigger that copies our MongoDB data every 60 seconds, utilizing Atlas Data Lake's $out to S3 aggregation stage. Create a new Trigger and use these configuration settings.
Your Trigger function will look something like this. But there's a lot going on, so let's break it down.
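The sketch below makes a few assumptions: a federated database connection service named "datalake" (use the name of your own linked Data Lake service), the placeholder "Demo.sales" namespace, and placeholder bucket, region, and path values.

```javascript
exports = function () {
  // "datalake" is a placeholder for the name of your linked Data Lake service;
  // "Demo.sales" is the placeholder namespace used earlier.
  const datalake = context.services.get("datalake");
  const sales = datalake.db("Demo").collection("sales");

  const pipeline = [
    // 1. Match only the documents inserted during the last 60 seconds,
    //    i.e., the new data since the Trigger last ran.
    {
      $match: {
        createdAt: {
          $gte: new Date(Date.now() - 60 * 1000),
          $lt: new Date(),
        },
      },
    },
    // 2. Write the matched documents to S3 in the Parquet format.
    {
      $out: {
        s3: {
          bucket: "your-s3-bucket-name", // placeholder bucket name
          region: "us-east-1",           // placeholder region
          filename: "demo/sales-deltas", // placeholder path within the bucket
          format: {
            name: "parquet",
            maxFileSize: "10GB",
            maxRowGroupSize: "100MB",
          },
        },
      },
    },
  ];

  // Run the pipeline; with $out, the returned result set is empty.
  return sales.aggregate(pipeline).toArray();
};
```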
- Next, we are going to create an aggregation pipeline that first queries only the MongoDB documents inserted during the last 60 seconds, i.e., the new data since the Trigger last ran.
- Then, we will utilize the $out aggregation stage to replicate the data from the previous stage into S3.
- In the format field, we specify parquet and set a maxFileSize and a maxRowGroupSize.
- maxFileSize determines the maximum size of each file (partition) written to S3, and maxRowGroupSize determines the maximum size of each row group inside a Parquet file.
- Lastly, we're going to set our S3 path so that it matches the path structure we defined for the bucket earlier, keeping the new Parquet files where our S3 data source can query them.
If all is good, you should see your new Parquet file in your S3 bucket. I've turned on the option to show object versions in the AWS console so that you can see how the file is automatically updated every 60 seconds.
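You can also verify through the simple S3 data source we created earlier: connect to your Data Lake connection string with mongosh and read a document back out of the Parquet files. The virtual database and collection names below are placeholders for whatever you configured for that data source.

```javascript
// In mongosh, connected with your Data Lake (Data Federation) connection string:
const s3data = db.getSiblingDB("s3-demo-data"); // placeholder virtual database name
s3data.sales.findOne();                         // a document read back from Parquet on S3
```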
Some of the configurations chosen above were selected to make this easy to set up and test, but if you're going to use this in production, you'll want to adjust them.
Firstly, this blog was set up with a "deltas" approach. This means that we are only copying the new documents from our collection into our Parquet files. Another approach would be to do a full snapshot, i.e., copying the entire collection into Parquet each time. The approach you take should depend on how much data is in your collection and what's required by the downstream consumer.
Secondly, regardless of how much data you're copying, you ideally want your Parquet files to be larger and to be partitioned based on how you're going to query them. The Apache Parquet documentation recommends row group sizes of 512MB to 1GB. You can go smaller depending on your requirements, but in general, larger files perform better. The other consideration is that if you plan to query this data in the Parquet format, you should partition it so that it aligns with your query pattern. If you're going to query on a date field, for instance, you might want each file to hold a single day's worth of data.
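As a sketch of what that kind of date-based partitioning could look like, the $out stage below derives the filename from each document's createdAt field, so documents from the same day end up in the same file. The bucket, region, path prefix, and field name are all placeholders.

```javascript
// A hypothetical $out stage that partitions output by day, based on each
// document's createdAt field (a placeholder; use your own date field).
const outStage = {
  $out: {
    s3: {
      bucket: "your-s3-bucket-name", // placeholder bucket name
      region: "us-east-1",           // placeholder region
      // Documents that evaluate to the same filename are written together,
      // so each day's documents land in their own Parquet file(s).
      filename: {
        $concat: [
          "sales/",
          { $dateToString: { format: "%Y-%m-%d", date: "$createdAt" } },
          "/data",
        ],
      },
      format: {
        name: "parquet",
        maxFileSize: "10GB",
        maxRowGroupSize: "512MB",
      },
    },
  },
};
```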
In this post, we walked through how to set up automated continuous replication from a MongoDB database into an AWS S3 bucket in the Parquet data format by using Atlas Data Lake and Atlas Triggers. First, we set up a new MongoDB Atlas Data Lake to consolidate a MongoDB database and our AWS S3 bucket. Then, we set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically back up these newly generated documents into our S3 bucket.
We also discussed how Parquet is a great format for your MongoDB data when you need to use column-oriented tools like Tableau for visualizations, or machine learning frameworks that use data frames. Parquet can be quickly and easily converted into pandas DataFrames in Python.