How to Automate Continuous Data Copying from MongoDB to S3
Joe Karlsson • 8 min read • Published Feb 07, 2022 • Updated Jan 23, 2024
Modern always-on applications rely on automatic failover capabilities and real-time data access. MongoDB Atlas already supports automatic backups out of the box, but you might still want to copy your data into another location to run advanced analytics on your data or isolate your operational workload. For this reason, it can be incredibly useful to set up automatic continuous replication of your data for your workload.
In this post, we are going to set up a way to continuously copy data from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Database Triggers. We will first set up a Federated Database Instance using MongoDB Atlas Data Federation to consolidate a MongoDB database and our AWS S3 bucket. Next, we will set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically copy our data to our S3 bucket. Then, we will run a test to ensure that our data is being continuously copied into S3 from MongoDB. Finally, we’ll cover some items you’ll want to consider when building out something like this for your application.
Note: The values we use for certain parameters in this blog are for demonstration and testing purposes. If you plan on utilizing this functionality, we recommend you look at the “Production Considerations” section and adjust based on your needs.
For those of you not familiar with Parquet, it's an amazing file format that does a lot of the heavy lifting to ensure blazing fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.
One thing we frequently see users struggle with is getting NoSQL data into Parquet, because Parquet is a columnar format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Federation's $out to S3, you can now convert MongoDB data into Parquet with little effort.
In order to follow along with this tutorial yourself, you will need to do the following:
- Create an AWS account with privileges to create IAM Roles and S3 Buckets (to give Data Federation access to write data to your S3 bucket). Already have an AWS account? Atlas supports paying for usage via the AWS Marketplace (AWS MP) without any upfront commitment — simply sign up for MongoDB Atlas via AWS Marketplace.
We need to set up a Federated Database Instance to copy our MongoDB data and utilize MongoDB Atlas Data Federation's $out to S3 to convert our MongoDB Data into Parquet and land it in an S3 bucket.
The first thing you'll need to do is navigate to "Data Federation" on the left-hand side of your Atlas Dashboard and then click “set up manually” in the "create new federated database" dropdown in the top right corner of the UI.
Then, you need to go ahead and connect your S3 bucket to your Federated Database Instance. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly, but you will need access to your credentials for AWS.
Note: For more information, be sure to refer to the documentation on deploying a Federated Database Instance for an S3 data store. (Be sure to give Atlas Data Federation "Read and Write" access to the bucket, so it can write the Parquet files there.)
Select an AWS IAM role for Atlas.
- If you already created a role that Atlas is authorized to use to read and write to your S3 bucket, select this role.
- If you are authorizing Atlas for an existing role or are creating a new role, be sure to refer to the documentation for how to do this.
Enter the S3 bucket information.
- Enter the name of your S3 bucket. I named my bucket mongodb-data-lake-demo.
- Choose Read and write, to be able to write documents to your S3 bucket.
Assign an access policy to your AWS IAM role.
- Follow the steps in the Atlas user interface to assign an access policy to your AWS IAM role.
- Your role policy for read-only or read and write access should look similar to the following:
- Define the path structure for your files in the S3 bucket and click Next.
- Once you've connected your S3 bucket, we're going to create a simple data source to query the data in S3, so we can verify we've written the data to S3 at the end of this tutorial.
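For reference, the role policy mentioned in the steps above could look roughly like the sketch below. The list of S3 actions is an assumption based on what a read and write policy typically needs, and `buildS3AccessPolicy` is a hypothetical helper used only for illustration. The Atlas setup wizard generates the exact policy for your bucket, so treat that as the source of truth.

```javascript
// Sketch: build an S3 access policy document for the IAM role that Atlas assumes.
// ASSUMPTION: the action list below is illustrative; copy the exact policy
// that the Atlas user interface shows for your bucket.
function buildS3AccessPolicy(bucketName, allowWrite) {
  // Actions needed to list the bucket and read objects from it.
  const actions = [
    "s3:ListBucket",
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:GetBucketLocation",
  ];
  // Writing Parquet files additionally requires write permissions.
  if (allowWrite) {
    actions.push("s3:PutObject", "s3:DeleteObject");
  }
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: actions,
        // The bucket itself and every object inside it.
        Resource: [
          `arn:aws:s3:::${bucketName}`,
          `arn:aws:s3:::${bucketName}/*`,
        ],
      },
    ],
  };
}

// Print a read and write policy for the demo bucket used in this post.
console.log(JSON.stringify(buildS3AccessPolicy("mongodb-data-lake-demo", true), null, 2));
```

Passing `false` as the second argument yields the read-only variant, which is not enough for this tutorial because $out needs to write to the bucket.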
Now, we're going to connect our Atlas Cluster, so we can write data from it into the Parquet files on S3. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you'd like to create Data Sources from and dragging them into your Federated Database Instance.
Now that we have all of our data sources set up in our brand new Federated Database Instance, we can set up a Scheduled Trigger to automatically generate new documents every minute for our continuous replication demo. Triggers allow you to execute server-side logic in response to database events or according to a schedule. Atlas provides two kinds of Triggers: Database and Scheduled triggers. We will use a Scheduled trigger to ensure that these documents are automatically archived in our S3 bucket.
- Click the Atlas tab in the top navigation of your screen if you have not already navigated to Atlas.
- Click Triggers in the left-hand navigation.
- On the Overview tab of the Triggers page, click Add Trigger to open the trigger configuration page.
And our Trigger function looks like this:
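A minimal sketch of such a function is shown below. It assumes a linked cluster service, database, and collection whose names are placeholders (`NAME_OF_YOUR_ATLAS_SERVICE`, `NAME_OF_YOUR_DATABASE`, `NAME_OF_YOUR_COLLECTION`), and the document fields (`time`, `aNumber`, `type`) are illustrative. In an Atlas Function, `context` is a global. Here it is passed in as a parameter so the logic can also be exercised outside Atlas.

```javascript
// Build one demo document. ASSUMPTION: the field names are illustrative.
function buildEvent(now = new Date()) {
  return {
    time: now,                    // insertion timestamp, used later to select new documents
    aNumber: Math.random() * 100, // arbitrary payload
    type: "event",
  };
}

// Insert one demo document into the operational collection.
// The service, database, and collection names are placeholders.
function insertEvent(context) {
  const cluster = context.services.get("NAME_OF_YOUR_ATLAS_SERVICE");
  const events = cluster
    .db("NAME_OF_YOUR_DATABASE")
    .collection("NAME_OF_YOUR_COLLECTION");
  return events.insertOne(buildEvent());
}

// Inside an Atlas Function, the entry point would be:
//   exports = function () { return insertEvent(context); };
```

Keeping a timestamp on every document is what makes the copy step later in this post possible, since that step selects documents by time.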
Lastly, click Run and check that your database is getting new documents inserted into it every 60 seconds.
Alright, now is the fun part. We are going to create a new MongoDB Trigger that copies our MongoDB data every 60 seconds utilizing MongoDB Atlas Data Federation's $out to S3 aggregation pipeline. Create a new Trigger and use these configuration settings.
Your Trigger function will look something like this. But there's a lot going on, so let's break it down.
- Next, we are going to create an aggregation pipeline that first queries the MongoDB data added in the last 60 seconds.
- Then, we will utilize the $out aggregation stage to replicate the data from our previous aggregation stage into S3.
- In the format, we're going to specify parquet and determine a maxFileSize and maxRowGroupSize.
- maxFileSize determines the maximum size of each partition (file) that is written. maxRowGroupSize determines how records are grouped into "row groups" inside the Parquet file, which, like file size, affects the performance of queries against your Parquet files.
- Lastly, we’re going to set our S3 path to match the value of the data.
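The steps above can be sketched as follows. This is an illustration under stated assumptions rather than a definitive implementation: the service, database, and collection names are placeholders, the region, filename, and size values are examples, and documents are assumed to carry a `time` field. As before, `context` is passed in as a parameter so the logic can run outside Atlas.

```javascript
// Build the two-stage pipeline: select recent documents, then write them to S3.
// ASSUMPTION: bucket, region, filename, and size limits are example values.
function buildCopyPipeline(now = new Date(), windowMs = 60 * 1000) {
  return [
    // Stage 1: documents whose `time` falls inside the last window.
    { $match: { time: { $gt: new Date(now.getTime() - windowMs), $lte: now } } },
    // Stage 2: write the matched documents to S3 as Parquet.
    {
      $out: {
        s3: {
          bucket: "mongodb-data-lake-demo",
          region: "us-east-1",
          filename: "events",
          format: {
            name: "parquet",
            maxFileSize: "10GB",
            maxRowGroupSize: "100MB",
          },
        },
      },
    },
  ];
}

// Run the pipeline against the Federated Database Instance.
// The service, database, and collection names are placeholders.
function copyToS3(context) {
  const federated = context.services.get("NAME_OF_YOUR_FEDERATED_DATA_SERVICE");
  const events = federated
    .db("NAME_OF_YOUR_FEDERATED_DATABASE")
    .collection("NAME_OF_YOUR_COLLECTION");
  // $out returns no documents; toArray() is called so the pipeline is executed.
  return events.aggregate(buildCopyPipeline()).toArray();
}

// Inside an Atlas Function, the entry point would be:
//   exports = function () { return copyToS3(context); };
```

Note that the pipeline runs against the Federated Database Instance and not directly against the cluster, because $out to S3 is a Data Federation feature.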
If all is good, you should see your new Parquet file in your S3 bucket. I've enabled the AWS GUI to show you the versions so that you can see how it is being updated every 60 seconds automatically.
Some of the configurations chosen above were done so to make it easy to set up and test, but if you’re going to use this in production, you’ll want to adjust them.
Firstly, this blog was set up with a “deltas” approach. This means that we are only copying the new documents from our collection into our Parquet files. Another approach would be to do a full snapshot, i.e., copying the entire collection into Parquet each time. The approach you’re taking should depend on how much data is in your collection and what’s required by the downstream consumer.
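To make the difference concrete, here is a small sketch in which the two approaches differ only in whether a time filter precedes the $out stage. `buildStages` is a hypothetical helper, and the `time` field and 60-second window are assumptions carried over from the demo.

```javascript
// Build the pipeline for either copy approach.
// ASSUMPTION: documents carry a `time` field; `outStage` is any valid $out stage.
function buildStages(mode, outStage, now = new Date(), windowMs = 60 * 1000) {
  if (mode === "snapshot") {
    // Full snapshot: no filter, so the entire collection is written each run.
    return [outStage];
  }
  if (mode === "deltas") {
    // Deltas: only documents from the most recent window are written.
    const since = new Date(now.getTime() - windowMs);
    return [{ $match: { time: { $gt: since, $lte: now } } }, outStage];
  }
  throw new Error(`Unknown mode: ${mode}`);
}
```

A snapshot is simpler for consumers that want a complete copy, while deltas move far less data per run as the collection grows.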
Secondly, regardless of how much data you’re copying, ideally you want Parquet files to be larger, and for them to be partitioned based on how you’re going to query. Apache recommends row group sizes of 512MB to 1GB. You can go smaller depending on your requirements, but larger files are generally preferable. The other consideration is partitioning: if you plan to query this data in the Parquet format, you should partition it so that it aligns with your query pattern. If you’re going to query on a date field, for instance, you might want each file to have a single day's worth of data.
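One way to get per-day partitioning is to make the $out filename an expression instead of a fixed string. The sketch below assumes each document has a `time` date field, and it adds a helper field named `day` (a hypothetical name) that is also written to the Parquet output. Check the $out documentation for the expressions your deployment supports before relying on this.

```javascript
// Build a pipeline that writes each day's documents under its own S3 prefix.
// ASSUMPTIONS: documents have a `time` date field; `day` is a helper field
// introduced here for illustration; size limits are example values.
function buildPartitionedPipeline(bucket, region) {
  return [
    // Derive a YYYY-MM-DD string from the timestamp.
    { $addFields: { day: { $dateToString: { format: "%Y-%m-%d", date: "$time" } } } },
    {
      $out: {
        s3: {
          bucket,
          region,
          // Evaluated per document, so documents from the same day share a prefix.
          filename: { $concat: ["events/", "$day", "/"] },
          format: {
            name: "parquet",
            maxFileSize: "10GB",
            maxRowGroupSize: "100MB",
          },
        },
      },
    },
  ];
}
```

With a layout like this, a query that filters on a single date only needs to read the files under that day's prefix.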
Lastly, depending on your needs, it may be appropriate to look into a scheduling tool other than Triggers, like Temporal or Apache Airflow.
In this post, we walked through how to set up an automated continuous replication from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Data Federation and MongoDB Atlas Database Triggers. First, we set up a new Federated Database Instance to consolidate a MongoDB database and our AWS S3 bucket. Then, we set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically back up these new automatically generated documents into our S3 bucket.
We also discussed how Parquet is a great format for your MongoDB data when you need to use columnar-oriented tools like Tableau for visualizations or Machine Learning frameworks that use Data Frames. Parquet can be quickly and easily converted into Pandas Data Frames in Python.
If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.
Additional Resources: