How to Automate Continuous Data Copying from MongoDB to S3

Joe Karlsson • 8 min read • Published Feb 07, 2022 • Updated Jan 23, 2024

Parquet AWS Atlas Data Federation


Modern always-on applications rely on automatic failover capabilities and real-time data access. MongoDB Atlas already supports automatic backups out of the box, but you might still want to copy your data into another location to run advanced analytics on your data or isolate your operational workload. For this reason, it can be incredibly useful to set up automatic continuous replication of your data for your workload.

In this post, we are going to set up a way to continuously copy data from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Database Triggers. We will first set up a Federated Database Instance using MongoDB Atlas Data Federation to consolidate a MongoDB database and our AWS S3 bucket. Next, we will set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically copy our data to our S3 bucket. Then, we will run a test to ensure that our data is being continuously copied into S3 from MongoDB. Finally, we’ll cover some items you’ll want to consider when building out something like this for your application.

Note: The values we use for certain parameters in this blog are for demonstration and testing purposes. If you plan on utilizing this functionality, we recommend you look at the “Production Considerations” section and adjust based on your needs.

What is Parquet?

For those of you not familiar with Parquet, it's a columnar file format that does a lot of the heavy lifting to ensure fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.

One thing we frequently see users struggle with is getting NoSQL data into Parquet, because Parquet is a columnar format while MongoDB stores data as documents. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Federation's $out to S3, you can now convert MongoDB Data into Parquet with little effort.
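To make the "columnar" idea a little more concrete, here is a minimal sketch of pivoting row-shaped documents into one array per field. It is only an illustration of the layout, not how Parquet actually encodes data, and the toColumns helper is a hypothetical name introduced for this example.

```javascript
// Illustrative only: pivots an array of flat documents into one array per field.
// Real Parquet adds typing, encoding, compression, and row groups on top of this idea.
function toColumns(docs) {
  // Collect every field name that appears in any document.
  const fields = [...new Set(docs.flatMap((doc) => Object.keys(doc)))];
  const columns = {};
  for (const field of fields) {
    // Documents that lack the field contribute null, so all columns stay aligned.
    columns[field] = docs.map((doc) => (field in doc ? doc[field] : null));
  }
  return columns;
}

const docs = [
  { type: "event", aNumber: 12 },
  { type: "event", aNumber: 47, note: "optional field" },
];
console.log(toColumns(docs));
```

The optional note field shows why documents are awkward to store by column: every column has to account for documents that do not carry the field at all.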

Prerequisites

In order to follow along with this tutorial yourself, you will need to do the following:

1. Create a MongoDB Atlas account, if you do not have one already.
2. Create an AWS account with privileges to create IAM Roles and S3 Buckets (to give Data Federation access to write data to your S3 bucket). Already have an AWS account? Atlas supports paying for usage via the AWS Marketplace (AWS MP) without any upfront commitment: simply sign up for MongoDB Atlas via AWS Marketplace.
3. Install the AWS CLI.
4. Configure the AWS CLI.
5. Optional: Set up unified AWS access.

Create a Federated Database Instance and Connect to S3

We need to set up a Federated Database Instance to copy our MongoDB data and utilize MongoDB Atlas Data Federation's $out to S3 to convert our MongoDB Data into Parquet and land it in an S3 bucket.

The first thing you'll need to do is navigate to "Data Federation" on the left-hand side of your Atlas Dashboard and then click “set up manually” in the "create new federated database" dropdown in the top right corner of the UI.

Then, you need to go ahead and connect your S3 bucket to your Federated Database Instance. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly, but you will need access to your credentials for AWS.

Note: For more information, be sure to refer to the documentation on deploying a Federated Database Instance for an S3 data store. Be sure to give Atlas Data Federation "Read and Write" access to the bucket, so it can write the Parquet files there.

Select an AWS IAM role for Atlas.

If you have already created a role that Atlas is authorized to use to read and write to your S3 bucket, select this role.
If you are authorizing Atlas for an existing role or are creating a new role, be sure to refer to the documentation for how to do this.

Enter the S3 bucket information.

Enter the name of your S3 bucket. I named my bucket mongodb-data-lake-demo.
Choose Read and write, to be able to write documents to your S3 bucket.

Assign an access policy to your AWS IAM role.

Follow the steps in the Atlas user interface to assign an access policy to your AWS IAM role.
Your role policy should look similar to the following. This example lists only the read actions; for read and write access, the policy also needs write actions such as s3:PutObject.

{
   "Version": "2012-10-17",
   "Statement": [
      {
            "Effect": "Allow",
            "Action": [
               "s3:ListBucket",
               "s3:GetObject",
               "s3:GetObjectVersion",
               "s3:GetBucketLocation"
            ],
            "Resource": [
               <role arn>
            ]
      }
   ]
}

Define the path structure for your files in the S3 bucket and click Next.
Once you've connected your S3 bucket, we're going to create a simple data source to query the data in S3, so we can verify we've written the data to S3 at the end of this tutorial.
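For the access policy step above, the following is a rough sketch of how the read-only and read and write variants of the policy differ. The buildBucketPolicy helper is hypothetical, and both the write actions (s3:PutObject, s3:DeleteObject) and the bucket-ARN form of the Resource entries are assumptions of this sketch; the policy that the Atlas UI generates for your role is the one to trust.

```javascript
// Builds an IAM policy document for a bucket. The write actions and the
// bucket-ARN resources are assumptions for illustration; prefer the policy
// that the Atlas UI generates for your role.
function buildBucketPolicy(bucketName, { readWrite = false } = {}) {
  const actions = [
    "s3:ListBucket",
    "s3:GetObject",
    "s3:GetObjectVersion",
    "s3:GetBucketLocation",
  ];
  if (readWrite) {
    // Assumed write actions that $out needs to create and replace objects.
    actions.push("s3:PutObject", "s3:DeleteObject");
  }
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: actions,
        Resource: [
          `arn:aws:s3:::${bucketName}`, // the bucket itself
          `arn:aws:s3:::${bucketName}/*`, // every object in the bucket
        ],
      },
    ],
  };
}

console.log(JSON.stringify(buildBucketPolicy("mongodb-data-lake-demo", { readWrite: true }), null, 2));
```

If the Trigger that writes to S3 fails with an access error later on, a read-only policy like the one shown earlier is a likely place to look first.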

Connect Your MongoDB Database to Your Federated Database Instance

Now, we're going to connect our Atlas Cluster, so we can write data from it into the Parquet files on S3. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you'd like to create Data Sources from and dragging them into your Federated Database Instance.
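Behind the drag-and-drop UI, a Federated Database Instance is described by a storage configuration that maps stores (the S3 bucket and the Atlas cluster) to virtual databases and collections. The sketch below shows roughly what that mapping could look like for this tutorial. Every name in it (s3Store, atlasStore, Cluster0, VirtualDatabase0, demo, eventsParquet) is a placeholder invented for this example, and the field layout is reproduced from memory, so compare it against the configuration Atlas shows for your own instance.

```javascript
// Hypothetical names throughout; the field layout follows the general shape of a
// Data Federation storage configuration (stores plus virtual databases).
const storageConfig = {
  stores: [
    { name: "s3Store", provider: "s3", bucket: "mongodb-data-lake-demo", region: "us-east-1", delimiter: "/" },
    { name: "atlasStore", provider: "atlas", clusterName: "Cluster0", projectId: "<your project id>" },
  ],
  databases: [
    {
      name: "VirtualDatabase0",
      collections: [
        { name: "events", dataSources: [{ storeName: "atlasStore", database: "demo", collection: "events" }] },
        { name: "eventsParquet", dataSources: [{ storeName: "s3Store", path: "/events*" }] },
      ],
    },
  ],
};

// Returns the storeName values that no entry in `stores` defines.
function findUnknownStores(config) {
  const known = new Set(config.stores.map((store) => store.name));
  return config.databases
    .flatMap((db) => db.collections)
    .flatMap((coll) => coll.dataSources)
    .map((source) => source.storeName)
    .filter((name) => !known.has(name));
}

console.log(findUnknownStores(storageConfig)); // an empty array means every data source resolves
```

The small check at the end is only a convenience for catching a mistyped store name before you paste a configuration anywhere.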

Create a MongoDB Atlas Trigger to Create a New Document Every Minute

Now that we have all of our data sources set up in our brand new Federated Database Instance, we can set up a MongoDB Trigger to automatically generate new documents every minute for our continuous replication demo. Triggers allow you to execute server-side logic in response to database events or according to a schedule. Atlas provides two kinds of Triggers: Database and Scheduled Triggers. We will use a Scheduled Trigger here to generate the documents, and another one later to automatically copy them into our S3 bucket.

Click the Atlas tab in the top navigation of your screen if you have not already navigated to Atlas.
Click Triggers in the left-hand navigation.
On the Overview tab of the Triggers page, click Add Trigger to open the trigger configuration page.
Enter these configuration values for our trigger:

And our Trigger function looks like this:

exports = async function () {

   // Connect to the linked Atlas cluster and select the target collection.
   const mongodb = context.services.get("NAME_OF_YOUR_ATLAS_SERVICE");
   const db = mongodb.db("NAME_OF_YOUR_DATABASE");
   const events = db.collection("NAME_OF_YOUR_COLLECTION");

   // insertOne returns a Promise, so wait for it before reading the result.
   const event = await events.insertOne(
      {
            time: new Date(),
            aNumber: Math.random() * 100,
            type: "event"
      }
   );

   return JSON.stringify(event);

};

Lastly, click Run and check that your database is getting new documents inserted into it every 60 seconds.
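If you would like to exercise the insert logic outside of Atlas before wiring it into a Trigger, one option is a small stand-in for the context object. The sketch below is exactly that and nothing more: makeFakeContext and insertEvent are hypothetical helpers written for this example, and the fake only imitates the three calls the function uses, not the real Atlas Functions runtime.

```javascript
// A hypothetical in-memory stand-in for the Atlas `context` object, just enough
// to exercise the insert logic locally. It is not the Atlas Functions runtime.
function makeFakeContext() {
  const stored = [];
  const collection = {
    insertOne: async (doc) => {
      stored.push(doc);
      return { insertedId: stored.length };
    },
  };
  return {
    stored,
    services: { get: () => ({ db: () => ({ collection: () => collection }) }) },
  };
}

// Same logic as the Trigger function, with `context` passed in so it can be swapped.
async function insertEvent(context) {
  const events = context.services
    .get("NAME_OF_YOUR_ATLAS_SERVICE")
    .db("NAME_OF_YOUR_DATABASE")
    .collection("NAME_OF_YOUR_COLLECTION");
  return events.insertOne({ time: new Date(), aNumber: Math.random() * 100, type: "event" });
}

const fake = makeFakeContext();
insertEvent(fake).then(() => console.log(fake.stored));
```

Passing context in as an argument is a choice made for this sketch; inside Atlas, context is a global and the function takes no arguments.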

Create a MongoDB Atlas Trigger to Copy New MongoDB Data into S3 Every Minute

Alright, now is the fun part. We are going to create a new MongoDB Trigger that copies our MongoDB data every 60 seconds utilizing MongoDB Atlas Data Federation's $out to S3 aggregation pipeline. Create a new Trigger and use these configuration settings.

Your Trigger function will look something like this. But there's a lot going on, so let's break it down.

First, we are going to connect to our new Federated Database Instance. This is different from the previous Trigger, which connected to our Atlas database. Be sure to pass the name of your Federated Database Instance service to context.services.get, and your virtual database and collection names to db and collection. You must connect to your Federated Database Instance to use $out to S3.
Next, we are going to create an aggregation pipeline that first queries our MongoDB data from the last 60 seconds.
Then, we will utilize the $out aggregation stage to replicate the data from our previous aggregation stage into S3.
In the format, we're going to specify parquet and determine a maxFileSize and maxRowGroupSize.
- maxFileSize is going to determine the maximum size each partition will be.
- maxRowGroupSize is going to determine how records are grouped inside of the Parquet file in "row groups", which will impact performance querying your Parquet files similarly to file size.
Lastly, we’re going to set filename, which determines the S3 path our Parquet files are written to.

exports = function () {

   // Connect to the Federated Database Instance, not the Atlas cluster directly.
   const service = context.services.get("NAME_OF_YOUR_FEDERATED_DATA_SERVICE");
   const db = service.db("NAME_OF_YOUR_VIRTUAL_DATABASE");
   const events = db.collection("NAME_OF_YOUR_VIRTUAL_COLLECTION");

   // Take one timestamp so both bounds of the window use the same "now".
   const now = Date.now();

   const pipeline = [
      {
            $match: {
               "time": {
                  // Documents from the last 60 seconds, matching the Trigger schedule.
                  $gt: new Date(now - 60 * 1000),
                  $lt: new Date(now)
               }
            }
      }, {
            "$out": {
               "s3": {
                  "bucket": "mongodb-federated-data-demo", // replace with your bucket name
                  "region": "us-east-1",
                  "filename": "events",
                  "format": {
                        "name": "parquet",
                        "maxFileSize": "10GB",
                        "maxRowGroupSize": "100MB"
                  }
               }
            }
      }
   ];

   return events.aggregate(pipeline);
};
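One way to make the copy window easier to reason about is to build the pipeline in a plain function that takes the current time and the window length as arguments. This is a sketch under that assumption: buildCopyPipeline and its windowMs, bucket, and region parameters are names introduced here, not Atlas settings, and the $out options simply mirror the ones used in the Trigger function.

```javascript
// Builds the $match + $out pipeline for one copy window.
// `windowMs`, `bucket`, and `region` are parameters of this sketch, not Atlas settings.
function buildCopyPipeline({ now, windowMs, bucket, region }) {
  const end = new Date(now);
  const start = new Date(end.getTime() - windowMs);
  return [
    { $match: { time: { $gt: start, $lt: end } } },
    {
      $out: {
        s3: {
          bucket,
          region,
          filename: "events",
          format: { name: "parquet", maxFileSize: "10GB", maxRowGroupSize: "100MB" },
        },
      },
    },
  ];
}

const pipeline = buildCopyPipeline({
  now: Date.now(),
  windowMs: 60 * 1000, // one minute, matching the Trigger schedule
  bucket: "mongodb-federated-data-demo",
  region: "us-east-1",
});
console.log(JSON.stringify(pipeline, null, 2));
```

Inside the Trigger, the returned array could be handed to events.aggregate unchanged. Keeping the window length in one named parameter makes it harder for the schedule and the query to drift apart.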

If all is good, you should see your new Parquet file in your S3 bucket. I've enabled the AWS GUI to show you the versions so that you can see how it is being updated every 60 seconds automatically.

Production Considerations

Some of the configurations chosen above were done so to make it easy to set up and test, but if you’re going to use this in production, you’ll want to adjust them.

Firstly, this blog was set up with a “deltas” approach. This means that we are only copying the new documents from our collection into our Parquet files. Another approach would be to do a full snapshot, i.e., copying the entire collection into Parquet each time. The approach you’re taking should depend on how much data is in your collection and what’s required by the downstream consumer.
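With a deltas approach, it matters that consecutive runs neither skip nor repeat documents. One possible way to get there, sketched below, is to align each window to a fixed boundary and make it half-open, so a document whose timestamp lands exactly on a boundary belongs to exactly one window. The previousWindow and deltaMatchStage helpers are hypothetical, and the sketch assumes each window is processed exactly once and that documents are not inserted with timestamps in the past.

```javascript
// Returns the most recent complete window [start, end) aligned to `windowMs`.
function previousWindow(nowMs, windowMs) {
  const end = Math.floor(nowMs / windowMs) * windowMs; // round down to the boundary
  return { start: new Date(end - windowMs), end: new Date(end) };
}

// Half-open match: $gte on the start and $lt on the end, so a document sitting
// exactly on a boundary is copied by exactly one run.
function deltaMatchStage(nowMs, windowMs) {
  const { start, end } = previousWindow(nowMs, windowMs);
  return { $match: { time: { $gte: start, $lt: end } } };
}

const stage = deltaMatchStage(Date.parse("2024-01-23T10:05:42Z"), 60 * 1000);
console.log(stage.$match.time.$gte.toISOString()); // 2024-01-23T10:04:00.000Z
console.log(stage.$match.time.$lt.toISOString()); // 2024-01-23T10:05:00.000Z
```

A window measured backwards from whenever the Trigger happens to fire will shift slightly from run to run, which is where gaps and overlaps tend to come from.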

Secondly, regardless of how much data you’re copying, ideally you want Parquet files to be larger, and for them to be partitioned based on how you’re going to query. Apache recommends row group sizes of 512MB to 1GB. You can go smaller depending on your requirements, but as you can see, you want larger files. The other consideration is that if you plan to query this data in the Parquet format, you should partition it so that it aligns with your query pattern. If you’re going to query on a date field, for instance, you might want each file to have a single day's worth of data.
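As an illustration of date-based partitioning, the sketch below derives a day-level prefix from a date. The dailyPrefix helper and the events/YYYY/MM/DD/ layout are assumptions made for this example, not a layout that Atlas or Parquet requires.

```javascript
// Builds a day-level S3 prefix such as "events/2024/01/23/" from a UTC date.
// The "events/YYYY/MM/DD/" layout is a hypothetical convention for this sketch.
function dailyPrefix(baseName, date) {
  const year = String(date.getUTCFullYear());
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  const day = String(date.getUTCDate()).padStart(2, "0");
  return `${baseName}/${year}/${month}/${day}/`;
}

console.log(dailyPrefix("events", new Date("2024-01-23T10:05:42Z"))); // events/2024/01/23/
```

A value like this could be used as the filename in the $out stage so that each day's files land under their own prefix; check the $out to S3 documentation for how a trailing delimiter in filename is handled before relying on it.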

Lastly, depending on your needs, it may be appropriate to look into an alternative scheduler to Triggers, like Temporal or Apache Airflow.

Wrap Up

In this post, we walked through how to set up an automated continuous replication from a MongoDB database into an AWS S3 bucket in the Parquet data format by using MongoDB Atlas Data Federation and MongoDB Atlas Database Triggers. First, we set up a new Federated Database Instance to consolidate a MongoDB database and our AWS S3 bucket. Then, we set up a Trigger to automatically add a new document to a collection every minute, and another Trigger to automatically back up these new automatically generated documents into our S3 bucket.

We also discussed how Parquet is a great format for your MongoDB data when you need to use columnar-oriented tools like Tableau for visualizations or Machine Learning frameworks that use Data Frames. Parquet can be quickly and easily converted into Pandas Data Frames in Python.

If you have questions, please head to our developer community website where the MongoDB engineers and the MongoDB community will help you build your next big idea with MongoDB.

