How to Get MongoDB Data into Parquet in 10 Seconds or Less

Benjamin Flast5 min read • Published Jun 11, 2021 • Updated Jun 28, 2023

Parquet Atlas Data Federation

Rate this tutorial

For those of you not familiar with Parquet, it’s an amazing file format that does a lot of the heavy lifting to ensure blazing fast query performance on data stored in files. This is a popular file format in the Data Warehouse and Data Lake space as well as for a variety of machine learning tasks.

One thing we frequently see users struggle with is getting NoSQL data into Parquet as it is a columnar format. Historically, you would have to write some custom code to get the data out of the database, transform it into an appropriate structure, and then probably utilize a third-party library to write it to Parquet. Fortunately, with MongoDB Atlas Data Federation's $out to cloud object storage - Amazon S3 or Microsoft Azure Blob Storage, you can now convert MongoDB Data into Parquet with little effort.

In this blog post, I’m going to walk you through the steps necessary to write data from your Atlas Cluster directly to cloud object storage in the Parquet format and then finish up by reviewing some things to keep in mind when using Parquet with NoSQL data. I’m going to use a sample data set that contains taxi ride data from New York City.

Prerequisites

In order to follow along with this tutorial yourself, you will need the following: An Atlas cluster with some data in it. (It can be the sample data.) An AWS account with privileges to create IAM Roles and cloud object storage buckets (to give us access to write data to your cloud object storage bucket).

Create a Federated Database Instance and Connect to cloud object storage

The first thing you'll need to do is navigate to the "Data Federation" tab on the left hand side of your Atlas Dashboard and then click “set up manually” in the "create new federated database" dropdown in the top right corner of the UI.

Then, you need to connect your cloud object storage bucket to your Federated Database Instance. This is where we will write the Parquet files. The setup wizard should guide you through this pretty quickly but you will need access to your credentials for AWS. (Be sure to give Atlas Data Federation “Read and Write” access to the bucket so it can write the Parquet files there.)

Once you’ve connected your cloud object storage bucket, we’re going to create a simple data source to query the data in cloud object storage so we can verify we’ve written the data to cloud object storage at the end of this tutorial. Our new setup tool makes it easier than ever to configure your Federated Database Instance to take advantage of the partitioning of data in cloud object storage. Partitioning allows us to only select the relevant data to process in order to satisfy your query. (I’ve put a sample file in there for this test that will fit how we’re going to partition the data by _cab_type).

1 mongoimport --uri mongodb+srv://<USERNAME>:<PASSWORD>@<MONGODB_URI>/<DATABASE> --collection <COLLECTION> --type json --file <FILENAME>

Connect Your Federated Database Instance to an Atlas Cluster

Now we’re going to connect our Atlas cluster, so we can write data from it into the Parquet files. This involves picking the cluster from a list of clusters in your Atlas project and then selecting the databases and collections you’d like to create Data Sources from and dragging them into your Federated Database Instance.

$out to cloud object storage in Parquet

Now we’re going to connect to our Federated Database Instance using the mongo shell and execute the following command. This is going to do quite a few things, so I’m going to explain the important ones.

First, you can use the ‘filename’ field of the $out stage to have your Federated Database Instance partition files by “_cab_type”, so all the green cabs will go in one set of files and all the yellow cabs will go in another.
Then in the format, we’re going to specify parquet and determine a maxFileSize and maxRowGroupSize. -- maxFileSize is going to determine the maximum size each partition will be. -- maxRowGroupSize is going to determine how records are grouped inside of the Parquet file in “row groups” which will impact performance querying your Parquet files, similarly to file size.
Lastly, we’re using a special Atlas Data Federation aggregation “background: true” which simply tells the Federated Database Instance to keep executing the query even if the client disconnects. (This is handy for long running queries or environments where your network connection is not stable.)

1 db.getSiblingDB("clusterData").getCollection("trips").aggregate([
2     {
3         "$out" : {
4             "s3" : {
5                 "bucket" : "ben.flast",
6                 "region" : "us-east-1",
7                 "filename" : {
8                     "$concat" : [
9                         "taxi-trips/",
10                         "$_cab_type",
11                         "/"
12                     ]
13                 },
14                 "format" : {
15                     "name" : "parquet",
16                     "maxFileSize" : "10GB",
17                     "maxRowGroupSize" : "100MB"
18                 }
19             }
20         }
21     }
22 ], {
23     background: true
24 })

Blazing Fast Queries on Parquet Files

Now, to give you some idea of the potential performance improvements for Object Store Data you can see, I’ve written three sets of data, each with 10 million documents: one in Parquet, one in uncompressed JSON, and another in compressed JSON. And I ran a count command on each of them with the following results.

db.trips.count() 10,000,000

Type	Data Size (GB)	Count Command Latency (Seconds)
JSON (Uncompressed)	~16.1	297.182
JSON (Compressed)	~1.1	78.070
Parquet	~1.02	1.596

In Review

So, what have we done and what have we learned?

We saw how quickly and easily you can create a Federated Database Instance in MongoDB Atlas.
We connected an Atlas cluster to our Federated Database Instance.
We used our Federated Database Instance to write Atlas cluster data to cloud object storage in Parquet format.
We demonstrated how fast and space-efficient Parquet is when compared to JSON.

A Couple of Things to Remember About Atlas Data Federation

Parquet is a super fast columnar format that can be read and written with Atlas Data Federation.
Atlas Data Federation takes advantage of various pieces of metadata contained in Parquet files, not just the maxRowGroupSize. For instance, if your first stage in an aggregation pipeline was $project: {fieldA: 1, filedB: 1}, we would only read the two columns from the Parquet file which results in faster performance and lower costs as we are scanning less data.
Atlas Data Federation writes Parquet files flexibly so if you have polymorphic data, we will create union columns so you can have ‘Column A - String’ and ‘Column A - Int’. Atlas Data Federation will read union columns back in as one field but other tools may not handle union types. So if you’re going to be using these Parquet files with other tools, you should transform your data before the $out stage to ensure no union columns.
Atlas Data Federation will also write files with different schemas if it encounters data with varying schemas throughout the aggregation. It can handle different schemas across files in one collection, but other tools may require a consistent schema across files. So if you’re going to be using these Parquet files with other tools, you should do a $project with $convert’s before the $out stage to ensure a consistent schema across generated files.
Parquet is a great format for your MongoDB data when you need to use columnar oriented tools like Tableau for visualizations or machine learning frameworks that use data frames. Parquet can be quickly and easily converted into Pandas data frames in Python.

Rate this tutorial

Tutorial

Build an Image Search Engine With Python & MongoDB

Sep 18, 2024 | 8 min read

Tutorial

Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata

Dec 04, 2023 | 12 min read

Tutorial

Leveraging MongoDB Atlas Vector Search With LangChain

Sep 18, 2024 | 6 min read

Tutorial

Build Smart Applications With Atlas Vector Search and Google Vertex AI

Sep 18, 2024 | 4 min read

Prerequisites
Create a Federated Database Instance and Connect to cloud object storage
Connect Your Federated Database Instance to an Atlas Cluster
$out to cloud object storage in Parquet
Blazing Fast Queries on Parquet Files
In Review
A Couple of Things to Remember About Atlas Data Federation

Atlas

How to Get MongoDB Data into Parquet in 10 Seconds or Less

Prerequisites

Create a Federated Database Instance and Connect to cloud object storage

Connect Your Federated Database Instance to an Atlas Cluster

$out to cloud object storage in Parquet

Blazing Fast Queries on Parquet Files

In Review

A Couple of Things to Remember About Atlas Data Federation

Related

Build an Image Search Engine With Python & MongoDB

Enhancing LLM Accuracy Using MongoDB Vector Search and Unstructured.io Metadata

Leveraging MongoDB Atlas Vector Search With LangChain

Build Smart Applications With Atlas Vector Search and Google Vertex AI

Table of Contents

1	db.getSiblingDB("clusterData").getCollection("trips").aggregate([
2	{
3	"$out" : {
4	"s3" : {
5	"bucket" : "ben.flast",
6	"region" : "us-east-1",
7	"filename" : {
8	"$concat" : [
9	"taxi-trips/",
10	"$_cab_type",
11	"/"
12	]
13	},
14	"format" : {
15	"name" : "parquet",
16	"maxFileSize" : "10GB",
17	"maxRowGroupSize" : "100MB"
18	}
19	}
20	}
21	}
22	], {
23	background: true
24	})