UDF Announcement for MongoDB to BigQuery Dataflow Templates
Venkatesh Shanbhag, Dianna Annie Jenosh · Published Jan 26, 2023 • Updated Jan 26, 2023
Many enterprise customers who use MongoDB Atlas as their core operational database also use BigQuery for their batch and AI/ML-based analytics, making seamless data transfer between the two pivotal. Since the announcement of the Dataflow templates for moving data between MongoDB and BigQuery in October 2022, we have seen a lot of interest from customers, as the templates make append-only, one-to-one data migration effortless. While the three Dataflow templates cater to most of the common use cases, customers also asked for the ability to apply transformations as part of these templates.
Let’s take a quick look at how to set up a sample UDF that processes an input document (transforms a field, flattens an embedded document, and deletes a field) before the processed data is written to BigQuery.
First, set up a MongoDB Atlas cluster with sample data:
1: Set up MongoDB Atlas through registration, or
2: Set up MongoDB Atlas through the GCP Marketplace. (MongoDB Atlas is available pay-as-you-go on the Google Cloud Marketplace.)
3: Create your MongoDB cluster.
4: Click on Browse Collections and click on +Create Database.
5: Name your database Sample_Company and your collection Sample_Employee.
6: Click on INSERT DOCUMENT, then copy and paste the document shown below and click on Insert.
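Any employee document with an embedded Address subdocument that contains a City field will work with the UDF we create later; here is an illustrative example (the field values are made up):

```json
{
  "Name": "Jane Doe",
  "Address": {
    "City": "New Delhi",
    "Country": "India"
  }
}
```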
7: To have authenticated access to the MongoDB sandbox cluster from the Google Cloud console, we need to create a database user.
Click on Database Access in the left pane of the Atlas dashboard. Choose Add New Database User using the green button, and enter the username appUser123 along with a password of your choice. We will use built-in roles: click Add Default Privileges and, in the Default Privileges section, add the readWriteAnyDatabase role. Then press the green Add User button to create the user.
8: Whitelist the IPs.
For the purpose of this demo, we will allow access from any IP, i.e., 0.0.0.0/0. However, this is not recommended for a production setup, where the recommendation is to use VPC peering and private IPs.
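If you prefer a terminal, steps 7 and 8 can also be done with the Atlas CLI. The commands below are a sketch: the role and flag names reflect the Atlas CLI at the time of writing, and the password is a placeholder you should replace.

```sh
# Create the database user with the readWriteAnyDatabase built-in role.
atlas dbusers create readWriteAnyDatabase --username appUser123 --password <YOUR_PASSWORD>

# Open the cluster to all IPs (demo only; never do this in production).
atlas accessLists create 0.0.0.0/0 --type cidrBlock --comment "demo access"
```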
Next, set up the Google Cloud resources:
1: Create a Cloud Storage bucket.
2: Create a JavaScript file containing the UDF. This function reads each document emitted by the Apache Beam MongoDB IO connector, flattens the embedded Address.City field to a top-level City field, deletes the Address field, and returns the updated document.
3: Upload the JavaScript file to the Cloud Storage bucket. (Sketches of the UDF and of the bucket and upload commands follow this step.)
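A minimal sketch of such a UDF follows. The file name transform.js and the function name transform are our choices; the template lets you configure both. The template passes each MongoDB document to the function as a JSON string and writes out the JSON string the function returns.

```javascript
function transform(inJson) {
  var doc = JSON.parse(inJson);

  // Flatten the embedded Address.City field into a top-level City field.
  if (doc.Address && doc.Address.City) {
    doc.City = doc.Address.City;
  }

  // Delete the Address field.
  delete doc.Address;

  return JSON.stringify(doc);
}
```

To create the bucket and upload the file from a terminal, something like the following works (the bucket name and region are illustrative):

```sh
# Create the Cloud Storage bucket in your preferred region.
gsutil mb -l us-central1 gs://my-mongodb-udf-bucket

# Upload the UDF file.
gsutil cp transform.js gs://my-mongodb-udf-bucket/transform.js
```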
4: Create a BigQuery dataset in your project, in a region close to your physical location (a CLI alternative is sketched below).
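If you would rather use the bq command-line tool, here is a sketch; the dataset name and location are illustrative and must match what you use in the Dataflow job:

```sh
# Create the destination dataset for the Dataflow job.
bq --location=us-central1 mk --dataset sample_dataset
```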
5: Create a Dataflow pipeline.
a. Click on the CREATE JOB FROM TEMPLATE button at the top.
b. Job name: mongodb-udf.
c. Region: same as your BigQuery dataset region.
d. Dataflow template: select the MongoDB to BigQuery template.
e. MongoDB connection URI: copy the connection URI for connecting applications from MongoDB Atlas.
f. MongoDB database: Sample_Company.
g. MongoDB collection: Sample_Employee.
h. BigQuery destination table: copy the destination table link from the BigQuery dataset details page, in the format bigquery-project:sample_dataset.sample_company.
i. User Option: FLATTEN.
j. Click on Show optional parameters, then set the Cloud Storage path of your UDF file (gs://my-mongodb-udf-bucket/transform.js in our sketch) and the name of the UDF function (transform).
6: Click on RUN JOB to start the pipeline. Once the pipeline finishes running, each stage of the job graph should show Succeeded.
7: After the job completes, you will see the transformed documents inserted into BigQuery.
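The same job can also be launched from a terminal with gcloud. This is a sketch: the project, region, bucket path, and connection URI are placeholders, and the parameter names (mongoDbUri, outputTableSpec, javascriptDocumentTransformGcsPath, javascriptDocumentTransformFunctionName) should be verified against the current MongoDB to BigQuery template documentation.

```sh
gcloud dataflow flex-template run mongodb-udf \
  --project=<YOUR_PROJECT> \
  --region=us-central1 \
  --template-file-gcs-location=gs://dataflow-templates-us-central1/latest/flex/MongoDB_to_BigQuery \
  --parameters=mongoDbUri=<YOUR_CONNECTION_URI>,database=Sample_Company,collection=Sample_Employee,outputTableSpec=<YOUR_PROJECT>:sample_dataset.sample_company,userOption=FLATTEN,javascriptDocumentTransformGcsPath=gs://my-mongodb-udf-bucket/transform.js,javascriptDocumentTransformFunctionName=transform
```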
Further reading:
- A data pipeline for MongoDB Atlas and BigQuery using Dataflow.
- A data pipeline for MongoDB Atlas and BigQuery using the Confluent connector.
- Run analytics in BigQuery using BigQuery ML.
- Set up your first MongoDB cluster using Google Marketplace.