UDF Announcement for MongoDB to BigQuery Dataflow Templates

Venkatesh Shanbhag, Diana Annie Jenosh • 4 min read • Published Jan 26, 2023 • Updated Apr 02, 2024
AI • Atlas • JavaScript
Many enterprise customers who use MongoDB Atlas as their core operational database also use BigQuery for batch and AI/ML-based analytics, which makes seamless data transfer between the two essential. Since the Dataflow templates for moving data between MongoDB and BigQuery were announced in October 2022, we have seen a lot of interest from customers, because the templates make an append-only, one-to-one migration of data effortless. Although the three Dataflow templates cover most common use cases, customers also asked for the ability to apply transformations as part of these templates.
We are excited to announce that you can now write your own user-defined functions (UDFs) for these Dataflow pipelines! This new feature lets you use JavaScript UDFs to transform data before it is written to BigQuery. With UDFs, you can define custom logic and business rules that are applied to your data as it is being processed by Dataflow. This allows you to perform transformations such as modifying fields, concatenating fields, deleting fields, and converting embedded documents to separate documents. A UDF takes an unprocessed document as its input parameter and returns the processed document as its output.
To use UDFs with the MongoDB to BigQuery Dataflow templates, write your JavaScript function and store it in a Google Cloud Storage bucket. Then use the templates’ optional parameters to point to the UDF when you run a template. The function is executed on the data as it is being processed, allowing you to apply custom logic and transformations to your data during the transfer.
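As a rough illustration of the kinds of transformations listed above, a UDF could look like the sketch below. The function name cleanEmployee and the field names (firstName, lastName, fullName, email, internalNotes) are hypothetical. The sketch also assumes that the template passes each document to the function as an object whose fields can be read and assigned with bracket notation, so check the template documentation for the exact contract before relying on it.

```javascript
// Hypothetical UDF sketch: the function and field names are illustrative only.
// Assumption: the document arrives as an object-like value that supports
// bracket access, and the returned value is what gets written to BigQuery.
function cleanEmployee(inputDoc) {
  var outputDoc = inputDoc;

  // Concatenate two fields into a new field.
  outputDoc["fullName"] = inputDoc["firstName"] + " " + inputDoc["lastName"];

  // Transform a field in place (normalize the email address to lower case).
  outputDoc["email"] = String(inputDoc["email"]).toLowerCase();

  // Delete a field that should not reach BigQuery.
  delete outputDoc["internalNotes"];

  return outputDoc;
}

// Local check with made-up values, run with Node.js outside of Dataflow.
var cleaned = cleanEmployee({
  firstName: "Ada",
  lastName: "Lovelace",
  email: "ADA@Example.com",
  internalNotes: "do not export"
});
console.log(JSON.stringify(cleaned));
```

Running the function locally against a few representative documents is a cheap way to catch mistakes before uploading the file to the bucket.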

How to set it up

Let’s take a quick look at how to set up a sample UDF that processes an input document (transforms a field, flattens an embedded document, and deletes a field) before the processed data is written to BigQuery.

Set up MongoDB

  1. MongoDB Atlas setup through registration.
  2. MongoDB Atlas setup through GCP Marketplace. (MongoDB Atlas is available on a pay-as-you-go basis in the Google Cloud Marketplace.)
  3. Create your MongoDB cluster.
  4. Click on Browse collections and click on +Create Database.
MongoDB Atlas cluster landing page
  5. Name your database Sample_Company and collection Sample_Employee.
MongoDB Atlas create collection step
  6. Click on INSERT DOCUMENT. Copy and paste the sample document and click on Insert.
Insert document to MongoDB collection from Atlas UI
  7. To allow authenticated access to the MongoDB sandbox cluster from the Google Cloud console, create a database user.
Click on Database Access in the left pane of the Atlas dashboard.
Click the green Add New User button on the left. Enter the username appUser and password appUser123. We will use built-in roles: click Add Default Privileges and, in the Default Privileges section, add the readWriteAnyDatabase role. Then press the green Add User button to create the user.
Create a database user on MongoDB Atlas
  8. Whitelist the IPs.
For the purpose of this demo, we will allow access from any IP address, i.e., 0.0.0.0/0. However, this is not recommended for a production setup, where the recommendation is to use VPC peering and private IPs.
IP whitelisting on MongoDB Atlas to allow external access
Whitelist 0.0.0.0/0
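The sample document referred to in step 6 is not reproduced above. A hypothetical stand-in is sketched below as a JavaScript object. Only the embedded Address document with a City field is implied by the rest of this tutorial; every other field name and value is an assumption made for illustration. The printed JSON can be pasted into the Atlas insert dialog.

```javascript
// Hypothetical sample employee document. All values are made up, and only
// the Address.City field is implied by the tutorial text.
var sampleEmployee = {
  Name: "Jane Doe",
  Department: "Engineering",
  Address: {
    Phone: "123455",
    City: "London"
  }
};

// Print the document as JSON so it can be copied into the Atlas UI.
console.log(JSON.stringify(sampleEmployee, null, 2));
```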

Set up Google Cloud

  1. Create a Cloud Storage bucket.
  2. On your local machine, create a JavaScript file named transform.js and add the sample UDF code.
This function receives each document read from MongoDB by the Apache Beam MongoDB IO connector, flattens the embedded field Address/City into a top-level City field, deletes the Address field, and returns the updated document.
  3. Upload the JavaScript file to the Google Cloud Storage bucket.
  4. Create a BigQuery dataset in your project, in a region close to your physical location.
  5. Create a Dataflow pipeline.
a. Click on the Create Job from Template button at the top.
b. Job Name: mongodb-udf.
c. Region: Same as your BigQuery dataset region.
d. MongoDB connection URI: Copy the connection URI for connecting applications from MongoDB Atlas.
e. MongoDB Database: Sample_Company.
f. MongoDB Collection: Sample_Employee.
g. BigQuery Destination Table: Copy the destination table link from the BigQuery dataset details page, in the format bigquery-project:sample_dataset.sample_company.
h. User Option: FLATTEN.
i. Click on Show Optional Parameters.
j. Cloud Storage location of your JavaScript UDF: Browse to the UDF file you uploaded to the bucket. This is the new feature that runs the UDF and applies the transformations before the data is inserted into BigQuery.
k. Name of your JavaScript function: transform.
Create Dataflow Job with UDF on Google Cloud console
  6. Click on RUN JOB to start running the pipeline. Once the pipeline finishes running, your graph should show Succeeded on each stage as shown below.
Dataflow Job Graph view on Google Cloud console
  7. After the job completes, you will be able to see the transformed document inserted into BigQuery.
Document read from MongoDB flattened and loaded to BigQuery
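The transform.js code from step 2 is not reproduced above. Based on its description (copy Address/City to a top-level City field, then delete Address), a minimal sketch could look like the following. It assumes that the template hands the function each document as an object that supports bracket access, and it has only been exercised here against a plain JavaScript object with made-up values, so treat it as a starting point rather than the exact code used in this tutorial.

```javascript
// Sketch of transform.js, reconstructed from the description in step 2.
// Assumption: inputDoc is an object-like value that supports bracket access.
function transform(inputDoc) {
  var outputDoc = inputDoc;
  var address = inputDoc["Address"];

  // Only flatten when the embedded Address document is present.
  if (address) {
    // Promote the embedded Address.City value to a top-level City field.
    outputDoc["City"] = address["City"];
    // Delete the embedded Address document.
    delete outputDoc["Address"];
  }

  return outputDoc;
}

// Local check with a hypothetical input document (values are made up).
var flattened = transform({
  Name: "Jane Doe",
  Department: "Engineering",
  Address: { Phone: "123455", City: "London" }
});
console.log(JSON.stringify(flattened));
```

The guard on Address is a design choice in this sketch: documents without that field pass through unchanged instead of raising an error inside the pipeline.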

Conclusion

In this post, we introduced UDFs for the MongoDB to BigQuery Dataflow templates and showed how they transform documents read from MongoDB using custom, user-defined JavaScript functions stored in Google Cloud Storage buckets. The post also includes a simple tutorial on how to set up MongoDB Atlas, Google Cloud, and the UDFs.
