MongoDB Developer
MongoDB
MongoDB Developer Centerchevron-right
Developer Topicschevron-right
Productschevron-right
MongoDBchevron-right

Create a Data Pipeline for MongoDB Change Stream Using Pub/Sub BigQuery Subscription

Venkatesh Shanbhag, Stanimira VlaevaPublished Oct 28, 2022 • Updated Oct 28, 2022
GCPNodejsMongoDBChange StreamsJavaScript
Copy Link
facebook icontwitter iconlinkedin icon
random alt
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
On 1st October 2022, MongoDB and Google announced a set of open source Dataflow templates for moving data between MongoDB and BigQuery to run analyses on BigQuery using BQML and to bring back inferences to MongoDB. Three templates were introduced as part of this release, including the MongoDB to BigQuery CDC (change data capture) template.
MongoDB change stream
This template requires users to run the change stream on MongoDB, which will monitor inserts and updates on the collection. These changes will be captured and pushed to a Pub/Sub topic. The CDC template will create a job to read the data from the topic and get the changes, apply the transformation, and write the changes to BigQuery. The transformations will vary based on the user input while running the Dataflow job.
Alternatively, you can use a native Pub/Sub capability to set up a data pipeline between your MongoDB cluster and BigQuery. The Pub/Sub BigQuery subscription writes messages to an existing BigQuery table as they are received. Without the BigQuery subscription type, you need a pull or push subscription and a subscriber (such as Dataflow) that reads messages and writes them to BigQuery.
This article explains how to set up the BigQuery subscription to process data read from a MongoDB change stream. As a prerequisite, you’ll need a MongoDB Atlas cluster.
To set up a free tier cluster, you can register to MongoDB either from Google Cloud Marketplace or from the registration page. Follow the steps in the MongoDB documentation to configure the database user and network settings for your cluster.
On Google Cloud, we will create a Pub/Sub topic, a BigQuery dataset, and a table before creating the BigQuery subscription.

Create a BigQuery dataset

We’ll start by creating a new dataset for BigQuery in Google Cloud console.
Then, add a new table in your dataset. Define it with a name of your choice and the following schema:
Field nameType
idSTRING
source_dataSTRING
TimestampSTRING
Create a new dataset and table in BigQuery

Configure Google Cloud Pub/Sub

Next, we’ll configure a Pub/Sub schema and topic to ingest the messages from our MongoDB change stream. Then, we’ll create a subscription to write the received messages to the BigQuery table we just created.
For this section, we’ll use the Google Cloud Pub/Sub API. Before proceeding, make sure you have enabled the API for your project.
Define a Pub/Sub schema
From the Cloud Pub/Sub UI, Navigate to Create Schema.
Provide an appropriate identifier, such as “mdb-to-bq-schema,” to your schema. Then, select “Avro” for the type. Finally, add the following definition to match the fields from your BigQuery table:
Create a Cloud Pub/Sub schema
Create a Pub/Sub topic
From the sidebar, navigate to “Topics” and click on Create a topic.
Give your topic an identifier, such as “MongoDBCDC.” Enable the Use a schema field and select the schema that you just created. Leave the rest of the parameters to default and click on Create Topic.
Create a Cloud Pub/Sub topic
Subscribe to topic and write to BigQuery
From inside the topic, click on Create new subscription. Configure your subscription in the following way:
  • Provide a subscription ID — for example, “mdb-cdc.”
  • Define the Delivery type to Write to BigQuery.
  • Select your BigQuery dataset from the dropdown.
  • Provide the name of the table you created in the BigQuery dataset.
  • Enable Use topic schema.
You need to have a bigquery.dataEditor role on your service account to create a Pub/Sub BigQuery subscription. To grant access using the bq command line tool, run the following command:
Keep the other fields as default and click on Create subscription.
Create a Cloud Pub/Sub subscription

Set up a change stream on a MongoDB cluster

Finally, we’ll set up a change stream that listens for new documents inserted in our MongoDB cluster.
We’ll use Node.js but you can adapt the code to a programming language of your choice. Check out the Google Cloud documentation for more Pub/Sub examples using a variety of languages. You can find the source code of this example in the dedicated GitHub repository.
First, set up a new Node.js project and install the following dependencies.
Then, add an Avro schema, matching the one we created in Google Cloud Pub/Sub:
./document-message.avsc
Then create a new JavaScript module — index.mjs. Start by importing the required libraries and setting up your MongoDB connection string and your Pub/Sub topic name. If you don’t already have a MongoDB cluster, you can create one for free in MongoDB Atlas.
./index.mjs
After this, we can connect to our MongoDB instance and set up a change stream event listener. Using an aggregation pipeline, we’ll watch only for “insert” events on the specified collection. We’ll also define a 60-second timeout before closing the change stream.
./index.mjs
Finally, we’ll define the publishDocumentAsMessage() function that will:
  1. Transform every MongoDB document received through the change stream.
  2. Convert it to the data buffer following the Avro schema.
  3. Publish it to the Pub/Sub topic in Google Cloud.
Run the file to start the change stream listener:
Insert a new document in your MongoDB collection to watch it go through the data pipeline and appear in your BigQuery table!

Summary

There are multiple ways to load the change stream data from MongoDB to BigQuery and we have shown how to use the BigQuery subscription on Pub/Sub. The change streams from MongoDB are monitored, captured, and later written to a Pub/Sub topic using Java libraries.
The data is then written to BigQuery using BigQuery subscription. The datatype for the BigQuery table is set using Pub/Sub schema. Thus, the change stream data can be captured and written to BigQuery using the BigQuery subscription capability of Pub/Sub.

Further reading

  1. A data pipeline for MongoDB Atlas and BigQuery using Dataflow.
  2. Setup your first MongoDB cluster using Google Marketplace.
  3. Run analytics using BigQuery using BigQuery ML.
  4. How to publish a message to a topic with schema.

Copy Link
facebook icontwitter iconlinkedin icon
Rate this tutorial
star-empty
star-empty
star-empty
star-empty
star-empty
Related
Podcast

Making Diabetes Data More Accessible and Meaningful with Tidepool and MongoDB


May 16, 2022
Article

Massive Number of Collections


May 31, 2022
Tutorial

Kafka to MongoDB Atlas End to End Tutorial


May 13, 2022
Tutorial

Integrating MongoDB with Amazon Managed Streaming for Apache Kafka (MSK)


May 26, 2022
Table of Contents