Introduction to Atlas Stream Processing Development

Hubert Nguyen • 7 min read • Published Mar 22, 2024 • Updated Mar 22, 2024

Welcome to this MongoDB Stream Processing tutorial! In this guide, we will quickly set up a coding workflow and have you write and run your first stream processor in no time. We'll learn how to create a new stream processing instance, conveniently code and execute stream processors from Visual Studio Code, and aggregate stream data, thus opening the door to a whole new field of the MongoDB Atlas developer data platform.

What we'll cover

Prerequisites
Setup
Create an Atlas stream processing instance
Set up Visual Studio Code for MongoDB Atlas Stream Processing
The anatomy of a stream processor
Let's execute a stream processor!
Hard-coded data in $source declaration
Simplest stream processor
Stream processing aggregation
Add timestamps to the data

Prerequisites

Basic knowledge of the MongoDB Aggregation Pipeline and Query API
Ideally, read the official high-level Atlas Stream Processing overview
A live MongoDB Atlas cluster that supports stream processing
Visual Studio Code + MongoDB for VS Code extension

Setup

Create an Atlas stream processing instance

We need to have an Atlas Stream Processing Instance (SPI) ready. Follow the steps in the tutorial Get Started with Atlas Stream Processing: Creating Your First Stream Processor until we have our connection string and username/password, then come back here.

Don't forget to add your IP address to the Atlas Network Access list so that the client can reach the instance.

Set up Visual Studio Code for MongoDB Atlas Stream Processing

Thanks to the MongoDB for VS Code extension, we can rapidly develop stream processing (SP) aggregation pipelines and run them directly from inside a VS Code MongoDB playground. This provides a much better developer experience. In the rest of this article, we'll be using VS Code.

Such a playground is a Node.js environment where we can execute JavaScript code that interacts with a live stream processor on MongoDB Atlas. To get started, install VS Code and the MongoDB for VS Code extension.

Below is a great tutorial about installing the extension. It also lists some shell commands we'll need later.

Tutorial: Introducing Atlas Stream Processing Support Within the MongoDB for VS Code Extension
Goal: If everything works, we should see our live SP connection in the MongoDB Extension tab.

The anatomy of a stream processor

Within a stream processing instance (=cluster), it is possible to create and run multiple stream processors (=stream processing pipelines) simultaneously.

A single stream processor (SP) is very similar to a MongoDB aggregation pipeline. It is described by an array of processing stages. However, there are some differences. The most basic SP can be created using only its data source (we'll have executable examples next).

Code Snippet
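
As a minimal sketch of what "only its data source" can look like, the pipeline below contains a single $source stage. The connection name "myKafkaConnection" and the topic "sensor_events" are hypothetical placeholders for whatever is registered in your own instance, and the call is guarded because sp only exists in a playground or shell connected to a stream processing instance.

```javascript
// Hypothetical names: replace with a connection and topic from your own instance.
const sourceStage = {
  $source: {
    connectionName: "myKafkaConnection",
    topic: "sensor_events",
  },
};

// The most basic stream processor: an array holding just the source stage.
const minimalPipeline = [sourceStage];

// `sp` is only defined when connected to a stream processing instance.
if (typeof sp !== "undefined") {
  sp.process(minimalPipeline);
}
```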

A more realistic stream processor would contain at least one aggregation stage, and there can be a large number of stages performing various operations on the incoming data stream. There's a generous limit of 16 MB for the total processor size.

Code Snippet
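
A sketch of a multi-stage processor, under stated assumptions: events arrive from a Kafka connection, are filtered with $match, and are written to an Atlas collection with $merge. All connection, topic, database, and collection names are hypothetical, and stageNames is an illustrative helper, not part of any MongoDB API.

```javascript
// Hypothetical connection, topic, database, and collection names throughout.
const realisticPipeline = [
  // 1. Where the events come from.
  { $source: { connectionName: "myKafkaConnection", topic: "sensor_events" } },
  // 2. Keep only events with a positive "value" field.
  { $match: { value: { $gt: 0 } } },
  // 3. Where the processed events go.
  {
    $merge: {
      into: { connectionName: "myAtlasCluster", db: "demo", coll: "events" },
    },
  },
];

// Illustrative helper: list the operator used by each stage, in order.
function stageNames(pipeline) {
  return pipeline.map((stage) => Object.keys(stage)[0]);
}

console.log(stageNames(realisticPipeline)); // [ '$source', '$match', '$merge' ]
```

Listing the stage names this way is a quick sanity check that the source comes first and the sink comes last before the pipeline is sent to Atlas.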

To speed up the development loop, there's an sp.process() function, which starts an ephemeral stream processor that won't persist in your stream processing instance.
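
The difference between the two workflows can be sketched as a pair of helpers. runEphemeral and runPersistent are hypothetical names introduced here for illustration; they take the sp object as a parameter so the sketch can also be exercised outside a playground. Depending on your setup, a persistent processor may also need a sink stage such as $merge in its pipeline.

```javascript
// Ephemeral: runs the pipeline without saving anything in the instance.
function runEphemeral(sp, pipeline) {
  return sp.process(pipeline);
}

// Persistent: the processor is stored under `name` until it is dropped.
// Returns a cleanup function that stops and removes the processor.
function runPersistent(sp, name, pipeline) {
  sp.createStreamProcessor(name, pipeline);
  sp[name].start(); // equivalent to sp.<name>.start()
  return function cleanup() {
    sp[name].stop();
    sp[name].drop();
  };
}
```

In a connected playground, runEphemeral(sp, pipeline) is all that is needed while iterating; runPersistent(sp, "myProcessor", pipeline) returns a function to call once the processor is no longer wanted.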

Let's execute a stream processor!

Let's create basic stream processors and build our way up. First, we need to have some data! Atlas Stream Processing supports several data sources for incoming streaming events. These sources include:

Hard-coded data declaration in $source.
Kafka streams.
MongoDB Atlas databases.

Hard-coded data in $source declaration

For quick testing or self-contained examples, having a small set of hard-coded data is a very convenient way to produce events. We can declare an array of events. Here's an extremely simple example, and note that we'll make some tweaks later to cover different use cases.

Simplest stream processor

In VS Code, we run an ephemeral stream processor with sp.process(). This way, we don't have to use sp.createStreamProcessor() and sp.<name>.drop() constantly as we would for SPs meant to be saved permanently in the instance.

Code Snippet
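
A minimal playground sketch, assuming a handful of made-up events with "id" and "value" fields (the field names match the ones discussed later in this article; the values themselves are arbitrary). The sp.process() call is guarded so the file also loads outside a connected playground.

```javascript
// Hard-coded events: the values are arbitrary sample data.
const srcHardCoded = {
  $source: {
    documents: [
      { id: "entity_1", value: 1 },
      { id: "entity_1", value: 3 },
      { id: "entity_2", value: 7 },
      { id: "entity_1", value: 4 },
      { id: "entity_2", value: 1 },
    ],
  },
};

// Run an ephemeral stream processor made of the source stage only.
// `sp` is only defined when connected to a stream processing instance.
if (typeof sp !== "undefined") {
  sp.process([srcHardCoded]);
}
```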

Upon running this playground, we should see data coming out in the VS Code "OUTPUT" tab (CTRL+SHIFT+U to make it appear).

Note: It can take a few seconds for the SP to be uploaded and executed, so don't expect an immediate output.

Code Snippet

This simple SP can be used to ensure that data is coming into the SP and that there are no problems upstream with our source. The timestamps in the output were generated at ingestion time.

Stream processing aggregation

Building on what we have, adding a simple aggregation pipeline to our SP is easy. Below, we're adding a $group stage to aggregate/accumulate incoming messages' "value" field into an array for the requested interval.

Note that the "w" stage (w stands for "Window") of the SP pipeline contains an aggregation pipeline inside. With Stream Processing, we have aggregation pipelines in the stream processing pipeline.

This stage features a $tumblingWindow, which defines the time span over which the aggregation runs. Remember that streams are supposed to be continuous, so a window is similar to a buffer.

interval defines the time length of a window. Since the stream is continuous, we can only aggregate one slice of it at a time.

idleTimeout defines how long the $source can remain idle before closing the window. This is useful if the stream is not sustained.

Code Snippet
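
A sketch of such a windowed processor, assuming the same made-up "id"/"value" events as before and an arbitrary 2-second interval and idle timeout. The small int helper is an illustrative addition: it wraps sizes in NumberInt() when the shell provides it and falls back to a plain number otherwise.

```javascript
// Use NumberInt() when the shell provides it, else a plain number.
const int = typeof NumberInt === "function" ? NumberInt : (n) => n;

// Arbitrary sample events.
const src = {
  $source: {
    documents: [
      { id: "entity_1", value: 1 },
      { id: "entity_1", value: 3 },
      { id: "entity_2", value: 7 },
      { id: "entity_1", value: 4 },
      { id: "entity_2", value: 1 },
    ],
  },
};

// "w" for window: an aggregation pipeline nested inside a window stage.
const w = {
  $tumblingWindow: {
    interval: { size: int(2), unit: "second" },
    idleTimeout: { size: int(2), unit: "second" },
    pipeline: [
      // One output document per distinct id, with its values collected in an array.
      { $group: { _id: "$id", values: { $push: "$value" } } },
    ],
  },
};

// `sp` is only defined when connected to a stream processing instance.
if (typeof sp !== "undefined") {
  sp.process([src, w]);
}
```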

Let it run for a few seconds, and we should get an output similar to the following. $group will create one document per distinct "id" value and aggregate the relevant values into a new array field, "values."

Code Snippet
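
To get a feel for the shape of the grouped result without connecting to Atlas, the $group/$push step can be emulated locally in plain JavaScript. This is only an approximation for a single window: groupValuesById is a hypothetical helper, the events are arbitrary sample data, and a real stream processor adds metadata fields and does not guarantee this document order.

```javascript
// Local emulation of { $group: { _id: "$id", values: { $push: "$value" } } }.
function groupValuesById(events) {
  const groups = new Map();
  for (const { id, value } of events) {
    if (!groups.has(id)) {
      groups.set(id, []);
    }
    groups.get(id).push(value);
  }
  return [...groups].map(([id, values]) => ({ _id: id, values }));
}

// Arbitrary sample events, all assumed to fall in the same window.
const sampleEvents = [
  { id: "entity_1", value: 1 },
  { id: "entity_1", value: 3 },
  { id: "entity_2", value: 7 },
  { id: "entity_1", value: 4 },
  { id: "entity_2", value: 1 },
];

console.log(JSON.stringify(groupValuesById(sampleEvents)));
// [{"_id":"entity_1","values":[1,3,4]},{"_id":"entity_2","values":[7,1]}]
```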

Depending on the $tumblingWindow settings, the aggregation will output several documents that match the timestamps. For example, these settings...

Code Snippet

...will yield the following aggregation output:

Code Snippet

See how the windowStartTimestamp and windowEndTimestamp fields show the 10-second intervals as requested (14:51:30 to 14:51:40 etc.).
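
The 10-second boundaries can be reasoned about with simple arithmetic. The sketch below assumes that windows are aligned to the Unix epoch, which is consistent with the 14:51:30 and 14:51:40 boundaries above but is an assumption rather than documented behavior. tumblingWindowFor is a hypothetical helper, and the date is arbitrary.

```javascript
// A window definition consistent with 10-second intervals.
const tenSecondInterval = { size: 10, unit: "second" };

// Compute the tumbling window an event time falls into,
// assuming windows are aligned to the Unix epoch.
function tumblingWindowFor(timestamp, sizeSeconds) {
  const sizeMs = sizeSeconds * 1000;
  const startMs = Math.floor(timestamp.getTime() / sizeMs) * sizeMs;
  return {
    windowStartTimestamp: new Date(startMs),
    windowEndTimestamp: new Date(startMs + sizeMs),
  };
}

const win = tumblingWindowFor(
  new Date("2024-03-22T14:51:34Z"),
  tenSecondInterval.size
);
console.log(win.windowStartTimestamp.toISOString()); // 2024-03-22T14:51:30.000Z
console.log(win.windowEndTimestamp.toISOString()); // 2024-03-22T14:51:40.000Z
```

Because tumbling windows do not overlap, every event belongs to exactly one window, and an event stamped exactly at 14:51:40 would open the next one.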

Additional learning resources: building aggregations

Atlas Stream Processing uses the MongoDB Query API. You can learn more about the MongoDB Query API with the official Query API documentation, [free] interactive course, and tutorial.

Important: Stream Processing aggregation pipelines do not support all database aggregation operations and have additional operators specific to streaming, like $tumblingWindow. Check the official Stream Processing aggregation documentation.

Add timestamps to the data

Even when we hard-code data, there's an opportunity to provide a timestamp in case we want to perform $sort operations and better mimic a real use case. This would be the equivalent of an event-time timestamp embedded in the message.

There are many other types of timestamps if we use a live Kafka stream (producer-assigned, server-side, ingestion-time, and more). Here, we add a timestamp to our messages and use the $source stage's "timeField" property to make it the authoritative stream timestamp.

Code Snippet
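
A sketch of a hard-coded source with event-time timestamps, assuming each message carries an ISO 8601 string in a field named "timestamp" (a hypothetical field name) that timeField converts with $dateFromString. The dates and values are arbitrary.

```javascript
const srcWithTimestamps = {
  $source: {
    // Use the embedded "timestamp" string as the authoritative event time.
    timeField: { $dateFromString: { dateString: "$timestamp" } },
    // Arbitrary sample events with made-up timestamps.
    documents: [
      { id: "entity_1", value: 1, timestamp: "2024-03-22T14:51:31Z" },
      { id: "entity_1", value: 3, timestamp: "2024-03-22T14:51:33Z" },
      { id: "entity_2", value: 7, timestamp: "2024-03-22T14:51:36Z" },
      { id: "entity_1", value: 4, timestamp: "2024-03-22T14:51:42Z" },
      { id: "entity_2", value: 1, timestamp: "2024-03-22T14:51:47Z" },
    ],
  },
};

// `sp` is only defined when connected to a stream processing instance.
if (typeof sp !== "undefined") {
  sp.process([srcWithTimestamps]);
}
```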

At this point, we have everything we need to test new pipelines and create proofs of concept in a convenient and self-contained form. In a subsequent article, we will demonstrate how to connect to various streaming sources.

Tips and tricks

At the time of publishing, Atlas Stream Processing is in public preview and there are a number of known Stream Processing limitations that you should be aware of, such as regional data center availability, connectivity with other Atlas projects, and user privileges.

When running an ephemeral stream processor via sp.process(), many errors (JSON serialization issues, late data, divide by zero, $validate errors) that might otherwise have gone to a dead letter queue (DLQ) are sent to the default output to help you debug.

For SPs created with sp.createStreamProcessor(), you'll have to configure your DLQ manually. Consult the documentation for this. On the "Manage Stream Processor" documentation page, search for "Define a DLQ."
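
As a rough sketch, a DLQ is passed in the options argument of sp.createStreamProcessor(). The connection, database, and collection names below are hypothetical, createWithDlq is an illustrative helper, and the exact option shape should be checked against the "Define a DLQ" documentation.

```javascript
// Hypothetical names: the connection must point to an Atlas database
// registered in the stream processing instance.
const dlqOptions = {
  dlq: {
    connectionName: "myAtlasCluster",
    db: "streams_debug",
    coll: "dead_letters",
  },
};

// Create a persistent processor whose failed events land in the DLQ collection.
function createWithDlq(sp, name, pipeline) {
  return sp.createStreamProcessor(name, pipeline, dlqOptions);
}
```

In a connected playground, this would be called as createWithDlq(sp, "myProcessor", pipeline), after which the processor is started as usual.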

After merging data into an Atlas database, it is possible to use existing aggregation pipeline building tools, such as the builder in the Atlas GUI or MongoDB Compass, to create and debug pipelines. Since these tools are meant for the core database API, remember that some operators are not supported by stream processors, and streaming features like windowing are not currently available in them.

Conclusion

With that, you should have everything you need to get your first stream processor up and running. In a future post, we will dive deeper into connecting to different sources of data for your stream processors.

If you have any questions, share them in our community forum, meet us during local MongoDB User Groups (MUGs), or come check out one of our MongoDB .local events.
