Scaling out the Change Stream watcher/listener

I’ve set up a continuous Azure WebJob (without triggers) to act as a watcher for the MongoDB change stream. Everything seems to work fine. However, I’m curious how the listener will behave if I scale out the WebJob to multiple instances. Will it cause issues with the MongoDB stream if multiple instances connect to the same change stream at the same time? Is there a more elegant way to handle scaling out the watcher, or does it have to be a singleton?

One way to scale is to distribute the load by watching a different set of operations or documents.

Operations

For example, rather than one process watching all operations, you have one that listens for inserts, one that listens for deletes, and a third that listens for updates. You can then run one or more instances of each depending on your workload pattern.
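Something like this with the Java driver, as a rough sketch (the connection string, database, and collection names are placeholders): pass a $match stage to watch() so this instance only receives inserts, and run sibling processes that match "delete" or "update" instead.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;
import java.util.Collections;

public class InsertWatcher {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                client.getDatabase("test").getCollection("orders");

            // This instance only sees insert events; sibling processes can
            // match "delete" or "update" to split the remaining operations.
            for (ChangeStreamDocument<Document> event :
                    coll.watch(Collections.singletonList(
                        Aggregates.match(Filters.eq("operationType", "insert"))))) {
                System.out.println("insert on " + event.getDocumentKey());
            }
        }
    }
}
```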

Documents

You may also share the load by watching different documents. For example, you could use one of the fields of the ObjectId, such as the timestamp, with one watcher handling odd timestamps and a second handling even timestamps. Each watcher then receives notifications for a different set of documents.
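A hedged sketch of that partitioning, again with the Java driver and assuming the _id values are ObjectIds (connection details and names below are placeholders): a $match with $expr derives the seconds portion of the _id timestamp and keeps only events whose parity matches the one this instance was started with.

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Aggregates;
import com.mongodb.client.model.Filters;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import org.bson.Document;
import java.util.Arrays;
import java.util.Collections;

public class PartitionedWatcher {
    public static void main(String[] args) {
        // parity selects which half this watcher handles: 0 or 1
        int parity = Integer.parseInt(args.length > 0 ? args[0] : "0");

        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                client.getDatabase("test").getCollection("orders");

            // documentKey._id is present on every change event, including deletes.
            // Convert its embedded timestamp to seconds and keep events whose
            // parity matches this instance.
            Document parityExpr = new Document("$eq", Arrays.asList(
                new Document("$mod", Arrays.asList(
                    new Document("$divide", Arrays.asList(
                        new Document("$toLong",
                            new Document("$toDate", "$documentKey._id")),
                        1000)),
                    2)),
                parity));

            for (ChangeStreamDocument<Document> event : coll.watch(
                    Collections.singletonList(
                        Aggregates.match(Filters.expr(parityExpr))))) {
                System.out.println("partition " + parity + " got "
                    + event.getOperationType() + " on " + event.getDocumentKey());
            }
        }
    }
}
```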

We’re running into a similar situation. We have a Java service that scales horizontally while watching a collection. Is there a way to achieve true scaling with MongoDB change streams? Logic based on timestamps or operation types feels hacky rather than a clean implementation.


For a non-hacky solution, Kafka can always be used.

Right, we’re using the MongoDB sink connector to sink the events from Kafka into MongoDB. These events are then processed by a horizontally scaled service via change streams.


@steevej how does Kafka solve horizontal scaling of change stream processing by multiple instances of an application?

From Chapter 4, “Kafka Consumers: Reading Data from Kafka”, of Kafka: The Definitive Guide:

Kafka consumers are typically part of a consumer group. When multiple consumers are subscribed to a topic and belong to the same consumer group, each consumer in the group will receive messages from a different subset of the partitions in the topic.

Put differently, each instance of the application will receive a different subset of the change events.
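As an illustration (the broker address, topic name, and group id are assumptions), every instance of the application would run the same consumer code with the same group.id, and Kafka would assign each instance a disjoint subset of the topic’s partitions:

```java
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

public class ChangeProcessor {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        // Every instance uses the same group.id, so Kafka splits the
        // topic's partitions across the running instances.
        props.put("group.id", "change-processors");
        props.put("key.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
            "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("orders.changes"));
            while (true) {
                ConsumerRecords<String, String> records =
                    consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Each instance only sees records from its assigned partitions.
                    System.out.printf("partition=%d offset=%d value=%s%n",
                        record.partition(), record.offset(), record.value());
                }
            }
        }
    }
}
```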

Thanks, we’re already sourcing the data from Kafka into MongoDB via the Kafka sink connector, and the application is receiving the changes via MongoDB change streams. So we need to scale the stream without adding another messaging layer.

If your data already transits via Kafka, why not simply skip the change stream altogether and let your application instances be consumers in a different consumer group? Whatever is received by the sink connector would also be received by your application, completely removing the change stream and the need to scale it.

That would reduce the workload on the mongod server. It would also reduce latency, since your application would get the data sooner.

That’s a good solution, but the application joins a few collections in MongoDB (avoiding the use of Kafka Streams and joins directly on topics) before pushing the data upstream. Also, the Kafka messages do not contain the entire event, so MongoDB is updated with the changes first and then the change stream provides the latest state of the entire record.
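For reference, getting the latest state of the entire record from an update event typically means opening the change stream with the updateLookup full-document option, so each update event carries the current document rather than just the delta. A minimal sketch with the Java driver (connection details and names are placeholders):

```java
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.changestream.ChangeStreamDocument;
import com.mongodb.client.model.changestream.FullDocument;
import org.bson.Document;

public class LatestStateWatcher {
    public static void main(String[] args) {
        try (MongoClient client = MongoClients.create("mongodb://localhost:27017")) {
            MongoCollection<Document> coll =
                client.getDatabase("test").getCollection("orders");

            // UPDATE_LOOKUP makes update events include the current full
            // document, not just the changed fields.
            for (ChangeStreamDocument<Document> event :
                    coll.watch().fullDocument(FullDocument.UPDATE_LOOKUP)) {
                System.out.println("latest state: " + event.getFullDocument());
            }
        }
    }
}
```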