Parallelizing reads from the change stream of a single collection across multiple Kafka Connect source connectors

Hi there,

I have a collection in MongoDB and want to listen to changes on this collection using the MongoDB Kafka Connector. Currently I have this config to get the full document:

change.stream.full.document=updateLookup
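For context, that property sits in an otherwise standard source connector config, roughly like this (connector name, connection string, database, collection, and topic prefix below are placeholders, not my real values):

# minimal MongoDB Kafka source connector config (placeholder values)
name=mongo-source
connector.class=com.mongodb.kafka.connect.MongoSourceConnector
connection.uri=mongodb://mongo1:27017
database=mydb
collection=mycollection
topic.prefix=mongo
change.stream.full.document=updateLookup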

The problem is that the collection in MongoDB changes really fast: about 10k changes/sec, and a full record is around 100 KB. From my monitoring, the read speed of Kafka Connect is only around 1k changes/sec, so it can't keep up with the changes in MongoDB. It also throws the following exception:

com.mongodb.MongoQueryException: Query failed with error code 136 and error message 'Error in $cursor stage :: caused by :: errmsg: "CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(21324343904390409389)"' on server 10.21.31.22:27017

Is there any way to speed it up (parallelize it)?

I'm thinking of running 10 Kafka Connect source connectors, each listening to a specific query, i.e. with configs like this:
kafka connect 1:
pipeline= [ match document id in range [a1, b1] ]

kafka connect 2:
pipeline= [ match document id in range [a2, b2] ]

...

kafka connect 10:
pipeline= [ match document id in range [a10, b10] ]

Is it possible?

Yes, that is the recommended way to scale the source (by using multiple connectors, each with its own pipeline defined).

