MongoDB Kafka source connector pipeline for multiple collections isn't working

Hello! We're trying to get messages from three collections in one database via a single connector. The pipeline in our config is similar to the one in the documentation:

{
  "name": "<connector_name>",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "batch.size": "1000",
    "transforms": "dropPrefix",
    "database": "<db_name>",
    "collection": "",
    "copy.existing.pipeline": "[{\"$match\": {\"ns.coll\": {\"$regex\": /^(\"<collection_1>|<collection_2>|<collection_3>\")$/}}}]",
    "pipeline": "[{\"$match\": {\"ns.coll\": {\"$regex\": /^(\"<collection_1>|<collection_2>|<collection_3>\")$/}}}]",
    "key.converter.schemas.enable": "false",
    "output.json.formatter": "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
    "connection.uri": "<connection_uri>",
    "name": "<connector_name>",
    "topic.creation.default.partitions": "3",
    "topic.creation.default.replication.factor": "3",
    "value.converter.schemas.enable": "false",
    "transforms.dropPrefix.type": "org.apache.kafka.connect.transforms.RegexRouter",
    "transforms.dropPrefix.replacement": "<topic_name>",
    "transforms.dropPrefix.regex": "(.*)<db_name>(.*)",
    "copy.existing": "true",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}

The connector starts successfully, but that's all: no messages arrive in the topic. Can anyone tell us what exactly we are doing wrong, please?

Is there anything in the Kafka Connect log?

If you remove the "pipeline" and "copy.existing.pipeline" settings, does it capture events?

Also try removing "collection": "", since you only need to specify the database in your scenario.

Thank you for the response! Answering your questions:

  1. No, nothing suspicious; the connector successfully connected to the DB, but that's all.
  2. If we remove the "pipeline" and "copy.existing.pipeline" parameters, the connector successfully captures events from the specified collections.
  3. Already tried that; nothing changed.

Let's start with a minimal connector config and go from there:

{
  "name": "<connector_name>",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "database": "<db_name>",
    "connection.uri": "<connection_uri>",
    "name": "<connector_name>",
    "value.converter": "org.apache.kafka.connect.storage.StringConverter",
    "key.converter": "org.apache.kafka.connect.storage.StringConverter"
  }
}

See if that generates events.
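Also, note that JSON has no /…/ regex literals, so if you do need the regex form, "$regex" can be written as a plain string instead. An untested sketch, using the same placeholders as in your config:

```json
"pipeline": "[{\"$match\": {\"ns.coll\": {\"$regex\": \"^(<collection_1>|<collection_2>|<collection_3>)$\"}}}]"
```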

After a lot of tests, the connector works with the following configuration:

    {
      "name" : "<connector_name>",
      "config" : {
        "batch.size" : "1000",
        "connection.uri" : "<connection.uri>",
        "connector.class" : "com.mongodb.kafka.connect.MongoSourceConnector",
        "copy.existing" : "true",
        "database" : "<db_name>",
        "key.converter" : "org.apache.kafka.connect.storage.StringConverter",
        "key.converter.schemas.enable" : "false",
        "name" : "<connector_name>",
        "output.json.formatter" : "com.mongodb.kafka.connect.source.json.formatter.SimplifiedJson",
        "pipeline" : "[   { $match: { \"ns.coll\": { \"$in\": [\"<collection_1>\", \"<collection_2>\", \"<collection_3>\" ] } } } ]",
        "transforms" : "dropPrefix",
        "transforms.dropPrefix.regex" : "(.*)<db_name>(.*)",
        "transforms.dropPrefix.replacement" : "<topic_name>",
        "transforms.dropPrefix.type" : "org.apache.kafka.connect.transforms.RegexRouter",
        "value.converter" : "org.apache.kafka.connect.storage.StringConverter",
        "value.converter.schemas.enable" : "false"
      }
    }

The only major difference is the pipeline format. So there's another question: what is wrong with the pipeline version from the documentation, or is there another root cause for this issue?
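One way to narrow this down is to run both pipeline strings through a JSON parser. A minimal sketch (collection names are placeholders; the connector's own parser may be more lenient than strict JSON, but the mongo-shell-style /…/ regex literal is the obvious suspect):

```python
import json

# The pipeline value after the outer connector config is parsed:
# the documentation-style version embeds a bare /regex/ literal.
regex_pipeline = '[{"$match": {"ns.coll": {"$regex": /^(coll_1|coll_2|coll_3)$/}}}]'

# The version that works expresses the same match as plain JSON.
in_pipeline = '[{"$match": {"ns.coll": {"$in": ["coll_1", "coll_2", "coll_3"]}}}]'

def is_valid_json(s):
    """Return True if s parses as strict JSON, False otherwise."""
    try:
        json.loads(s)
        return True
    except ValueError:
        return False

print(is_valid_json(regex_pipeline))  # /…/ is shell syntax, not JSON
print(is_valid_json(in_pipeline))
```

Even where a lenient parser accepts the /…/ literal, the escaped quotes placed inside the pattern in the original config would become part of the pattern and so match no real collection name.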

A small update: with the configuration above, after restarting the Kafka Connect node because of an out-of-memory issue, some kind of topic re-initialization happened and all historical messages were re-uploaded to the topic. What could have caused this?
Thank you.

You have copy.existing set to true, so the connector will copy all the existing data in the collections before opening the change stream and processing current events.

Is there any way to avoid message duplication without losing messages, other than using two connectors (one with copy.existing: true and one without)? We need all the existing data, but we don't want to duplicate it, because there is a lot of it and re-uploading causes issues.