Introducing MongoDB Connector for Apache Kafka version 1.9

Robert Walters

#Kafka

Today, MongoDB released version 1.9 of the MongoDB Connector for Apache Kafka! This article highlights the key features of this new release!

Pre/Post document states

In MongoDB 6.0, Change Streams added the ability to retrieve the before and after state of an entire document. To enable this functionality on a collection, set the changeStreamPreAndPostImages parameter in the createCollection command:

db.createCollection(
   "temperatureSensor",
   { changeStreamPreAndPostImages: { enabled: true } }
)

Alternatively, for existing collections, use collMod as shown below:

db.runCommand( {
   collMod: <collection>,
   changeStreamPreAndPostImages: { enabled: <boolean> }
} )

Once the collection is configured for pre and post images, you can set the change.stream.full.document.before.change source connector parameter to include this extra information in the change event.

For example, consider this source definition:

{
  "name": "mongo-simple-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "<< MONGODB CONNECTION STRING >>",
    "database": "test",
    "collection": "temperatureSensor",
    "change.stream.full.document.before.change":"whenavailable"
  }
}

When the following document is inserted:

db.temperatureSensor.insertOne({'sensor_id':1,'value':100})

Then an update is applied:

db.temperatureSensor.updateOne({'sensor_id':1},{ $set: { 'value':105}})

The change stream event written to the Kafka topic resembles the following:

{
  "_id": {
    "_data": "82636D39C8000000012B022C0100296E5A100444B0F5E386F04767814F28CB4AAE7FEE46645F69640064636D399B732DBB998FA8D67E0004"
  },
  "operationType": "update",
  "clusterTime": {
    "$timestamp": {
      "t": 1668102600,
      "i": 1
    }
  },
  "wallTime": {
    "$date": 1668102600716
  },
  "ns": {
    "db": "test",
    "coll": "temperatureSensor"
  },
  "documentKey": {
    "_id": {
      "$oid": "636d399b732dbb998fa8d67e"
    }
  },
  "updateDescription": {
    "updatedFields": {
      "value": 105
    },
    "removedFields": [],
    "truncatedArrays": []
  },
  "fullDocumentBeforeChange": {
    "_id": {
      "$oid": "636d399b732dbb998fa8d67e"
    },
    "sensor_id": 1,
    "value": 100
  }
}

Note the fullDocumentBeforeChange key includes the original document before the update occurred.
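The example above captures the pre-image; if you also want the post-update state of the document in the event, you can additionally set the source connector's change.stream.full.document property. A minimal sketch based on the same source definition (the connection string remains a placeholder):

{
  "name": "mongo-simple-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "<< MONGODB CONNECTION STRING >>",
    "database": "test",
    "collection": "temperatureSensor",
    "change.stream.full.document": "updateLookup",
    "change.stream.full.document.before.change": "whenavailable"
  }
}

With updateLookup, the change event also includes a fullDocument field containing the document as looked up after the update.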

Starting the connector at a specific time

Prior to version 1.9, when the connector started as a source, it opened a MongoDB change stream and processed any new data. To copy all the existing data in the collection before processing new data, you specified the "copy.existing" property. One frequent user request has been to start the connector from a specific timestamp rather than from the moment the connector starts. In 1.9, a new parameter called startup.mode was added to specify when to start writing data.

startup.mode=latest (default)

"latest" is the default behavior: the connector begins processing new changes from the moment it starts and ignores any existing data in the collection.
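Since this is the default, nothing needs to be set explicitly; adding the property to the source configuration is equivalent:

"startup.mode": "latest"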

startup.mode=timestamp

"timestamp" allows you to start processing at a specific point in time, as defined by the additional startup.mode.timestamp.* properties. For example, to start the connector from 7 AM on November 21, 2022, set the value as follows:

startup.mode.timestamp.start.at.operation.time='2022-11-21T07:00:00Z'

Supported values are an ISO-8601 formatted date string, as shown above, or a BSON extended JSON format string.
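Putting it together, a source definition that begins processing at that point in time might look like the following sketch (the connector name and connection string are placeholders):

{
  "name": "mongo-timestamp-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "<< MONGODB CONNECTION STRING >>",
    "database": "test",
    "collection": "temperatureSensor",
    "startup.mode": "timestamp",
    "startup.mode.timestamp.start.at.operation.time": "2022-11-21T07:00:00Z"
  }
}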

startup.mode=copy.existing

This provides the same behavior as the existing configuration option "copy.existing=true". Note that "copy.existing" as a separate parameter is now deprecated. If you defined any granular copy.existing parameters, such as copy.existing.pipeline, prefix them with "startup.mode." (for example, copy.existing.pipeline becomes startup.mode.copy.existing.pipeline).
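As a sketch, a pre-1.9 configuration that copied only matching documents would migrate like this (the pipeline itself is illustrative):

Before (deprecated):

copy.existing=true
copy.existing.pipeline=[{"$match": {"sensor_id": 1}}]

After (1.9):

startup.mode=copy.existing
startup.mode.copy.existing.pipeline=[{"$match": {"sensor_id": 1}}]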

Reporting MongoDB errors to the DLQ

Kafka Connect supports writing errors to a dead letter queue (DLQ). In version 1.5 of the connector, you could write all exceptions to the DLQ by setting mongo.errors.tolerance='all'. One thing to note was that these were errors generated by the Kafka Connect framework rather than errors that occurred within MongoDB. Thus, if the sink connector failed to write to MongoDB due to a duplicate _id error, for example, this error wouldn't be written to the DLQ. In 1.9, errors generated within MongoDB are also reported to the DLQ.
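As a sketch, a sink definition that routes failed writes to a DLQ topic could resemble the following. The connector name, topic names, and connection string are placeholders, and the errors.* settings shown are the standard Kafka Connect error-handling properties:

{
  "name": "mongo-sink-with-dlq",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "connection.uri": "<< MONGODB CONNECTION STRING >>",
    "database": "test",
    "collection": "temperatureSensor",
    "topics": "temperature-events",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "temperature-events.dlq",
    "errors.deadletterqueue.context.headers.enable": "true"
  }
}

With errors.tolerance set to all and a DLQ topic configured, failed records, including MongoDB-generated errors in 1.9, are routed to the DLQ topic rather than stopping the connector.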

Behavior change on inferring schema

Prior to version 1.9 of the connector, if you were inferring schema and inserted a MongoDB document that contained arrays with different value data types, the connector was naive and would simply set the value type of the array to string, serializing each element as a string. For example, consider a document that resembles:

{
    "myfoo": [
      {
        "key1": 1
      },
      {
        "key1": 1,
        "key2": "dogs"
      }
    ]
  }

If we set output.schema.infer.value to true on a source connector, the message in the Kafka topic will resemble the following:

…
"fullDocument": {
…
    "myfoo": [
      "{\"key1\": 1}",
      "{\"key1\": 1, \"key2\": \"dogs\"}"
    ]
  },
…

Notice that the array items, which contain different value types, have been serialized as strings. In this example, the first item in the "myfoo" array is a subdocument with a single field, "key1", whose value is the integer 1; the next item is a subdocument with the same "key1" field plus another field, "key2", whose value is a string. When this scenario occurs, the pre-1.9 connector wraps each array item as a string. This behavior can also apply when different keys contain different data type values.

In version 1.9, when presented with this configuration, the connector no longer wraps the array items as strings; instead, it creates the appropriate schemas for arrays whose elements contain different data types.

The same document, when processed by version 1.9, will resemble:

 "fullDocument": {
    …
    "myfoo": [
      {
        "key1": 1,
      },
      {
        "key1": 1,
        "key2": "DOGS"
      }
    ]
  },

Note that this behavior is a breaking change, and that inferring schemas for very large arrays whose elements contain different data type values can cause performance degradation.
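For reference, value schema inference is enabled on the source connector with settings along these lines; this is a sketch, the connector name is a placeholder, and the JsonConverter is just one converter choice:

{
  "name": "mongo-inferred-schema-source",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSourceConnector",
    "connection.uri": "<< MONGODB CONNECTION STRING >>",
    "database": "test",
    "collection": "temperatureSensor",
    "output.format.value": "schema",
    "output.schema.infer.value": "true",
    "value.converter": "org.apache.kafka.connect.json.JsonConverter",
    "value.converter.schemas.enable": "true"
  }
}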

Download the latest version of the MongoDB Connector for Apache Kafka from Confluent Hub!

To learn more about the connector, read the MongoDB Online Documentation.

Questions? Ask on the MongoDB Developer Community Connectors and Integrations forum!