Duplicate Key Errors

Hi team,

In version 1.7, the MongoDB Kafka Connector removed the following properties:

  • max.num.retries
  • retries.defer.timeout

We have designed two streams that update the same document, using UpdateOneBusinessKeyTimestampStrategy. It is possible that both streams arrive at the same time, in which case both will try to insert the document.

Before version 1.7, the Kafka Connector handled this error and retried using the parameters described above.
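To make the setup concrete, here is a sketch of a sink configuration along these lines (connection details, topics, database/collection and the business-key field are illustrative placeholders, not our real values, and the id-strategy/projection entries are just the settings typically paired with this write model strategy). The last two properties are the retry settings that no longer exist in 1.7:

```json
{
  "name": "mongo-sink-updates",
  "config": {
    "connector.class": "com.mongodb.kafka.connect.MongoSinkConnector",
    "topics": "stream-a,stream-b",
    "connection.uri": "mongodb://mongo1:27017,mongo2:27017,mongo3:27017/?replicaSet=rs0",
    "database": "test",
    "collection": "target",
    "writemodel.strategy": "com.mongodb.kafka.connect.sink.writemodel.strategy.UpdateOneBusinessKeyTimestampStrategy",
    "document.id.strategy": "com.mongodb.kafka.connect.sink.processor.id.strategy.PartialValueStrategy",
    "value.projection.type": "AllowList",
    "value.projection.list": "businessKey",
    "max.num.retries": "3",
    "retries.defer.timeout": "5000"
  }
}
```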

In this new version it looks like this is no longer handled by the connector. As Robert says on the blog:

This feature was originally intended to address challenges such as network connection issues. Since that time the MongoDB drivers have implemented native capabilities that handle retry logic: The Kafka Connector uses the MongoDB Java driver and has retries enabled by default so there are no changes or extra configuration you need to do to enable retry in the Kafka Connector.

Also, if we check the documentation (https://www.mongodb.com/docs/manual/core/retryable-writes/#duplicate-key-errors-on-upsert), the expected behavior is that the driver or the server handles this issue and retries the operation. But this is not the case here.
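As I understand that page, the case it covers is a single updateOne upsert whose filter is an equality match on the field behind the unique index, roughly this shape (field names and values are only illustrative):

```javascript
// The shape the retryable-writes page talks about: a single updateOne with
// upsert: true and a filter that is an equality match on the unique-index
// field (_id here). This is the case where a duplicate key error is supposed
// to be retried automatically.
db.target.updateOne(
  { _id: 1 },
  { $set: { a: 1 } },
  { upsert: true }
);
```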

I know that we can manage it using the DLQ, but that means we lose the order of the messages, because the DLQ ends up with both the message that hit the duplicate key error and the following messages of the batch.
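To be explicit, this is the kind of DLQ setup I mean (the topic name is a placeholder):

```json
{
  "errors.tolerance": "all",
  "errors.deadletterqueue.topic.name": "mongo-sink-dlq",
  "errors.deadletterqueue.context.headers.enable": "true",
  "errors.log.enable": "true",
  "errors.log.include.messages": "true"
}
```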

I did some tests with a PSA (primary-secondary-arbiter) cluster:

  • Shell: two shells open, secondary killed. Execute two updateOne operations with upsert and { w: "majority", wtimeout: 50000 }, for example { _id: 1, a: 1 } and { _id: 1, b: 1 } (see the sketch after this list). After both upserts the resulting document was { _id: 1, a: 1, b: 1 }, so the server or the shell handles the duplicate key error.
  • Bulk in the shell: two shells open, secondary killed. Execute two bulk writes with upsert and { w: "majority", wtimeout: 50000 }, again { _id: 1, a: 1 } and { _id: 1, b: 1 }. After both upserts the resulting document was { _id: 1, a: 1 }; one of the shells gets a duplicate key error.
  • Kafka Connect: two streams, secondary killed. Execute two streams with UpdateOneBusinessKeyTimestampStrategy and { w: "majority", wtimeout: 50000 }, again { _id: 1, a: 1 } and { _id: 1, b: 1 }. After both upserts the resulting document was { _id: 1, a: 1 }, and you can see the duplicate key error in the connector logs.
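Roughly the commands behind the first two tests (the wtimeout and the document contents are just the values I happened to use; shell 1 sets a and shell 2 sets b):

```javascript
// Shell test: with the secondary down in the PSA set, w:"majority" writes
// wait on the write concern, so running these from two shells at the same
// time makes both upserts race on the same _id.

// Shell 1
db.target.updateOne(
  { _id: 1 },
  { $set: { a: 1 } },
  { upsert: true, writeConcern: { w: "majority", wtimeout: 50000 } }
);

// Shell 2 (same upsert, different field)
db.target.updateOne(
  { _id: 1 },
  { $set: { b: 1 } },
  { upsert: true, writeConcern: { w: "majority", wtimeout: 50000 } }
);

// Bulk variant (second test): one bulkWrite per shell, again with only the
// field name changing between the two shells. This is where one shell ends
// up with the duplicate key error.
db.target.bulkWrite(
  [
    { updateOne: { filter: { _id: 1 }, update: { $set: { a: 1 } }, upsert: true } }
  ],
  { writeConcern: { w: "majority", wtimeout: 50000 } }
);
```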

I know that PSA is not a recommended topology, but it is the easiest way to reproduce this error and to be sure both operations arrive at the same time.

Please let me know your thoughts.

Regards,

Juan

Hi @Juan!

Unfortunately, this appears to be a Server issue, possibly made worse by the PSA topology. You could replicate it via the shell alone.

I don’t think the Kafka Connector is the right layer to report this against, as it relies on the server and the driver retry process. In this scenario the error would be reported to the DLQ, if configured, and would then have to be processed manually, just as you would have to do in the shell or in a driver.

All the best,

Ross