Connection Reset Errors

We’re seeing an abundance of the errors below in a sharded cluster environment hosted in AWS. Any insight into how to debug this? We’ve tinkered with TCP keepalive on the servers (currently set to 120 seconds) and maxIdleTime on the client without any noticeable change.

MongoDB Server Version: 4.4.2
Java Driver: org.mongodb:mongodb-driver-reactivestreams:1.13.1
io.reactivex.rxjava3:rxjava:3.0.3
Architecture: arm64
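
For anyone following along, the server-side keepalive referred to above is the Linux kernel’s `tcp_keepalive_time`. A minimal sketch of inspecting it and lowering it to the 120 seconds recommended by the MongoDB production notes (which matches what we already have set):

```shell
# Show the current idle time (in seconds) before the kernel sends the
# first TCP keepalive probe; the MongoDB production notes recommend 120.
sysctl net.ipv4.tcp_keepalive_time

# Apply the lower value immediately...
sudo sysctl -w net.ipv4.tcp_keepalive_time=120

# ...and persist it across reboots.
echo 'net.ipv4.tcp_keepalive_time = 120' | sudo tee -a /etc/sysctl.conf
```

Note that inside a container this must be set on the host (or via the container runtime), since sysctls are namespaced.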

"stack_trace":"java.util.concurrent.ExecutionException: com.mongodb.MongoSocketReadException: Exception receiving message
    at java.base/java.util.concurrent.CompletableFuture.reportGet(Unknown Source)
    at java.base/java.util.concurrent.CompletableFuture.get(Unknown Source)
    at com.creativeradicals.openio.pipeline.persist.feed.FeedMultiSaver.accept(FeedMultiSaver.java:107)
    at com.creativeradicals.openio.pipeline.persist.feed.FeedMultiSaver.accept(FeedMultiSaver.java:34)
    at com.creativeradicals.openio.rabbit.base.RequestConsumer.handleDelivery(RequestConsumer.java:42)
    at com.rabbitmq.client.impl.ConsumerDispatcher$5.run(ConsumerDispatcher.java:149)
    at com.rabbitmq.client.impl.ConsumerWorkService$WorkPoolRunnable.run(ConsumerWorkService.java:104)
    at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.mongodb.MongoSocketReadException: Exception receiving message
    at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:569)
    at com.mongodb.internal.connection.InternalStreamConnection.access$1200(InternalStreamConnection.java:76)
    at com.mongodb.internal.connection.InternalStreamConnection$5.failed(InternalStreamConnection.java:520)
    at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.failed(AsynchronousChannelStream.java:235)
    at com.mongodb.internal.connection.AsynchronousChannelStream$BasicCompletionHandler.failed(AsynchronousChannelStream.java:203)
    at java.base/sun.nio.ch.Invoker.invokeUnchecked(Unknown Source)
    at java.base/sun.nio.ch.Invoker$2.run(Unknown Source)
    at java.base/sun.nio.ch.AsynchronousChannelGroupImpl$1.run(Unknown Source)
    ... 3 common frames omitted
Caused by: java.io.IOException: Connection reset
    at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finishRead(Unknown Source)
    at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.finish(Unknown Source)
    at java.base/sun.nio.ch.UnixAsynchronousSocketChannelImpl.onEvent(Unknown Source)
    at java.base/sun.nio.ch.EPollPort$EventHandlerTask.run(Unknown Source)
    ... 1 common frames omitted"}

See the discussion containing this suggestion and perhaps that will solve your problem…

Our application runs in Docker, and it looks like the version of Java we are on (OpenJDK Runtime Environment AdoptOpenJDK, build 14.0.2+12) has patched the bug mentioned in that thread. Is there any other way to debug these constant connection reset errors and socket exceptions?

Hmm, I don’t know an easy way … maybe we can ask @Jeffrey_Yemin

There’s no straightforward way to determine the root cause of connection reset errors. It’s not typically a driver bug; rather, it’s something happening either in the MongoDB server or in the network between the driver and the server. I would look first at the MongoDB server logs for clues, since it’s possible the server itself is closing the connection for some reason. If not, you’ll need to involve an expert in network administration, perhaps to employ a tool like Wireshark to figure out what’s happening, assuming you can reproduce the error.
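
If running a full Wireshark session on the servers isn’t practical, a capture restricted to RST and FIN segments on the MongoDB port can be taken with tcpdump and opened in Wireshark afterwards. A sketch, assuming interface `eth0` and the default port 27017 (adjust both for your deployment):

```shell
# Capture only TCP segments with the RST or FIN flag set on port 27017,
# writing them to a pcap file for later analysis in Wireshark.
# The interface name "eth0" is an assumption.
sudo tcpdump -i eth0 -w resets.pcap \
  'port 27017 and tcp[tcpflags] & (tcp-rst|tcp-fin) != 0'
```

The pcap will show which side sent the RST, which is the key question: a reset originating from the server points at mongod/mongos or its host, while one injected mid-path points at the network.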

One other thought: if you’re able to test outside of Docker, that would be one way to rule out Docker itself as a contributing factor.

2 Likes

Yes, simplification is a useful debugging tool.

Our cluster is running on ARM64 CentOS machines in AWS. I see that the compatibility specs don’t list CentOS under ARM64. Is it worth trying to switch the underlying Operating System to Ubuntu?
https://docs.mongodb.com/manual/administration/production-notes/

1 Like

Seeing an abundance of these Client Disconnect errors reported by our routers:

message: {"t":{"$date":"2021-01-19T17:47:02.856+00:00"},"s":"D1","c":"SHARDING","id":22772,"ctx":"conn1749","msg":"Exception thrown while processing command","attr":{"db":"admin","headerId":-697198149,"error":"ClientDisconnect: operation was interrupted"}}

(reported by host mongodb-router-5, program mongos[1606], collected via rsyslog)
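
These structured mongos log lines can be tallied per connection context to see whether the disconnects cluster on particular clients. A minimal sketch, assuming the collected log lives at a path supplied via `LOG` (the default path below is an assumption):

```shell
# Tally ClientDisconnect errors per connection context ("ctx") in a
# mongos structured log. Set LOG to the log file you want to inspect.
LOG="${LOG:-/var/log/mongodb/mongos.log}"
grep 'ClientDisconnect' "$LOG" \
  | grep -o '"ctx":"conn[0-9]*"' \
  | sort | uniq -c | sort -rn | head
```

If a handful of connections dominate the tally, that points at specific client hosts; an even spread points more toward the network or the cluster itself.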

Any more ideas? These client disconnects are happening frequently. As I stated, we’ve followed the administration guide to a tee, and this is very disruptive to our application.

It does not sound like a MongoDB problem, but rather a network, hardware, or hypervisor problem with connectivity. In 2017 I saw something like this, and it turned out to be network connectivity between two subnets in separate wings of a factory installation. Have your network people looked at this?