Java driver connection lifecycle is taking 5-10 seconds to complete

Here is the connection lifecycle log:

2022-05-16T03:24:49.382Z : Connection check out started for server
2022-05-16T03:24:49.382Z : Connection connectionId{localValue:3902} checked out
2022-05-16T03:24:49.382Z : Sending command '{"find":.......' on connection [connectionId{localValue:3902}]
2022-05-16T03:24:49.384Z : Execution of command with request id <id> completed successfully in 1.75 ms on connection [connectionId{localValue:3902}] 

2022-05-16T03:24:54.392Z : Received batch of 1 documents with cursorId 0 from server
2022-05-16T03:24:54.392Z : Checked in connection [connectionId{localValue:3902}] to server

Command execution itself completed in 1.75 ms and returned a single document, so what is happening after command execution and before connection check-in that takes 5 s and sometimes 10 s? There is no getMore command executed; the first batch itself has the result.

driver = Java sync driver 4.6.0
server = 4.0.5
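
For reference, the same check-out/command/check-in events can be timestamped directly in application code with the 4.x driver's CommandListener and ConnectionPoolListener; the sketch below is illustrative only (the class name and wiring are placeholders, not our actual code):

import java.util.concurrent.TimeUnit;

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;
import com.mongodb.event.CommandFailedEvent;
import com.mongodb.event.CommandListener;
import com.mongodb.event.CommandStartedEvent;
import com.mongodb.event.CommandSucceededEvent;
import com.mongodb.event.ConnectionCheckedInEvent;
import com.mongodb.event.ConnectionPoolListener;

// Logs a timestamp when a command completes and when its connection is checked
// back in, so the gap between the two can be measured independently of the
// driver's own log output.
public class LifecycleTimingListener implements CommandListener, ConnectionPoolListener {

    @Override
    public void commandStarted(CommandStartedEvent event) {
        // not needed for this measurement
    }

    @Override
    public void commandSucceeded(CommandSucceededEvent event) {
        System.out.printf("%d ns: command %s succeeded in %d ms%n",
                System.nanoTime(), event.getCommandName(),
                event.getElapsedTime(TimeUnit.MILLISECONDS));
    }

    @Override
    public void commandFailed(CommandFailedEvent event) {
        // not needed for this measurement
    }

    // ConnectionPoolListener methods are default no-ops in the 4.x driver,
    // so only the check-in event is overridden here.
    @Override
    public void connectionCheckedIn(ConnectionCheckedInEvent event) {
        System.out.printf("%d ns: connection %s checked in%n",
                System.nanoTime(), event.getConnectionId());
    }

    // Wiring the listener into the client settings (URI is a placeholder).
    public static MongoClient buildClient(String uri) {
        LifecycleTimingListener listener = new LifecycleTimingListener();
        MongoClientSettings settings = MongoClientSettings.builder()
                .applyConnectionString(new ConnectionString(uri))
                .addCommandListener(listener)
                .applyToConnectionPoolSettings(b -> b.addConnectionPoolListener(listener))
                .build();
        return MongoClients.create(settings);
    }
}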

Hi @Pradeepp_Sahnii, I don’t have a hypothesis yet that could explain this, so allow me to ask a few questions:

  1. What sort of deployment is it (standalone, replica set, sharded, serverless)?
  2. Can you reproduce this consistently or does it only happen inconsistently or rarely?
  3. Can you provide a reproducer (i.e. sample code that reproduces the problem, even if it doesn’t reproduce always)?
  4. Is there any reason to suspect something in the JVM (e.g. a GC pause, processor contention, etc)?

Thanks,
Jeff

@Jeffrey_Yemin

  1. This is a sharded cluster, and we are connecting to mongos (3 instances).
  2. It happens on a daily basis but randomly; there is no pattern at all.
  3. We get this issue on any command execution, e.g. find, delete, create, etc.
For example, this is what we use for a find by ID; sometimes we see the issue with this command and sometimes with other commands:

import org.bson.BsonDocument;
import org.bson.BsonString;
import com.mongodb.client.FindIterable;

// Build an equality filter on two fields and return the first matching document.
BsonDocument filter = new BsonDocument()
    .append("field1", new BsonString("value1"))
    .append("field2", new BsonString("value2"));

FindIterable<CustomPojo> findIterable = myCollection.find(filter);
return findIterable.first();

  4. I was looking at CPU usage, heap usage, and direct buffer usage. Everything looks normal: no spikes, no high utilization. GC pauses were also normal.

I was looking at the flow of the find command execution. It looks like, after the command execution event, the driver releases the response buffers to the PowerOfTwoBufferPool, and there it acquires a lock/permits. Is it possible that releasing the buffer is causing these intermittent pauses?
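
(To rule out time spent outside the driver, we could also wrap the call with a plain timer; a trivial sketch, with the threshold chosen arbitrarily:)

long startNanos = System.nanoTime();
CustomPojo result = myCollection.find(filter).first();
long elapsedMs = (System.nanoTime() - startNanos) / 1_000_000;
if (elapsedMs > 1_000) {
    // the command itself reported ~1.75 ms, so anything in the seconds range
    // points at time spent inside the driver call after the server reply
    System.err.println("find() took " + elapsedMs + " ms");
}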

Please share a little bit more code.

How is your find called?

How is your connection established?

How is myCollection initialized?

From your lifecycle log, it looks like the command is sent right away when the connection is established. What is happening before that?

Is there anything happening after that?

Could you explain a little bit more about the lifecycle? How is the process that does the find started? What fires the find command?

@Pradeepp_Sahnii can you give me a sense of your Java driver upgrade history? Was this application running on previous driver releases, and if so, after what upgrade did you first start seeing this issue? I’m trying to narrow down whether this is a regression introduced by a recent (or not so recent) release of the driver.

Thanks,
Jeff

@Jeffrey_Yemin A few more details:

  1. Earlier we were using 3.11.0; we have now upgraded to 4.6.0. However, we were seeing this issue in 3.11.0 as well.

  2. We are seeing this issue only for app servers deployed in Azure; it works fine on app servers in AWS.

  3. Connection pool settings (see the MongoClientSettings sketch below):

    1. min-size=50
    2. max-size=100
    3. connection-idle-time=60s
    4. connection-life-time=0
    5. maintenance-frequency=60s

The thread pool that executes the MongoDB commands has a maximum size of 150 and a core size of 100.
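
For reference, those pool settings map roughly onto the following MongoClientSettings configuration (a sketch; host names are placeholders, not our actual connection string):

import java.util.concurrent.TimeUnit;

import com.mongodb.ConnectionString;
import com.mongodb.MongoClientSettings;

MongoClientSettings settings = MongoClientSettings.builder()
        .applyConnectionString(new ConnectionString("mongodb://mongos1,mongos2,mongos3"))
        .applyToConnectionPoolSettings(b -> b
                .minSize(50)
                .maxSize(100)
                .maxConnectionIdleTime(60, TimeUnit.SECONDS)
                .maxConnectionLifeTime(0, TimeUnit.MILLISECONDS)   // 0 = no limit
                .maintenanceFrequency(60, TimeUnit.SECONDS))
        .build();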

I don’t currently have a hypothesis for a driver issue that could explain what you’re observing. The 3.11 driver is widely used, and no one else has ever reported this sort of behavior. Given that, it seems likely that the issue is somewhat unique to your particular application and deployment.

Please let us know though if you uncover any additional evidence that would point towards a driver issue.

Regards,
Jeff