Unstable connection between GCP Cloud Run and MongoDB Atlas (2)

Hello,

This is a follow-up to my previous topic.

We are still having issues with the connection to MongoDB from our GCP Cloud Run service.

Stack:

  • GCP Cloud Run
  • Connection set up via VPC Network Peering (requests from Cloud Run to private IPs are routed through the VPC connector)
  • Node.js (v18) with mongodb driver (v4.10.0)
  • Connection string “mongodb+srv://user:password@cluster-pri.xxxxx.mongodb.net” (a minimal connection sketch follows this list)
  • MongoDB Atlas on version 6.0.3 (provider GCP), M30
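
For context, here is a minimal sketch of how this stack connects, using the SRV string above. The credentials, cluster name and database name are placeholders, not our real values:

const { MongoClient } = require('mongodb');

// SRV connection string from the stack list above (credentials and cluster name are placeholders)
const uri = 'mongodb+srv://user:password@cluster-pri.xxxxx.mongodb.net';
const client = new MongoClient(uri);

async function run() {
  await client.connect();                     // opens the connection pool
  const db = client.db('myDatabase');         // placeholder database name
  console.log(await db.command({ ping: 1 })); // simple round trip to the cluster
}

run().catch(console.error);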

The GCP support team has verified that the configuration on their side is correct, so the issue seems to be related to MongoDB.

THE ISSUE
Multiple times per day, we get many errors on our server related to the MongoDB connection. Here are some examples:

VARIANT 1
"MongoNetworkTimeoutError: connection timed out
    at connectionFailureError (/app/node_modules/mongodb/lib/cmap/connect.js:389:20)
    at TLSSocket.<anonymous> (/app/node_modules/mongodb/lib/cmap/connect.js:310:22)
    at Object.onceWrapper (node:events:627:28)
    at TLSSocket.emit (node:events:513:28)
    at TLSSocket.emit (node:domain:489:12)
    at Socket._onTimeout (node:net:568:8)
    at listOnTimeout (node:internal/timers:564:17)
    at process.processTimers (node:internal/timers:507:7) {"

VARIANT 2
"MongoServerSelectionError: connection <monitor> to 192.168.248.2:27017 timed out
    at Timeout._onTimeout (/app/node_modules/mongodb/lib/sdam/topology.js:285:38)
    at listOnTimeout (node:internal/timers:564:17)
    at process.processTimers (node:internal/timers:507:7) {"

VARIANT 3
MongoServerSelectionError: connection 1 to 35.233.114.132:27017 closed
    at listOnTimeout (node:internal/timers:564)
    at process.processTimers (node:internal/timers:507)

VARIANT 4
MongoNetworkError: connection 96 to 192.168.248.3:27017 closed
    at TLSSocket.emit (node:events:513)
    at TLSSocket.emit (node:domain:489)
    at undefined (node:net:313)
    at TCP.done (node:_tls_wrap:587)

VARIANT 5
PoolClearedError [MongoPoolClearedError]: Connection pool for production-shard-00-01-pri.xxxxx.mongodb.net:27017 was cleared because another operation failed with: "connection <monitor> to 192.168.248.3:27017 timed out"
    at Server.emit (events.js:400)
    at Server.emit (domain.js:475)
    at Monitor.emit (events.js:400)

VARIANT 6
MongoPoolClearedError: Connection pool for production-shard-00-01-pri.xxxxx.mongodb.net:27017 was cleared because another operation failed with: "connection <monitor> to 192.168.248.3:27017 timed out"
    at Server.emit (node:events:513)
    at Server.emit (node:domain:489)
    at Monitor.emit (node:events:513)

VARIANT 7
PoolClearedOnNetworkError: Connection to production-shard-00-02-pri.xxxxx.mongodb.net:27017 interrupted due to server monitor timeout

The issues happen on our QA and development environments too, but much less frequently, likely due to their much lower usage.

In short
The connection between Cloud Run and MongoDB is unstable: connections are closed, time out, or the connection pool is cleared.

The IP address of the GCP VPC network (subnet) is whitelisted on the MongoDB side.

What we tried

  • Using both the node:18-alpine and node:18 Docker images
  • Using Cloud NAT to get a fixed IP and go over the public internet: this was even more unstable (see original post)
  • On our DEV environment, I lowered the minimum TLS protocol version from “TLS 1.2 and above” to “TLS 1.0 and above”. It was unclear whether this improved the situation. I have not tried this on our production environment yet because MongoDB strongly advises using TLS 1.1+ or 1.2+.

Any help or ideas would be appreciated!

Thanks!


Hi @Laurens,

My name is Alex and I’m a Product Manager on the Developer Experience team at MongoDB. First off, apologies for the delay in responding. We take these matters very seriously as our goal is to ensure the best possible experience for developers working with our tools and interfaces.

We are continuing to work on improving our drivers to ensure they are as resilient as possible within serverless environments such as Google Cloud Run and AWS Lambda; however, on occasion some default values may need to be tuned.


Though it’s difficult to determine without a full analysis whether the issues are transient network issues, configuration issues, application/workload issues, or something else, one recommendation we can make is to try setting the maxIdleTimeMS connection string option to 60000 (1 minute).

Some users have reported less frequent connection timeout errors in GCP environments as a result. As each error variant you shared appears to stem from the common root cause of a connection timing out, please test with maxIdleTimeMS=60000 and let us know if this reduces the frequency of the errors you’re experiencing.
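
As a minimal sketch, the option can be applied either in the connection string or as a MongoClient option (the URI below reuses the placeholder cluster name from the original post):

const { MongoClient } = require('mongodb');

// Option A: set maxIdleTimeMS directly in the connection string
const clientA = new MongoClient(
  'mongodb+srv://user:password@cluster-pri.xxxxx.mongodb.net/?maxIdleTimeMS=60000'
);

// Option B: pass it as a MongoClient option instead
const clientB = new MongoClient(
  'mongodb+srv://user:password@cluster-pri.xxxxx.mongodb.net',
  { maxIdleTimeMS: 60000 } // close pooled connections that have been idle for more than 60s
);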


Hi Alex,

Thank you for your response.

I will try setting maxIdleTimeMS=60000 and see what happens.
In the meantime, I’ve also discussed this with someone via Support Chat and decided to open a Support case for it as well. I will mention there that this fix is being tried out.

Of course, I will share my findings here.

An update from the test with { maxIdleTimeMS: 60000 }:
The issue has not occurred for a couple of days now! This strongly suggests the solution works. I will mark this thread as solved and revisit it if the errors pop up again.

@alexbevi I am experiencing a similar issue; is there a more permanent solution than maxIdleTimeMS=60000?


Hi @Brad_Beighton. For the moment, if you’re experiencing this issue, adjusting maxIdleTimeMS should reduce how often it occurs. We are continuing to work on improving the developer experience with our drivers in environments such as GCP Cloud Run; however, we do not yet have a public timeline we can share for what a more permanent solution would be or when it would be available.

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.