GKE Autopilot + Atlas VPC Peering: Intermittent DNS Resolution Failures Despite Successful Java Driver & Shell Connection

We are running our application in a Google Cloud GKE Autopilot cluster with VPC peering configured to a MongoDB Atlas (M10+) cluster. The connection generally works (e.g., through the Java driver and mongosh), but we are facing intermittent DNS resolution issues inside the app pods.

:white_check_mark: Setup Details:

  1. Method of connection: Java Driver
  2. MongoDB Java Driver Version: 4.11.2
  3. Connection String Used:
mongodb+srv://user_name:password@mongodb-7-0-pri.xxxxx.mongodb.net/database_name?retryWrites=true

We’ve also tried with appended parameters:

&authSource=admin&tls=true&appName=mongosh+2.2.6

:firecracker: Problem:

Even though the app is able to connect to MongoDB Atlas via the Java driver and mongosh, it occasionally logs exceptions like:

com.mongodb.MongoSocketException: Unknown host mongodb-7-0-shard-00-01-pri.xxxxx.mongodb.net
com.mongodb.MongoServerSelectionException: Server selection timed out after 30000 ms
com.mongodb.MongoSocketReadException: Prematurely reached end of stream

	Caused by: org.springframework.beans.BeanInstantiationException: Could not instantiate bean class [com.ts.services.app.InitializeDBs]: Constructor threw exception; nested exception is com.mongodb.MongoTimeoutException: Timed out after 30000 ms while waiting for a server that matches ReadPreferenceServerSelector{readPreference=primary}. Client view of cluster state is {type=UNKNOWN, servers=[{address=mongodb-7-0-pri.xxxxx.mongodb.net:27017, type=UNKNOWN, state=CONNECTING, exception={com.mongodb.MongoSocketException: mongodb-7-0-pri.xxxxx.mongodb.net}, caused by {java.net.UnknownHostException: mongodb-7-0-pri.xxxxx.mongodb.net}}]

The java.net.UnknownHostException entries suggest these failures stem from DNS resolution rather than blocked traffic, even though the app is deployed inside a properly VPC-peered GKE environment.
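To check whether the lookups themselves fail intermittently, independent of the driver, a small probe like the following could be run inside an app pod. This is only a minimal sketch: the class name `DnsProbe`, the attempt count, and the pause are hypothetical choices, and the host to test (for example the shard host from the log above) is passed as an argument.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.security.Security;

/** Hypothetical helper: resolves one host name repeatedly and counts failed lookups. */
final class DnsProbe {

    /** Returns how many of the given attempts ended in an UnknownHostException. */
    static int countFailures(String host, int attempts, long pauseMillis) {
        int failures = 0;
        for (int i = 0; i < attempts; i++) {
            try {
                InetAddress[] addrs = InetAddress.getAllByName(host);
                System.out.println(host + " -> " + addrs[0].getHostAddress());
            } catch (UnknownHostException e) {
                failures++;
                System.out.println(host + " -> lookup failed: " + e.getMessage());
            }
            if (pauseMillis > 0) {
                try {
                    Thread.sleep(pauseMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // keep the interrupt flag and stop early
                    break;
                }
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        // Assumption: turning off the JVM's own lookup caches so each attempt
        // goes back to the resolver instead of returning a cached answer.
        Security.setProperty("networkaddress.cache.ttl", "0");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");

        String host = args.length > 0 ? args[0] : "localhost";
        int attempts = 20;
        int failures = countFailures(host, attempts, 500);
        System.out.println(failures + " of " + attempts + " lookups failed for " + host);
    }
}
```

If this reports occasional failures for the `-pri` host names while the pod is otherwise healthy, that would point at the cluster DNS path rather than at the driver.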


:globe_with_meridians: Network & DNS Info:

**GKE Pod IP:** 10.xx.131.xx
**GKE Cluster IP Range:** 10.xx.xxx.0/17
**Service Type:** LoadBalancer (10.xx.0.207)
**Seccomp Profile:** RuntimeDefault
**Atlas Access List:** includes full cluster CIDR range
**Peering:** Confirmed between GCP VPC and Atlas VPC
**DNSPolicy:** Using GKE Autopilot default (ClusterFirst)


:question: Questions:

  1. Are there any known issues with mongodb+srv SRV record resolution inside GKE Autopilot?
  2. Would switching to a non-SRV URI with replica set members help improve reliability? (If yes, could you confirm the best format?)
  3. Are there any timeout, DNS caching, or JVM-level settings we should tweak for the driver?
  4. What would be the recommended DNS configurations (e.g., dnsPolicy, dnsConfig) in GKE Autopilot to make SRV lookups reliable?
  5. Any troubleshooting steps or best practices for Atlas + GKE peered DNS issues?
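To make questions 2 and 3 concrete, this is roughly what we have in mind. It is a sketch we have not validated, not a confirmed fix. The helper names are hypothetical, the replica set name must come from the Atlas UI, and any host names other than the `shard-00-01-pri` one in our logs are assumptions by analogy. Credentials are left out here and would go before the host list as in the SRV URI.

```java
import java.security.Security;
import java.util.List;

/** Hypothetical sketch of a non-SRV connection string and JVM DNS cache settings. */
final class AtlasConnectionSketch {

    /**
     * Builds a standard (seed list) URI so the driver does not need SRV/TXT lookups.
     * Each host entry is expected as "hostname:port".
     */
    static String seedListUri(List<String> hosts, String database, String replicaSet) {
        return "mongodb://" + String.join(",", hosts) + "/" + database
                + "?replicaSet=" + replicaSet
                + "&tls=true&authSource=admin&retryWrites=true";
    }

    /**
     * Adjusts the JVM-wide InetAddress caches. The values are illustrative:
     * cache successful lookups for 30 seconds, do not cache failed lookups.
     * This should run before the first lookup, early in application startup.
     */
    static void shortenJvmDnsCaching() {
        Security.setProperty("networkaddress.cache.ttl", "30");
        Security.setProperty("networkaddress.cache.negative.ttl", "0");
    }

    public static void main(String[] args) {
        shortenJvmDnsCaching();
        // Placeholder hosts and replica set name; substitute the real values from Atlas.
        String uri = seedListUri(
                List.of("host-a.example:27017", "host-b.example:27017", "host-c.example:27017"),
                "database_name",
                "REPLICA_SET_NAME");
        System.out.println(uri);
    }
}
```

Not caching failed lookups means a single transient resolver error should not be replayed from the JVM cache, but it also means more queries reach the cluster DNS, so we are unsure whether this is advisable.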

Thanks in advance! We’d appreciate any insight into making this connection more stable and eliminating these DNS-related exceptions.