Connection(<project-id>.mongodb.net:27017) unable to write wire message to network: write tcp <some-ip> -> <mongo atlas ip>: write: broken pipe

We are frequently experiencing broken pipe errors with our Serverless MongoDB Atlas instance. We have a Go application hosted on GCP (a Cloud Run container) that connects to the MongoDB instance. It works fine in one environment without any issues, but in another environment we get a few broken pipe errors a day (maybe 0.1 to 1% of requests). Usually a retry directly afterwards succeeds.

Any idea what the reason might be? I don't think there is a way to access the logs of an Atlas serverless instance, right? I'm wondering whether we need to set specific connection parameters to prevent these issues.


Hey Jan! How is it going?

I’m having the same issue :open_mouth:

I don't have a solution for this apart from changing the database type or wrapping every database query in try/catch blocks (sadly). However, I wanted to share my thoughts in case somebody else faces the same issue.

In my case, our team runs a Node.js container with the Prisma.io ORM handling the database work, also on a GCP Cloud Run instance.

The issue arises very consistently when multiple queries hit the serverless database instance: around three consecutive or parallel read operations trigger it fairly easily, causing the same broken pipe error you found.

My hypothesis is that the issue lies in how serverless instances handle connections and the resulting scaling/provisioning whenever a query hits the database. Looking at the charts MongoDB Atlas provides, I found that the error coincides with a connection drop on the database side (see the attached snapshot).

This connection, of course, would have been closed by the db itself. Open thoughts:

1.- GCP Cloud Run runs containers, which means connection pools are created and used/reused for as long as the container is active and running with a MongoClient. Since I don't know what specifically triggers a MongoDB serverless instantiation, I wonder whether those reused connections could be causing this if, for example, MongoDB serverless expects 1 query = 1 connection.

2.- In my case, 3 queries are fired, and the third one is most often the one that gets the broken pipe error. I wonder if queries 1 and 2 get their own connections while the third reuses, say, connection 1, and that causes MongoDB serverless to close the connection (e.g. 1 connection = 1 response, then close).

3.- I haven't found documentation on any of this, nor options to configure the scaling behaviour of MongoDB serverless instances. I've read that a minimum number of connections can be set on the client side, per this doc: https://www.mongodb.com/docs/manual/reference/connection-string/#mongodb-urioption-urioption.minPoolSize

4.- However, that's a client-side option, and I don't know whether it could affect MongoDB serverless scaling behaviour at all. One way I think it could: if the client's pool drops to 0 connections for X ms while opening a new one, that might signal the serverless instance to scale down, dropping the connections and causing the error.

It'd be nice to have someone from MongoDB shed some light on these issues. I know serverless instances are still in preview, so we're aware that things like this can happen.

In the meantime, to stay on the safe side, we've moved prod to a shared cluster, and it's working like a charm.

Hope this helps get the discussion moving :raised_hands:

MongoDB Go Driver engineer here:
It’s expected that the Atlas Serverless infrastructure will sometimes close in-use network sockets. We typically expect that the retryable reads and retryable writes behaviors of MongoDB drivers should allow any in-progress operations to be retried on another connection automatically.

@Jan-Gerrit_Harms or @Ian_Sebastian, do you know if retryable reads and writes are enabled in your MongoDB driver configurations? Note that retryable reads/writes are enabled by default in all recent versions of the MongoDB Go Driver (and should be in all official MongoDB drivers).

Thanks for your replies, both of you!

@Matt_Dale As far as I know, we don't explicitly disable retryable reads and writes. This is how we establish a connection:

```go
// Opt in to Stable API version 1.
serverAPIOptions := options.ServerAPI(options.ServerAPIVersion1)
// No SetRetryReads/SetRetryWrites calls: driver defaults (or URI options) apply.
clientOptions := options.Client().
	ApplyURI(viper.GetString(config.MongoDBUrl)).
	SetServerAPIOptions(serverAPIOptions).
	SetMinPoolSize(viper.GetUint64(config.MongoDBMinPoolSize))

// Connect starts the client; Ping verifies the deployment is reachable.
client, err := mongo.Connect(context.Background(), clientOptions)
if err != nil {
	log.Fatal().Err(err).Msg("unable to connect to mongodb")
}
err = client.Ping(context.Background(), nil)
if err != nil {
	log.Fatal().Err(err).Msg("not connected to mongo")
}
```

We even enable it explicitly for writes: our connection string includes the parameters retryWrites=true&w=majority.

Maybe we should also explicitly enable retryReads, although if I understand you correctly, it should be enabled by default.
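Explicitly enabling retryReads is a small change to the connection string. As a minimal sketch, assuming a single-host `mongodb+srv://` URI like the one Atlas provides (multi-host `mongodb://` seed lists may not parse cleanly with `net/url`), the hypothetical helper `withRetryDefaults` below adds `retryReads=true` and `retryWrites=true` only when they are missing, so an explicit `false` still wins. The host and credentials in the example are placeholders.

```go
package main

import (
	"fmt"
	"net/url"
)

// withRetryDefaults is a hypothetical helper: it returns uri with
// retryReads=true and retryWrites=true added when those options are absent.
// Values already present are left untouched, so an explicit "false" is kept.
// Note: keys are matched case-sensitively here, although MongoDB treats
// connection string option names case-insensitively.
func withRetryDefaults(uri string) (string, error) {
	u, err := url.Parse(uri)
	if err != nil {
		return "", err
	}
	q := u.Query()
	for _, key := range []string{"retryReads", "retryWrites"} {
		if !q.Has(key) {
			q.Set(key, "true")
		}
	}
	// Encode sorts the query parameters by key.
	u.RawQuery = q.Encode()
	return u.String(), nil
}

func main() {
	uri, err := withRetryDefaults("mongodb+srv://user:pass@cluster0.example.mongodb.net/?retryWrites=true&w=majority")
	if err != nil {
		panic(err)
	}
	// prints "mongodb+srv://user:pass@cluster0.example.mongodb.net/?retryReads=true&retryWrites=true&w=majority"
	fmt.Println(uri)
}
```

Since the Go driver already defaults both options to true, this mainly documents intent in the configuration; it matters more if the same URI is reused by tools or older driver versions with different defaults.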

@Jan-Gerrit_Harms @Ian_Sebastian

I would also advise bringing this up with the Atlas chat support team. They may be able to check whether anything else on the Atlas side could have caused this broken pipe message. If you do open a chat, please provide them with the following:

  1. Cluster link / name which experienced the issue
  2. Time & date including timezone for when it occurred
  3. Exact error message output
  4. Driver language and version

Regards,
Jason

Hey @Jason_Tran @Matt_Dale !

First of all, thanks for your answers :smiley:. I haven't had time to come back to this lately, so it has been a while.

Coming back to this: in my case we use the prisma.io ORM to handle the database connections, and it uses the latest native Rust MongoDB driver under the hood.

As per this thread, it seems that Prisma runs everything as transactions on aggregation pipelines, which could explain the lack of retries (and the writeConflicts we've been seeing around it).

One question I'd like to ask is whether retry behavior indeed changes when using transactions. Judging from the documentation shared by @Matt_Dale, it seems so. Is that correct? And does the Rust driver differ in some way that could affect retry behavior?

Thanks a lot!