Gracefully reconnecting with the Go driver

This is a general question regarding client software written using any of the MongoDB drivers. What is the best practice for handling a disconnection between client and server (due to network issues, a downed server, etc.)? I’m using the mongo-go-driver, and I’m wondering which methods/errors I should be using/checking to ensure that my client applications aren’t stuck waiting on never-completing queries, or left unusable until I restart them (to, hopefully, reconnect to the MongoDB instance/replica set/master).

Hi John,

The driver monitors the deployment using one goroutine per node in the cluster so generally, there should be nothing required on your end to reconnect in the event of a transient issue. For example, when connecting to a replica set, the driver will detect the new primary in the event of a failover or node restart. In addition, the driver will retry certain read and write operations by default in the event of a transient error.

The only thing I can think of checking for that would indicate an issue is a server selection timeout. By default, the driver will try to find a suitable server for an operation for 30 seconds. This is generally enough time for transient network issues to resolve themselves. If your application receives a server selection error, that could be indicative of a more serious issue or a signal that you need to set the server selection timeout to a higher value (e.g. if primary elections are consistently taking 45 seconds, try setting the timeout to 1 minute).
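If you do need a longer window, the timeout is configurable through the client options. A minimal configuration sketch, assuming driver 1.3.x (the URI and duration here are placeholders, not recommendations):

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	// Raise the server selection timeout from the 30s default to 1 minute,
	// e.g. for a deployment whose elections routinely take ~45 seconds.
	opts := options.Client().
		ApplyURI("mongodb://localhost:27017"). // placeholder URI
		SetServerSelectionTimeout(1 * time.Minute)

	client, err := mongo.Connect(context.Background(), opts)
	if err != nil {
		panic(err)
	}
	defer client.Disconnect(context.Background())
	// ... operations run here will wait up to 1 minute for a suitable server.
}
```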

The current driver version (1.3.0 at the time of writing) does not return a custom error type for server selection errors, but you can check for the “server selection error” substring. We have a project planned for this quarter to improve the error types returned by the driver, and I plan on adding a concrete type that users can check for as part of that.

– Divjot

Divjot, that was an amazing answer. Please see if you can get this added to a FAQ in the official MongoDB documentation. It explains that 1) I shouldn’t have to worry about this in the normal case, and 2) if my network is having issues, there are some hacky solutions (like raising the server selection timeout) for unusually long connection times. Thanks!

Where should I go to find documentation on the error types/values that I can check for a given version of the mongo-go-driver? In particular, I’m interested in an error value I can check to tell whether my CreateIndex call is being made against a collection that already has that index (ErrIndexExist or similar). Sorry that this is a derivative topic.

For others: A list of errors can be found here: mongo/error_codes.yml at master · mongodb/mongo · GitHub

Also, the error can be asserted as a mongo.CommandError: Issue of creating index with default name using Go driver - #2 by Divjot_Arora .
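To sketch what that check can look like: the codes below (85 = IndexOptionsConflict, 86 = IndexKeySpecsConflict) are the ones in the linked error_codes.yml that a conflicting createIndexes call returns. Note that re-creating an index with an identical definition is a server-side no-op and produces no error at all, so there is no ErrIndexExist-style value to check for in that case. The mongo.CommandError assertion shown in the comment follows the linked answer.

```go
package main

import "fmt"

// Server error codes from error_codes.yml that indicate an index with the
// same name already exists with a different definition.
const (
	indexOptionsConflict  = 85 // same keys, different options
	indexKeySpecsConflict = 86 // same name, different key specification
)

// isIndexConflictCode reports whether a command error code means the index
// already exists in a conflicting form.
func isIndexConflictCode(code int32) bool {
	return code == indexOptionsConflict || code == indexKeySpecsConflict
}

func main() {
	// With the driver, you would first assert the concrete error type:
	//   if cmdErr, ok := err.(mongo.CommandError); ok && isIndexConflictCode(cmdErr.Code) {
	//       // index already exists with a different definition
	//   }
	fmt.Println(isIndexConflictCode(85)) // true
}
```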

Hi @Divjot_Arora,

To achieve: Implement reconnect logic.

Looking through several documents, I have seen two fields present in the MongoDB docs for Node.
I am unable to find the same configuration options in mongo-go-driver 1.3.1. And although you have mentioned that a retry mechanism is already present, I have not been able to make use of it.

As an alternative solution, I came up with the code below.

Testing scenario:

  1. Connect to MongoDB Atlas (free tier account)
var Client *mongo.Client
func ConnectDb() bool {
	duration := // some duration
	ctx, cancel := context.WithTimeout(context.Background(), duration)
	defer cancel()

	monitor := &event.PoolMonitor{
		Event: HandlePoolMonitor,
	}

	client, err := mongo.Connect(ctx,
		options.Client().
			ApplyURI(/*atlas connection uri*/).
			SetMinPoolSize(/*min pool size*/).
			SetMaxPoolSize(/*max pool size*/).
			SetHeartbeatInterval(/* some duration*/).
			SetPoolMonitor(monitor))
	if err != nil {
		return false
	}
	Client = client
	return Ping()
}

func reconnect() {
	for {
		// ConnectDb stores the new client in the package-level Client, so
		// there is no need to pass a *mongo.Client in (assigning to a
		// pointer parameter would not update the caller's copy anyway).
		if ConnectDb() {
			break
		}
		time.Sleep(time.Duration(configuration.AppConfig.MongoDb.ReconnectInterval) * time.Second)
	}
}

func HandlePoolMonitor(evt *event.PoolEvent) {
	switch evt.Type {
	case event.PoolClosedEvent:
		logging.AppLogger.Error("DB connection closed.")
		reconnect()
	}
}

func main() {
	ginRouter.GET("/health", func(c *gin.Context) {
		// Report the health check result instead of discarding it.
		if !Ping() {
			c.JSON(http.StatusServiceUnavailable, gin.H{
				"status": "Database is unreachable.",
			})
			return
		}
		c.JSON(http.StatusOK, gin.H{
			"status": "System is up and running.",
		})
	})
}
  2. Create a Ping function
func Ping() bool {
	if err := Client.Ping(context.TODO(), nil); err != nil {
		return false
	}
	return true
}
  3. Just before the actual call to the Ping method in step 2, pause execution with a debugger, call the client’s disconnect method (Client.Disconnect(nil)), and continue.

Efforts made to resolve:
I tried using the pool monitor event (PoolClosedEvent) to trigger a reconnect.

Problem:
Though this code works, I believe there is a better way provided by you for the same, considering the Node.js docs.

Thanks & Regards
Ankush Goyal

Hi @Ankush_Goyal,

Can you answer these questions so I can get a better understanding of what you’re trying to do?

  1. What does “Implement reconnect logic” mean in an actual application? In your example, you manually call Client.Disconnect but I assume a production application wouldn’t disconnect a Client that’s still being used. Also, you’re checking for PoolClosed events, but those only happen when the Client is being disconnected or if someone tries to check out a connection using a previously disconnected Client.

  2. You mentioned you’ve seen two fields present in the Node driver documentation and can’t find the analogous fields for Go. What are these fields? Maybe we can help you figure out how to do the same in Go.

Also, as I mentioned in a previous reply, the driver will automatically handle things like elections and server restarts. Obviously, some operations like writes and reads against the primary may fail if the primary node is unavailable for an extended amount of time, but the driver will find the node again when it comes back up.

One final note: I recommend upgrading driver versions to 1.3.2 or even better to 1.3.3, which was released two days ago. In 1.3.2, we fixed a regression that was introduced in 1.3.0 and caused a deadlock if a connection encountered a network error during establishment. You can find a list of released versions on Releases · mongodb/mongo-go-driver · GitHub.

Hi @Divjot_Arora,

For point 1,

  1. Yes, I have implemented the reconnect logic, similar to what is in the code above.
  2. The reason I was calling Client.Disconnect was to test reconnection with the server, similar to what can happen when the DB is down. Is there an alternative way I can test whether reconnection is working with Atlas or a local setup? I don’t have any feature to pause the cluster in Atlas, so I was forced to simulate an outage by calling Client.Disconnect.

For point 2,

  1. There are actually three fields ( autoReconnect, reconnectTries, reconnectInterval).

Thanks for the heads-up on the new version release. I will make sure to check it out.

Thanks & Regards
Ankush Goyal

Client.Disconnect isn’t a great way to test this because it’s generally considered a cleanup/tear-down method that closes resources that are in use and publishes events like PoolClosed. It’s not really meant to be a transient state that you can “recover” from. If you want to test what happens when the server goes down, you can do this locally. I recommend looking into orchestration tools like mlaunch (http://blog.rueckstiess.com/mtools/mlaunch.html) to help you quickly start up replica sets locally.

  1. Shut down one or more nodes in the cluster. You can do this by connecting to the server directly and using db.shutdownServer() (https://docs.mongodb.com/manual/reference/method/db.shutdownServer/) to test the graceful shutdown procedure.
  2. Force an election to test how the driver handles failover. This requires bringing down the primary node, which you can do either via suggestion (1) or by running rs.stepDown() on the primary (https://docs.mongodb.com/manual/reference/method/rs.stepDown/).

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.