Hi John,
The driver monitors the deployment using one goroutine per node in the cluster so generally, there should be nothing required on your end to reconnect in the event of a transient issue. For example, when connecting to a replica set, the driver will detect the new primary in the event of a failover or node restart. In addition, the driver will retry certain read and write operations by default in the event of a transient error.
The only thing I can think of checking for that would indicate an issue is a server selection timeout. By default, the driver will try to find a suitable server for an operation for 30 seconds. This is generally enough time for transient network issues to resolve themselves. If your application receives a server selection error, that could be indicative of a more serious issue or a signal that you need to set the server selection timeout to a higher value (e.g. if primary elections are consistently taking 45 seconds, try setting the timeout to 1 minute).
The current driver version (1.3.0 at the time of writing) does not return a custom error type for server selection errors, but you can check for the “server selection error” substring. We have a project planned for this quarter to improve the error types returned by the driver, and I plan on adding a concrete type that users can check for as part of that.
– Divjot