Go driver: gracefully reconnecting when RS topology changes due to pod recreation

Hello,

I am using Monstache, which uses the Go MongoDB driver under the hood. The driver version is v1.10.3 and it connects to Mongo 4.4.

The issue I am facing is as follows:

  • I have Mongo deployed & reachable under host mongo.db.0
  • Its replica set is initialized & uses the IP address of Mongo (this is controlled by an external tool, which allows for autosizing the RS & adding/removing nodes; however, it is a single-node RS in this case)
  • Monstache connects to this Mongo using the Go driver. The connection string lists mongo.db.0 & uses the replicaSet flag.
  • The driver discovers the topology & uses the IP address to access the MongoDB node

All is good

  • However, then MongoDB is cycled. The new MongoDB comes up and is initialized to a new IP address (as far as I can tell, this is all within reason)
  • The Go driver still tries to access it using the previously discovered IP address, rather than rediscovering the topology by using the connection string again. It prints:
    server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: <OLD_IP_ADDR>:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp <OLD_IP_ADDR>:27017: i/o timeout }, ] }

My questions are:

  • Am I missing some configuration within the Go driver to allow graceful reconnecting after an RS topology change?
  • Is it the responsibility of the Go driver to handle this or in fact the user of the driver (in this case Monstache) to detect this error & attempt to reconnect again?

Thank you,
Max

I don’t know how your specific driver behaves, but something is probably wrong with how the driver-server communication is configured or used.

The error makes sense because the old IP is long gone and the driver is still trying to access it.

Maybe the driver is not periodically refreshing the mapped IP address? I don’t know.

Hello @Max_Dudzinski,

Welcome to the MongoDB Community forums :sparkles:

As @Kobe_W also mentioned, if you change the IP, the old address is gone and this error is to be expected.

To better understand the issue, can you please share the output of rs.conf() and rs.status()?

At first glance, I don’t think this is related to the Go driver, but rather the change in DNS in your environment.

As far as I know, the Go driver uses the default resolver from the net package. Also, as per the JIRA ticket, the Go driver does not cache DNS and instead relies on the OS and its resolvers.

So if the IP is stale, the DNS cache is the possible issue as it could be in the OS, network, etc.

Furthermore, if you are looking to add search on top of a dataset in MongoDB Atlas, I’d recommend using Atlas Search for better compatibility. It combines three systems (database, search engine, and sync mechanism) into one, which lets you deliver application search experiences much faster.

For more information, please visit the MongoDB Atlas Search documentation.

Best,
Kushagra

Hi @Kobe_W, Hello @Kushagra_Kesav,

Thank you for your replies.

Before

Go driver successfully connected, no errors. Host in connection string is: solidatus-db-0.db

RS status & config output before
rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2023-03-22T08:30:34.395Z"),
	"myState" : 1,
	"term" : NumberLong(2),
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"majorityVoteCount" : 1,
	"writeMajorityCount" : 1,
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1679473830, 7),
			"t" : NumberLong(2)
		},
		"lastCommittedWallTime" : ISODate("2023-03-22T08:30:30.674Z"),
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1679473830, 7),
			"t" : NumberLong(2)
		},
		"readConcernMajorityWallTime" : ISODate("2023-03-22T08:30:30.674Z"),
		"appliedOpTime" : {
			"ts" : Timestamp(1679473830, 7),
			"t" : NumberLong(2)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1679473830, 7),
			"t" : NumberLong(2)
		},
		"lastAppliedWallTime" : ISODate("2023-03-22T08:30:30.674Z"),
		"lastDurableWallTime" : ISODate("2023-03-22T08:30:30.674Z")
	},
	"lastStableRecoveryTimestamp" : Timestamp(1679473810, 1),
	"lastStableCheckpointTimestamp" : Timestamp(1679473810, 1),
	"electionCandidateMetrics" : {
		"lastElectionReason" : "electionTimeout",
		"lastElectionDate" : ISODate("2023-01-12T08:13:22.040Z"),
		"electionTerm" : NumberLong(2),
		"lastCommittedOpTimeAtElection" : {
			"ts" : Timestamp(0, 0),
			"t" : NumberLong(-1)
		},
		"lastSeenOpTimeAtElection" : {
			"ts" : Timestamp(1673511115, 1),
			"t" : NumberLong(1)
		},
		"numVotesNeeded" : 1,
		"priorityAtElection" : 1,
		"electionTimeoutMillis" : NumberLong(10000),
		"newTermStartDate" : ISODate("2023-01-12T08:13:22.042Z"),
		"wMajorityWriteAvailabilityDate" : ISODate("2023-01-12T08:13:22.094Z")
	},
	"members" : [
		{
			"_id" : 1,
			"name" : "10.1.95.24:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 5962708,
			"optime" : {
				"ts" : Timestamp(1679473830, 7),
				"t" : NumberLong(2)
			},
			"optimeDate" : ISODate("2023-03-22T08:30:30Z"),
			"syncingTo" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "",
			"electionTime" : Timestamp(1673511202, 1),
			"electionDate" : ISODate("2023-01-12T08:13:22Z"),
			"configVersion" : 182936,
			"self" : true,
			"lastHeartbeatMessage" : ""
		}
	],
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1679473830, 7),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1679473830, 7)
}
rs0:PRIMARY> rs.config()
{
	"_id" : "rs0",
	"version" : 182936,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 1,
			"host" : "10.1.95.24:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("6311ccaab8d114d352e0655e")
	}
}

Mongo reachable as:

$ date 
Wed Mar 22 08:29:50 UTC 2023
$ curl solidatus-db-0.db:27017
It looks like you are trying to access MongoDB over HTTP on the native driver port.

After

Then MongoDB restarts, comes up under a new IP address:

RS status & config output after
rs0:PRIMARY> rs.status()
{
	"set" : "rs0",
	"date" : ISODate("2023-03-22T08:34:24.906Z"),
	"myState" : 1,
	"term" : NumberLong(3),
	"syncingTo" : "",
	"syncSourceHost" : "",
	"syncSourceId" : -1,
	"heartbeatIntervalMillis" : NumberLong(2000),
	"majorityVoteCount" : 1,
	"writeMajorityCount" : 1,
	"optimes" : {
		"lastCommittedOpTime" : {
			"ts" : Timestamp(1679474063, 6),
			"t" : NumberLong(3)
		},
		"lastCommittedWallTime" : ISODate("2023-03-22T08:34:23.612Z"),
		"readConcernMajorityOpTime" : {
			"ts" : Timestamp(1679474063, 6),
			"t" : NumberLong(3)
		},
		"readConcernMajorityWallTime" : ISODate("2023-03-22T08:34:23.612Z"),
		"appliedOpTime" : {
			"ts" : Timestamp(1679474063, 6),
			"t" : NumberLong(3)
		},
		"durableOpTime" : {
			"ts" : Timestamp(1679474063, 6),
			"t" : NumberLong(3)
		},
		"lastAppliedWallTime" : ISODate("2023-03-22T08:34:23.612Z"),
		"lastDurableWallTime" : ISODate("2023-03-22T08:34:23.612Z")
	},
	"lastStableRecoveryTimestamp" : Timestamp(1679473965, 6),
	"lastStableCheckpointTimestamp" : Timestamp(1679473965, 6),
	"electionCandidateMetrics" : {
		"lastElectionReason" : "electionTimeout",
		"lastElectionDate" : ISODate("2023-03-22T08:34:14.203Z"),
		"electionTerm" : NumberLong(3),
		"lastCommittedOpTimeAtElection" : {
			"ts" : Timestamp(0, 0),
			"t" : NumberLong(-1)
		},
		"lastSeenOpTimeAtElection" : {
			"ts" : Timestamp(1679473965, 6),
			"t" : NumberLong(2)
		},
		"numVotesNeeded" : 1,
		"priorityAtElection" : 1,
		"electionTimeoutMillis" : NumberLong(10000),
		"newTermStartDate" : ISODate("2023-03-22T08:34:14.206Z"),
		"wMajorityWriteAvailabilityDate" : ISODate("2023-03-22T08:34:14.249Z")
	},
	"members" : [
		{
			"_id" : 0,
			"name" : "10.1.95.4:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 86,
			"optime" : {
				"ts" : Timestamp(1679474063, 6),
				"t" : NumberLong(3)
			},
			"optimeDate" : ISODate("2023-03-22T08:34:23Z"),
			"syncingTo" : "",
			"syncSourceHost" : "",
			"syncSourceId" : -1,
			"infoMessage" : "",
			"electionTime" : Timestamp(1679474054, 1),
			"electionDate" : ISODate("2023-03-22T08:34:14Z"),
			"configVersion" : 276040,
			"self" : true,
			"lastHeartbeatMessage" : ""
		}
	],
	"ok" : 1,
	"$clusterTime" : {
		"clusterTime" : Timestamp(1679474063, 6),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	},
	"operationTime" : Timestamp(1679474063, 6)
}
rs0:PRIMARY> rs.config()
{
	"_id" : "rs0",
	"version" : 276040,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 0,
			"host" : "10.1.95.4:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
				
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
			
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("6311ccaab8d114d352e0655e")
	}
}

Mongo still reachable under same host:

$ date
Wed Mar 22 08:35:02 UTC 2023
$ curl solidatus-db-0.db:27017
It looks like you are trying to access MongoDB over HTTP on the native driver port.

MongoDB Go driver reports

Error starting change stream. Will retry: server selection error: server selection timeout, current topology: { Type: ReplicaSetNoPrimary, Servers: [{ Addr: 10.1.95.24:27017, Type: Unknown, Last error: connection() error occured during connection handshake: dial tcp 10.1.95.24:27017: i/o timeout }, ] }

From what I understand, the driver itself has cached the topology with a single member at IP address 10.1.95.24, which has now become stale.
Shouldn’t the driver go back and use its provided connection string to re-discover the topology & the new IP of the member, i.e. 10.1.95.4?

Thank you for your time,
Max

Hi :wave: @Max_Dudzinski,

Thanks for sharing the details.

As per the shared information,

  • The hostname of the replica set is identified as "solidatus-db-0.db".
  • The output of the “rs.status()” command shows a member name of "10.1.95.4:27017", which is the IP address of the MongoDB instance.
  • So, when MongoDB restarts and comes up under a new IP address, it is essentially a new instance with a new setup, for example if the instance is running in a container or VM that is moved to a new host.
  • Now when the driver tries to connect to the MongoDB instance and is configured to use the DNS hostname instead of the IP address, the DNS record is not updated to reflect the new IP address, and the driver will not be able to connect to the new instance.

I believe the reason why the driver cannot reconnect to the new host is that the replica set was configured with the actual IP address instead of using a hostname as per the recommendations in the documentation.

This is because if you use IP addresses, any changes in the IP address of a member will require updating the configuration file of all other members, which can be time-consuming and error-prone. On the other hand, if you use a DNS hostname, the IP address of a member can change without requiring any configuration updates on other members.

For example, if you use hostnames like member1.example.com, member2.example.com, and member3.example.com to configure a MongoDB replica set, any changes in their IP addresses will be automatically resolved by DNS without needing to update the configuration file.

To resolve this - I will recommend you change the config of your replica set to use hostnames instead of IP addresses.

For detailed information please refer:

I hope it helps!

Best,
Kushagra

The driver not being able to recover when the cluster topology changes might be related to this issue, which was fixed in 1.10.4 onwards.

Can you try the latest 1.10.x driver and see if the issue still happens?

Hi @Mavericks2022,

Thanks for your comment. I’ve looked at the issue & resultant commit, and I believe the fix in the issue is strictly to do with SRV polling, which is not what I’m using.

I will try out the latest driver later on in the hope that it works.


Hi @Kushagra_Kesav,

Thanks for the follow up.

  • Now when the driver tries to connect to the MongoDB instance and is configured to use the DNS hostname instead of the IP address, the DNS record is not updated to reflect the new IP address, and the driver will not be able to connect to the new instance.

The DNS record within the environment is correctly up to date - you can see that from the second curl solidatus-db-0.db:27017 command I ran, after MongoDB has restarted & come up with a new IP address & the RS was reconfigured.

The problem is that the driver itself is caching & not updating the stale topology - if it simply re-connected to solidatus-db-0.db:27017 & rediscovered the updated topology with the updated IP addrs & connected to it instead, all would be fine.

To resolve this - I will recommend you change the config of your replica set to use hostnames instead of IP addresses.

I appreciate this is the recommended best practise - in my case however, the 3rd party tool responsible for maintaining RS in my changing environment unfortunately does not support the use of hostnames.

Further, RS reconfigurations should be expected to happen, for various reasons.
While the use of IP addresses in RS configs may not be the best, it is a valid configuration option.
I simply believe it is a pretty bad oversight from the MongoDB Go driver to not do something smarter, like re-try connecting to one of the nodes via the original connection string.


I have resolved my issue by manually detecting this type of connection error via log parsing & triggering a restart of the entire process that was using the MongoDB Go driver. This causes a fresh connect to MongoDB via the connection string & thus a fresh discovery of the updated topology.
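
A minimal sketch of what the detection step could look like in Go, assuming the error text keeps the shape shown earlier in this thread. The function name staleTopology and the substrings it matches are illustrative assumptions, not part of the driver or of Monstache, and the message format may differ between driver versions.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// addrPattern pulls "Addr: host:port" entries out of the topology description
// embedded in the server selection error text.
var addrPattern = regexp.MustCompile(`Addr: ([^,\s]+)`)

// staleTopology reports whether msg looks like a server selection timeout with
// no known primary, and returns the addresses the driver was still trying.
func staleTopology(msg string) (bool, []string) {
	if !strings.Contains(msg, "server selection timeout") ||
		!strings.Contains(msg, "Type: ReplicaSetNoPrimary") {
		return false, nil
	}
	var addrs []string
	for _, m := range addrPattern.FindAllStringSubmatch(msg, -1) {
		addrs = append(addrs, m[1])
	}
	return true, addrs
}

func main() {
	// Log line copied from the error reported above.
	line := "Error starting change stream. Will retry: server selection error: " +
		"server selection timeout, current topology: { Type: ReplicaSetNoPrimary, " +
		"Servers: [{ Addr: 10.1.95.24:27017, Type: Unknown, Last error: connection() " +
		"error occured during connection handshake: dial tcp 10.1.95.24:27017: i/o timeout }, ] }"

	stale, addrs := staleTopology(line)
	fmt.Println(stale, addrs) // prints: true [10.1.95.24:27017]
}
```

A supervisor could count consecutive matches and only restart after a threshold, so that a short network blip does not trigger a restart.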

@Max_Dudzinski thanks for all the information, this is a really interesting problem! There is actually a section from the MongoDB driver specification that describes the expected behavior of drivers under similar circumstances, and some rationale for those decisions:

Clients use the hostnames listed in the replica set config, not the seed list

An alternative proposal is for clients to continue using the hostnames in the seed list. It could add new hosts from the hello or legacy hello response, and where a host is known by two names, the client can deduplicate them using the “me” field and prefer the name in the seed list.

This proposal was rejected because it does not support key features of replica sets: failover and zero-downtime reconfiguration.

In our example, if “host1” and “host2” are not reachable from the client, the client continues to use “host_alias” only. If that server goes down or is removed by a replica set reconfig, the client is suddenly unable to reach the replica set at all: by allowing the client to use the alias, we have hidden the fact that the replica set’s failover feature will not work in a crisis or during a reconfig.

Basically, MongoDB drivers connect to the replica set nodes as described by the replica set (i.e. the information that rs.status() returns) because they depend on timely and accurate topology change info from the MongoDB replica set to support “failover and zero-downtime reconfiguration”.
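
A heavily reduced sketch of that rule, using the addresses from this thread. The function applyPrimaryHosts is made up for illustration and is not the Go driver's API; the real monitoring logic handles many more cases (secondaries, arbiters, set versions, and so on).

```go
package main

import "fmt"

// applyPrimaryHosts mimics what happens once a client hears from the primary:
// the set of tracked addresses becomes the member list the primary reports,
// and any tracked address not in that list (including seeds) is dropped.
func applyPrimaryHosts(tracked, reported []string) (next, dropped []string) {
	inReport := make(map[string]bool, len(reported))
	for _, h := range reported {
		inReport[h] = true
	}
	for _, h := range tracked {
		if !inReport[h] {
			dropped = append(dropped, h)
		}
	}
	next = append(next, reported...)
	return next, dropped
}

func main() {
	// The seed comes from the connection string; the reported member comes
	// from the replica set config, which uses an IP address.
	seeds := []string{"solidatus-db-0.db:27017"}
	next, dropped := applyPrimaryHosts(seeds, []string{"10.1.95.24:27017"})
	fmt.Println("now tracking:", next)
	fmt.Println("no longer tracking:", dropped)
}
```

After this step the client only knows 10.1.95.24:27017, which is why it keeps dialing that address once the pod has moved.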

When a driver has completely lost connection to a replica set, there are two possible circumstances:

  1. The replica set is still there, but there is some network interruption.
  2. All replica set nodes have moved to a totally new set of hosts and/or IPs in a short period of time and might be rediscoverable with the connection string.

Drivers could simultaneously attempt to connect to the last known MongoDB replica set and re-initialize using the connection string to see which succeeds first. However, that may not always be the best behavior for all use cases, so we have historically assumed case #1 (the more common case) and required users to implement their own recovery logic for case #2.

Arbiter nodes?

Another section from the specification seems to suggest that using arbiter nodes can help with the case where all replica set members are moved in a short period of time:

The client MUST monitor arbiters

… in the rare case that all data members are moved to new hosts in a short time, an arbiter may be the client’s last hope to find the new replica set configuration.

Do you have the option of running arbiter nodes that could help the Go driver keep track of the replica set node after it is moved? If not, it sounds like your solution to detect the error and reconnect is the correct solution.

Hi @Matt_Dale,

Thank you very much for the detailed answer. The links posted are extremely helpful, I unfortunately missed them during my Googling :slight_smile:

I’m not sure I agree with this part of the docs:

… by allowing the client to use the alias, we have hidden the fact that the replica set’s failover feature will not work in a crisis or during a reconfig.

Imo the reconfig is an internal change that should be invisible to users who only know host_alias - if mongo is still reachable under the alias, the driver should reconnect.

Anyway, thanks again for taking the time to answer. So far the manual reconnect seems to be working fine (I don’t really have/want arbiter nodes :))
