Migrating a Sharded Cluster


In the coming days I will be responsible for migrating an existing 4.0 sharded cluster to a new network. Unfortunately, at the time it was set up our internal DNS was not reliable enough to use hostnames instead of IP addresses. There were better workarounds even back then, but that ship has sailed and this is what you see (replacing the Xs with actual IP addresses) when you connect to the config db and run db.shards.findOne():

	"_id" : "shard1",
	"host" : "shard1/XXX.XXX.XXX.XXX:27017,XXX.XXX.XXX.XXX:27017,XXX.XXX.XXX.XXX:27017",
	"state" : 1

We can accept some downtime, but not the kind of downtime associated with dumping and restoring the data into a new cluster on the new network. There are good guides for updating replica set configs and I feel confident in my ability to stand up each shard (a three member replica set) on the new network with new IP bindings, but there is no equivalent sh.reconfig() that I can use to say shard1 is now reachable at these three IP addresses or hostnames.
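For a single replica set, the rewrite I'm describing looks roughly like the following (all IPs, hostnames, and ports here are made up for illustration). In mongosh the config object would come from rs.conf() and go back in via rs.reconfig():

```javascript
// Sketch of the per-replica-set host rewrite (hypothetical addresses).
// In mongosh: const cfg = rs.conf(); ...; rs.reconfig(cfg);
const cfg = {
  _id: "shard1",
  version: 3,
  members: [
    { _id: 0, host: "10.0.0.1:27017" },
    { _id: 1, host: "10.0.0.2:27017" },
    { _id: 2, host: "10.0.0.3:27017" },
  ],
};

// Old address => new hostname mapping for this shard's members.
const hostMap = {
  "10.0.0.1:27017": "s1r1.example.net:27017",
  "10.0.0.2:27017": "s1r2.example.net:27017",
  "10.0.0.3:27017": "s1r3.example.net:27017",
};

cfg.members = cfg.members.map((m) => ({ ...m, host: hostMap[m.host] ?? m.host }));
cfg.version += 1; // rs.reconfig() expects a bumped config version
```

That part is well documented for replica sets; it is the cluster-level equivalent that I can't find.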

Aside from the shards collection in the config db I can’t actually find another place where the hostnames for each shard are specified. Would it be as simple as updating these documents in the config db? Feels too good to be true and the lack of a sh.reconfig() probably means that this is way more complicated than changing a few strings.

Thank you for any help or insights you might be able to provide for how to most efficiently do this while preserving the existing state of the cluster at the time that we shut it down for migration.



TL;DR - It might actually be as simple as db.shards.updateOne({_id: 'shard1'}, {$set: {host: 'shard1/s1r1:27017,s1r2:27017,s1r3:27017'}}). Still can't find anything in the docs to support this and I certainly wouldn't recommend doing it live (especially with the balancer running).

Hate to respond to my own question, but I’ll say what we did in case it helps anyone else who needs to rebind a sharded cluster to a new set of IPs.

First we ensured that every member in the cluster had an IP + hostname entry in /etc/hosts for the current set of IPs. We also updated all config files to use bindIpAll: true. Then we stopped the balancer and shut down the cluster, leaving only the config server replica set running. We used db.shards.updateOne() on the config db to change the existing IPs in each host value to hostnames. Finally we used rs.reconfig() to switch the config server replica set itself to hostnames.
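Condensed into a sketch, the config.shards step looked like this (the addresses below are placeholders, not our real ones; the string manipulation runs anywhere, and the commented lines are what we ran in mongosh):

```javascript
// Build the new "host" value for a config.shards document from the old
// IP-based one. All addresses here are hypothetical placeholders.
const oldHost = "shard1/10.0.0.1:27017,10.0.0.2:27017,10.0.0.3:27017";
const hostMap = {
  "10.0.0.1:27017": "s1r1:27017",
  "10.0.0.2:27017": "s1r2:27017",
  "10.0.0.3:27017": "s1r3:27017",
};
const [setName, memberList] = oldHost.split("/");
const newHost =
  setName + "/" + memberList.split(",").map((h) => hostMap[h] ?? h).join(",");
// newHost is "shard1/s1r1:27017,s1r2:27017,s1r3:27017"

// Then, in mongosh against the config server replica set:
//   sh.stopBalancer()   // run via a mongos before shutting them down
//   use config
//   db.shards.updateOne({ _id: "shard1" }, { $set: { host: newHost } })
```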

We restarted each shard and used rs.reconfig() to switch its members to hostnames. Before restarting the query routers we updated their config files to specify the config server replica set by hostname instead of IP in this field:

  configDB: <configReplSetName>/cfg1.example.net:27019,cfg2.example.net:27019,...

Once the query routers were up, the cluster was live on the existing IP addresses but in an IP-agnostic state: every member referred to every other member by hostname. Changing the IPs was now a simple matter of updating the network interface on each VM and the /etc/hosts IP => hostname mapping (or using DNS and removing the hardcoded mapping in /etc/hosts). Our cluster is live again on the new network. If there really isn't more to it than changing the host values for each of the shard documents in the config.shards collection, I'm confused as to why there isn't an sh.reconfig() method.
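For the final cutover, the only thing that changed on each VM was the /etc/hosts mapping (the addresses and hostnames below are invented for illustration):

```
# /etc/hosts on every cluster member -- before the cutover
10.0.0.1     s1r1.example.net
10.0.0.2     s1r2.example.net

# after moving to the new network (or remove these lines entirely
# once you trust DNS to resolve the hostnames)
192.168.5.1  s1r1.example.net
192.168.5.2  s1r2.example.net
```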
