All MongoDB shard replica set members in RECOVERING state

I'm trying to upgrade some of our MongoDB hosts, moving them from CentOS 6 and MongoDB 3.0 to CentOS 7 and MongoDB 3.6.

We have a shard replica set and three legacy (mirrored) config servers. The cluster uses three hosts: two of them hold a full copy of the shard's data, while the third is just an arbiter.

I'm testing the upgrade procedure on some other virtual machines. The other upgrades went smoothly, but those deployments didn't have a shard replica set; they were standalone shards that were only later converted into single-member shard replica sets.

To test this, I shut down the config server on the arbiter host and copied its files to the three test VMs. Since the three config servers are supposed to be identical copies, I assume it doesn't matter which host the files come from. This appears to work: I set up mongos with the new config server names and was able to update the replica set hostnames.
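In case it matters, I did the hostname update through mongos against the config metadata, roughly like this (the shard id and host string below are placeholders, not our real names):

```javascript
// Connected through mongos, which points at the new config servers.
use config
db.shards.update(
    { _id: "shard0000" },  // placeholder shard id
    { $set: { host: "rs0/newhost1.test:27018,newhost2.test:27018" } }
)
// Then restart each mongos so it reloads the cluster metadata.
```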

For the shard replica set, I shut down the secondary member and copied its files to the two non-arbiter test hosts. On the arbiter host, I started mongod with the replica set name and a blank data directory.

I followed this guide (https://docs.mongodb.com/manual/tutorial/change-hostnames-in-a-replica-set/#replica-set-change-hostname-downtime) to change the hostnames in the replica set configuration, then started mongod with the replica set name on each of the two non-arbiter test hosts.
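For anyone following along, the core of that downtime procedure is editing the saved config in the local database while each member runs without the --replSet option. This is roughly what I ran, with placeholder hostnames:

```javascript
// Run on a member started WITHOUT --replSet (standalone mode).
use local
var cfg = db.system.replset.findOne();
cfg.members[0].host = "newhost1.test:27018";   // placeholder hostnames
cfg.members[1].host = "newhost2.test:27018";
cfg.members[2].host = "newarbiter.test:27018";
db.system.replset.update({ _id: cfg._id }, cfg);
// Restart the member with --replSet afterwards.
```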

The arbiter node seems okay, and I can run rs.status() from any of the three nodes, but the two data-bearing nodes are in the RECOVERING state. I don't know whether this is normal or whether they are stuck there; the current time is far past the last entry in the oplog.
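To watch their progress (or lack of it), I've been comparing member states and optimes with a quick loop in the shell:

```javascript
// Prints each member's state and the time of its last applied operation.
rs.status().members.forEach(function (m) {
    print(m.name + "  " + m.stateStr + "  last op: " + m.optimeDate);
});
```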

Is there a step I have missed? Will I need to use more recent shard backup data? How can I recover from RECOVERING? Can I force one of the members to become primary? The docs suggest that if a member falls far enough behind, it may require manual intervention, but they don't say what that intervention is. They seem to suggest a full resync, but with no member being primary, I don't know how I would do that.
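If it clarifies the last question: my understanding is that a force reconfig can shrink the set so that a single data-bearing member elects itself, something like the sketch below, but I don't know whether that is safe or appropriate here.

```javascript
// Run on the data-bearing member to keep; the member index is a placeholder.
var cfg = rs.conf();
cfg.members = [cfg.members[0]];     // keep only this member in the config
rs.reconfig(cfg, { force: true }); // force is required when no primary exists
```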

Thanks.

An arbiter does not hold data for the replica set. You need to copy the data from a secondary or primary.

There is another tutorial for what you are doing:

The config servers are not a replica set. They are mirrored config servers, because MongoDB 3.0 did not support replica set config servers. So the only data I copied from the arbiter host was the config server's.

That document doesn't really cover my situation. It talks about moving servers and then checking against the primary to make sure the data is still in sync. In my scenario, restoring a shard replica set's data onto new hosts, no member was ever primary, and I don't know how to make one become primary.

I was able to resync from a newer copy of the replica set data, and that worked, but I don't understand why it worked when the first copy did not. Perhaps it was a fluke?
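One possible explanation (unverified) is that the first copy was not internally consistent: if a member's minValid marker points past the newest entry in its own oplog, it stays in RECOVERING. Comparing the two would have confirmed or ruled that out:

```javascript
// Compare the replication minValid marker with the newest oplog entry;
// if minValid is ahead of the oplog head, the member cannot leave RECOVERING.
var local = db.getSiblingDB("local");
printjson(local.replset.minvalid.findOne());
printjson(local.oplog.rs.find().sort({ $natural: -1 }).limit(1).next());
```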