How to force a reload of the replica-set config in shard servers

hi all,
we are deploying a number of replica sets, each consisting of 3 servers. For one of the servers of a new replica set “db_rs017”, we made an error and assigned it the IP address of another server that is already configured in its own, separate replica set “db_rs006”.
That was corrected fairly quickly, but ever since, all other servers in our cluster have been reporting their confusion:
2020-07-22T11:19:30.519+0200 E NETWORK [ReplicaSetMonitor-TaskExecutor] replset name mismatch: expected "db_rs006", but remote node mongop_db0063:27018 has replset name "db_rs017", ismaster: { hosts: [ "mongop_db0171:27018" ], passives: [ "mongop_db0172:27018", "mongop_db0173:27018" ], setName: "db_rs017", setVersion: 5, ismaster: false, secondary: true, primary: "mongop_db0171:27018", passive: true, me: "mongop_db0172:27018", lastWrite: { opTime: { ts: Timestamp(1595409557, 109), t: 1 }, lastWriteDate: new Date(1595409557000), majorityOpTime: { ts: Timestamp(1595409557, 109), t: 1 }, majorityWriteDate: new Date(1595409557000) }, maxBsonObjectSize: 16777216, maxMessageSizeBytes: 48000000, maxWriteBatchSize: 100000, localTime: new Date(1595409537042), logicalSessionTimeoutMinutes: 30, connectionId: 46, minWireVersion: 0, maxWireVersion: 8, readOnly: false, compression: [ "snappy", "zstd", "zlib" ], ok: 1.0, $gleStats: { lastOpTime: Timestamp(0, 0), electionId: ObjectId('000000000000000000000000') }, lastCommittedOpTime: Timestamp(1595409557, 109), $configServerState: { opTime: { ts: Timestamp(1595409549, 36), t: 5 } }, $clusterTime: { clusterTime: Timestamp(1595409572, 11), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } }, operationTime: Timestamp(1595409557, 109) }

I have already tried a few actions:

  • stop the routers and all config servers, and restart them: no luck
  • stop the server “mongop_db0063”, remove it from the replica set “db_rs006”, then re-add it (roughly as sketched below): no luck
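
The remove/re-add went roughly like this; the primary hostname mongop_db0061:27018 is only an illustration, run it against whichever member is currently primary of “db_rs006”:
$ mongo mongop_db0061:27018 -u mongo-admin -p$MyPwd --authenticationDatabase admin
> rs.remove("mongop_db0063:27018")   // take the member out of db_rs006
> rs.conf().members                  // check that it is no longer listed
> rs.add("mongop_db0063:27018")      // add it back; it pulls the set config from the primary again
> rs.status().members                // wait for it to come back as SECONDARY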

If I read out the replica-set info from server “mongop_db0063”, it correctly reports that it belongs to “db_rs006”:
$ mongo localhost:27018 -u mongo-admin -p$MyPwd --authenticationDatabase admin --eval "db.adminCommand({replSetGetStatus:1})"
MongoDB server version: 4.2.8
{ "set" : "db_rs006" ... }

I cannot trash this replica set, because it contains data.

=> Where does the misconfiguration sit? Is it in the config servers (and how do I fix it there), or in all the shard servers (and how do I fix those)?
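
In case it helps to narrow this down: the shard-to-replica-set mapping held by the config servers can be read from any mongos, and flushRouterConfig asks that mongos to rebuild its cached sharding metadata. The mongos hostname below is a placeholder, and I’m assuming the shard was registered under the name db_rs006:
$ mongo <mongos-host>:27017 -u mongo-admin -p$MyPwd --authenticationDatabase admin
> db.getSiblingDB("config").shards.find({ _id: "db_rs006" }).pretty()   // the "host" field shows "<setName>/<member list>" for that shard
> db.adminCommand({ flushRouterConfig: 1 })                             // drop this mongos's cached sharding metadata so it reloads from the config servers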

many thx in advance to anyone who finds the time for a quick answer!

Super fun. Sounds more like a networking issue than a mongo one.

After 9hrs hopefully this is corrected.

I would be checking/clearing the ARP entries on the mongos routers and on the nodes in db_rs006. You might also need to do this on your network switches and routers.
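
Something along these lines, assuming Linux hosts (the interface name eth0 and the placeholder IP are only examples):
$ ip neigh show | grep <ip-of-mongop_db0063>   # which MAC is currently cached for that IP
$ sudo ip neigh flush dev eth0                 # clear the ARP cache on that interface
$ arping -c 3 -I eth0 <ip-of-mongop_db0063>    # check which MAC answers for that IP now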

hi Chris,
thx very much for taking the time to answer.
Our issue has been going on for a few weeks already, so slightly longer than 9hrs :wink:
ARP cache entries have a time-to-live of 180 secs, so clearing them now won’t remove the ‘history’ of this incorrect replica-set name for server “mongop_db0063”.
I assume this correlation between server and replica-set name is stored somewhere in a file on the servers (on which servers? in which file(s)?)
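
The only places I know to look so far are the replSetName (replication.replSetName) setting in each mongod’s config file and the replica-set config document persisted in the member’s local database, e.g. on “mongop_db0063”:
$ mongo localhost:27018 -u mongo-admin -p$MyPwd --authenticationDatabase admin
> db.getSiblingDB("local").system.replset.find().pretty()    // the persisted config document; its "_id" is the set name
> db.adminCommand({ replSetGetConfig: 1 }).config.members    // the same config as seen by the running node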

As you say, the node itself is configured correctly. The other nodes are connecting to a node that is configured for db_rs017.

I’m still pretty certain you are experiencing a network issue, not a mongo one. I’m pretty sure that mongo does not cache a name-to-IP mapping. This is evident because the cluster only started reporting the error when your network misconfiguration occurred.
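
A quick check would be to verify, from one of the other nodes, which IP the name mongop_db0063 actually resolves to and which IP answers (assuming Linux hosts):
$ getent hosts mongop_db0063   # the IP this peer resolves the name to (hosts file or DNS)
$ ping -c 1 mongop_db0063      # the IP that actually answers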

You should try connecting to mongop_db0063:27018 from one of its replica-set peers and running the same replSetGetStatus. I’d be surprised if it returns db_rs006.
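
Something like this, reusing the credentials from your earlier command:
$ mongo mongop_db0063:27018 -u mongo-admin -p$MyPwd --authenticationDatabase admin --eval "db.adminCommand({ replSetGetStatus: 1 }).set"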

Have you restarted the host or the network stack of mongop_db0063?

ok, I have restarted the entire cluster… The issue is gone now. Good to know in case this might happen again (we won’t make IP mistakes anymore, for sure :wink: )

thx Chris for your replies
