Weird prompt status on replica set node after upgrade from 5.0.14 to 6.0.3

AmitG · November 20, 2022, 4:19pm

I don’t have a lot of details about the problem since we had to revert our upgrade in production pretty quickly. We have a sharded cluster with 3 shards (mongocluster1, mongocluster2, mongocluster3). Each shard is a 3-node RS. The config servers are a 3-node RS as well (configReplSet). The configReplSet and 2 of the shards (mongocluster1 and mongocluster3) seemed to upgrade fine. However, in our second shard (mongocluster2, which consists of vindb2, pndb2 and daldb2), daldb2 showed the following status in the mongosh prompt. Notice that the prompt says [direct: other], yet rs.status() shows SECONDARY for the node:

mongocluster2 [direct: other] test> rs.status()
{
  set: 'mongocluster2',
  date: ISODate("2022-11-20T07:30:30.003Z"),
  myState: 2,
  term: Long("66"),
  syncSourceHost: 'pndb2:27017',
  syncSourceId: 7,
  heartbeatIntervalMillis: Long("2000"),
  majorityVoteCount: 2,
  writeMajorityCount: 2,
  votingMembersCount: 2,
  writableVotingMembersCount: 2,
  optimes: {
    lastCommittedOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    lastCommittedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
    readConcernMajorityOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    appliedOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    durableOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
    lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z")
  },
  lastStableRecoveryTimestamp: Timestamp({ t: 1668929239, i: 1 }),
  members: [
    {
      _id: 6,
      name: 'vindb2:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 37,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDurable: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      optimeDurableDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastHeartbeat: ISODate("2022-11-20T07:30:29.881Z"),
      lastHeartbeatRecv: ISODate("2022-11-20T07:30:28.785Z"),
      pingMs: Long("29"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1668928629, i: 1 }),
      electionDate: ISODate("2022-11-20T07:17:09.000Z"),
      configVersion: 27,
      configTerm: 66
    },
    {
      _id: 7,
      name: 'pndb2:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 37,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDurable: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      optimeDurableDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastHeartbeat: ISODate("2022-11-20T07:30:29.733Z"),
      lastHeartbeatRecv: ISODate("2022-11-20T07:30:29.465Z"),
      pingMs: Long("22"),
      lastHeartbeatMessage: '',
      syncSourceHost: 'vindb2:27017',
      syncSourceId: 6,
      infoMessage: '',
      configVersion: 27,
      configTerm: 66
    },
    {
      _id: 11,
      name: 'daldb2:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 39,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      syncSourceHost: 'pndb2:27017',
      syncSourceId: 7,
      infoMessage: '',
      configVersion: 27,
      configTerm: 66,
      self: true,
      lastHeartbeatMessage: ''
    }
  ],
  ok: 1,
  '$gleStats': {
    lastOpTime: Timestamp({ t: 0, i: 0 }),
    electionId: ObjectId("000000000000000000000000")
  },
  lastCommittedOpTime: Timestamp({ t: 1668929429, i: 1 }),
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1668929429, i: 1 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1668929429, i: 1 })
}

I don’t have much to go on. The only other clue I saw was in some app logs that had an error like:

"errmsg" : "Encountered non-retryable error during query :: caused by :: BSON field 'DatabaseVersion.timestamp' is missing but a required field"

Does anyone know why the prompt would show [direct: other] while rs.status() shows everything is fine?

AmitG · November 20, 2022, 6:19pm

While doing more research, I came across the following closed bug: https://jira.mongodb.org/browse/SERVER-68511

Although the versions I’m running are supposedly the fixed ones, the comments about $version.timestamp and the behavior of the nodes not moving primaries correctly look suspiciously close to the issue we were facing.
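One way to test that suspicion might be to look for `config.databases` entries whose `version` sub-document has no `timestamp`. The following is only a sketch: the helper name `databasesMissingTimestamp` is hypothetical, and it assumes the entries carry a `version` document with a `timestamp` field, as they normally do on 5.0 and later.

```javascript
// Hypothetical helper: returns the _id (database name) of every
// config.databases entry that has no version.timestamp field.
function databasesMissingTimestamp(entries) {
  return entries
    .filter((entry) => !entry.version || entry.version.timestamp === undefined)
    .map((entry) => entry._id);
}

// `db` only exists inside mongosh; connect through a mongos before running.
if (typeof db !== "undefined") {
  const entries = db.getSiblingDB("config").databases.find().toArray();
  printjson(databasesMissingTimestamp(entries));
}
```

An empty list would suggest the error has another cause; a non-empty list would only be a hint, not proof, that the metadata is what the 6.0 binaries are rejecting.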

Tarun_Gaur · November 24, 2022, 4:59am

Hello @AmitG ,

Typically the status [direct: other] means you are connected to a node that is not in the Primary, Secondary or Arbiter state. There are replica set member states other than those three, but most of the time they are transient and the node will settle into Primary, Secondary, or Arbiter. Nodes in this state cannot be queried, but their host lists are useful for discovering the current replica set configuration. For more information, please check the link below on RSGhost and RSOther.

https://github.com/mongodb/specifications/blob/master/source/server-discovery-and-monitoring/server-discovery-and-monitoring.rst#rsghost-and-rsother
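As a rough illustration of the table in that spec, the sketch below classifies a `hello` response the way a driver would. The function name `classifyServer` is made up for this example, and it assumes the prompt label is derived from this classification; it is not the shell's actual source.

```javascript
// Sketch of the server-type table from the Server Discovery and
// Monitoring spec. Input is the document returned by the hello command.
function classifyServer(hello) {
  if (!hello || !hello.ok) return "Unknown";
  if (hello.isreplicaset) return "RSGhost";
  if (hello.msg === "isdbgrid") return "Mongos";
  if (hello.setName) {
    // Per the spec, a hidden member is RSOther whatever else it reports.
    if (hello.hidden) return "RSOther";
    if (hello.isWritablePrimary || hello.ismaster) return "RSPrimary";
    if (hello.secondary) return "RSSecondary";
    if (hello.arbiterOnly) return "RSArbiter";
    return "RSOther";
  }
  return "Standalone";
}

// `db` only exists inside mongosh; run this while connected to the node in question.
if (typeof db !== "undefined") {
  print(classifyServer(db.hello()));
}
```

Comparing the output of `db.hello()` on the affected node with its `rs.status()` entry may show which field leads to the "other" classification.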

Could you please confirm if you are facing any issues apart from prompt showing [direct: other] or if things are working as expected?

Regards,
Tarun

AmitG · November 24, 2022, 6:04am

Hi Tarun,

While I was upgrading production, I saw this prompt (even though rs.status() showed 1 primary and 2 secondary members). You can see the rs.status() in my post above to verify. It was not transient… The prompt was stuck in that state… I tried restarting the mongod service.

While this was happening, I noticed a major outage: our production applications were not able to connect to our mongo sharded cluster. I had to revert to 5.0 quickly, so I didn’t have time to gather more info. I did find the error message quoted in my post above about “BSON field ‘DatabaseVersion.timestamp’ is missing but a required field”. I think it may be related, but I’m not sure.

We have done every major upgrade on MongoDB over the years without failure… This latest one has me hesitant to try again soon. That being said, I’m happy to try again if the MongoDB team wants to watch and assist on a weekend.

Todd_Vernick · October 3, 2023, 6:57pm

Were you able to find a fix for this issue?

denny99 · February 19, 2024, 1:38pm

Same issue here, after upgrading from 5.0.24 to 6.0.13. No solution yet, had to downgrade.

Yadu_Krishnan · May 9, 2024, 11:37am

We are also facing the same issue. We upgraded from 5.0.24 to 6.0.14, hit this issue, and reverted to the older version.
Any solution?

denny99 · May 13, 2024, 9:10am

You need to follow the upgrade guide on the MongoDB website.

https://www.mongodb.com/docs/manual/release-notes/6.0-upgrade-sharded-cluster/#refresh-the-cached-routing-table-for-each-mongos

You’ve got to flush the mongos cached routing table before upgrading, and you’ll solve the issue.
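For reference, a minimal sketch of that flush, assuming you run it in mongosh against each mongos in turn. The wrapper name `flushRoutingTable` is hypothetical; the underlying command is `flushRouterConfig`.

```javascript
// Hypothetical wrapper: flushes the cached routing table on whichever
// mongos `database` is connected to. Repeat for every mongos.
function flushRoutingTable(database) {
  const result = database.adminCommand({ flushRouterConfig: 1 });
  if (!result || result.ok !== 1) {
    throw new Error("flushRouterConfig failed: " + JSON.stringify(result));
  }
  return result;
}

// `db` only exists inside mongosh; under plain Node this block is skipped.
if (typeof db !== "undefined") {
  printjson(flushRoutingTable(db));
}
```

Check the linked guide for the exact point in the upgrade sequence at which to run this.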