I don’t have many details about the problem because we had to revert our production upgrade pretty quickly. We have a sharded cluster with 3 shards (mongocluster1, mongocluster2, mongocluster3). Each shard is a 3-node replica set, and the config servers are a 3-node replica set as well (configReplSet). The configReplSet and 2 of the shards (mongocluster1 and mongocluster3) seemed to upgrade fine. However, in our second shard (mongocluster2, which consists of vindb2, pndb2 and daldb2), daldb2 showed the following status at the mongosh prompt. Notice that the prompt says [direct: other], yet rs.status() shows SECONDARY for the node:
mongocluster2 [direct: other] test> rs.status()
{
  set: 'mongocluster2',
  date: ISODate("2022-11-20T07:30:30.003Z"),
  myState: 2,
  term: Long("66"),
  syncSourceHost: 'pndb2:27017',
  syncSourceId: 7,
  heartbeatIntervalMillis: Long("2000"),
  majorityVoteCount: 2,
  writeMajorityCount: 2,
  votingMembersCount: 2,
  writableVotingMembersCount: 2,
  optimes: {
    lastCommittedOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    lastCommittedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
    readConcernMajorityOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    appliedOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    durableOpTime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
    lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
    lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z")
  },
  lastStableRecoveryTimestamp: Timestamp({ t: 1668929239, i: 1 }),
  members: [
    {
      _id: 6,
      name: 'vindb2:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 37,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDurable: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      optimeDurableDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastHeartbeat: ISODate("2022-11-20T07:30:29.881Z"),
      lastHeartbeatRecv: ISODate("2022-11-20T07:30:28.785Z"),
      pingMs: Long("29"),
      lastHeartbeatMessage: '',
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1668928629, i: 1 }),
      electionDate: ISODate("2022-11-20T07:17:09.000Z"),
      configVersion: 27,
      configTerm: 66
    },
    {
      _id: 7,
      name: 'pndb2:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 37,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDurable: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      optimeDurableDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastHeartbeat: ISODate("2022-11-20T07:30:29.733Z"),
      lastHeartbeatRecv: ISODate("2022-11-20T07:30:29.465Z"),
      pingMs: Long("22"),
      lastHeartbeatMessage: '',
      syncSourceHost: 'vindb2:27017',
      syncSourceId: 6,
      infoMessage: '',
      configVersion: 27,
      configTerm: 66
    },
    {
      _id: 11,
      name: 'daldb2:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 39,
      optime: { ts: Timestamp({ t: 1668929429, i: 1 }), t: Long("66") },
      optimeDate: ISODate("2022-11-20T07:30:29.000Z"),
      lastAppliedWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      lastDurableWallTime: ISODate("2022-11-20T07:30:29.566Z"),
      syncSourceHost: 'pndb2:27017',
      syncSourceId: 7,
      infoMessage: '',
      configVersion: 27,
      configTerm: 66,
      self: true,
      lastHeartbeatMessage: ''
    }
  ],
  ok: 1,
  '$gleStats': {
    lastOpTime: Timestamp({ t: 0, i: 0 }),
    electionId: ObjectId("000000000000000000000000")
  },
  lastCommittedOpTime: Timestamp({ t: 1668929429, i: 1 }),
  '$clusterTime': {
    clusterTime: Timestamp({ t: 1668929429, i: 1 }),
    signature: {
      hash: Binary(Buffer.from("0000000000000000000000000000000000000000", "hex"), 0),
      keyId: Long("0")
    }
  },
  operationTime: Timestamp({ t: 1668929429, i: 1 })
}
I don’t have much to go on. The only other clue I saw was an error in some app logs like:
"errmsg" : "Encountered non-retryable error during query :: caused by :: BSON field 'DatabaseVersion.timestamp' is missing but a required field"
Does anyone know why the prompt would show [direct: other] while rs.status() shows everything is fine?
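
In case it helps anyone reproduce this: as far as I understand, the mongosh prompt label comes from how the driver classifies the node's hello response, not from rs.status(), so the two can disagree. Below is a rough sketch of that classification as I understand it from the driver server-discovery rules. classifyHello is a hypothetical helper name of mine, not a mongosh built-in, and the rules are my reading rather than the actual mongosh code. I did not capture the hello output from daldb2 before we reverted, so this shows what I would check next time, not what happened.

```javascript
// Hypothetical helper (not part of mongosh): label a `hello` response
// roughly the way a driver would when deciding the server type.
// The rules below are an assumption based on my reading of the
// server-discovery rules, not a copy of the real implementation.
function classifyHello(hello) {
  if (!hello || hello.ok !== 1) return 'unknown';
  if (hello.msg === 'isdbgrid') return 'mongos';
  if (hello.setName) {
    // A replica set member: primary, secondary, arbiter, or "other".
    if (hello.isWritablePrimary === true || hello.ismaster === true) return 'primary';
    if (hello.hidden === true) return 'other';
    if (hello.secondary === true) return 'secondary';
    if (hello.arbiterOnly === true) return 'arbiter';
    // Has a set name but reports none of the roles above.
    return 'other';
  }
  if (hello.isreplicaset === true) return 'ghost';
  return 'standalone';
}
```

In mongosh, connected directly to the node, this could be run as classifyHello(db.hello()) and compared with the stateStr that rs.status() reports for the same member. If hello comes back with a setName but neither isWritablePrimary nor secondary set to true, that would at least be consistent with an "other" label.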
