PSS, client-side write timeouts after one secondary member fails

Hi MongoDB community! :slight_smile:

MongoDB Version: 6.0.19, self-managed hardware
Topology: PSS
Node Locations: 2x EU, 1x US
defaultWriteConcern: { w: 2, wtimeout: 0 }
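
For reference, the cluster-wide default above can be confirmed from mongosh with the standard getDefaultRWConcern admin command:

// print the currently configured cluster-wide default read/write concern
db.adminCommand({ getDefaultRWConcern: 1 })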

Problem:
We recently discovered a problem that did not occur for us on an earlier MongoDB 6.0 release.

Basically, whenever a Secondary node is shut down, all writes start timing out on the client side. My guess is that they are simply no longer getting acknowledged, which is not the expected behavior for a PSS topology with defaultWriteConcern w: 2, since the remaining primary and secondary should still be able to satisfy w: 2. We ran the same topology earlier and had no problems losing one Secondary.
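
One way to make the failure visible instead of letting writes hang is a test write with an explicit wtimeout. A minimal mongosh sketch (the database and collection names are hypothetical):

// with wtimeout set, the insert should return a write concern error after ~5s
// instead of blocking indefinitely when a second data-bearing node cannot acknowledge
db.getSiblingDB('test').writetest.insertOne(
  { ts: new Date() },
  { writeConcern: { w: 2, wtimeout: 5000 } }
)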

We then switched to a PSSA topology, and I thought the additional Arbiter might be the issue, but after removing the Arbiter the problem still persists, and we have no more clues as to what is going on :confused:
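
A minimal mongosh sanity check after the Arbiter removal (a sketch, not something already posted above) is to confirm that no arbiter entry or zero-vote member is left in the replica set config:

// list each member's role, votes and priority from the current replica set config
rs.conf().members.map(m => ({
  host: m.host,
  arbiterOnly: !!m.arbiterOnly,
  votes: m.votes,
  priority: m.priority
}))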

Maybe someone has ideas or has had similar issues?

Thank you :slight_smile:

rs.status()

{
  set: 'rs1',
  date: 2024-11-21T15:22:20.176Z,
  myState: 1,
  term: Long('28'),
  syncSourceHost: '',
  syncSourceId: -1,
  heartbeatIntervalMillis: Long('2000'),
  majorityVoteCount: 2,
  writeMajorityCount: 2,
  votingMembersCount: 3,
  writableVotingMembersCount: 3,
  optimes: {
    lastCommittedOpTime: { ts: Timestamp({ t: 1732202539, i: 934 }), t: Long('28') },
    lastCommittedWallTime: 2024-11-21T15:22:19.239Z,
    readConcernMajorityOpTime: { ts: Timestamp({ t: 1732202539, i: 934 }), t: Long('28') },
    appliedOpTime: { ts: Timestamp({ t: 1732202540, i: 413 }), t: Long('28') },
    durableOpTime: { ts: Timestamp({ t: 1732202540, i: 301 }), t: Long('28') },
    lastAppliedWallTime: 2024-11-21T15:22:20.156Z,
    lastDurableWallTime: 2024-11-21T15:22:20.109Z
  },
  lastStableRecoveryTimestamp: Timestamp({ t: 1732201811, i: 5268 }),
  electionCandidateMetrics: {
    lastElectionReason: 'stepUpRequestSkipDryRun',
    lastElectionDate: 2024-11-21T10:51:30.501Z,
    electionTerm: Long('28'),
    lastCommittedOpTimeAtElection: { ts: Timestamp({ t: 1732186288, i: 1050 }), t: Long('27') },
    lastSeenOpTimeAtElection: { ts: Timestamp({ t: 1732186288, i: 1050 }), t: Long('27') },
    numVotesNeeded: 2,
    priorityAtElection: 1,
    electionTimeoutMillis: Long('10000'),
    priorPrimaryMemberId: 8,
    numCatchUpOps: Long('0'),
    newTermStartDate: 2024-11-21T10:51:31.152Z,
    wMajorityWriteAvailabilityDate: 2024-11-21T10:51:35.090Z
  },
  electionParticipantMetrics: {
    votedForCandidate: false,
    electionTerm: Long('27'),
    lastVoteDate: 2024-11-21T10:33:30.136Z,
    electionCandidateMemberId: 8,
    voteReason: "candidate's data is staler than mine. candidate's last applied OpTime: { ts: Timestamp(1732185209, 7208), t: 26 }, my last applied OpTime: { ts: Timestamp(1732185210, 647), t: 26 }",
    lastAppliedOpTimeAtElection: { ts: Timestamp({ t: 1732185210, i: 647 }), t: Long('26') },
    maxAppliedOpTimeInSet: { ts: Timestamp({ t: 1732185210, i: 647 }), t: Long('26') },
    priorityAtElection: 1
  },
  members: [
    {
      _id: 8,
      name: '<IP>:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 15324,
      optime: [Object],
      optimeDurable: [Object],
      optimeDate: 2024-11-21T15:22:18.000Z,
      optimeDurableDate: 2024-11-21T15:22:18.000Z,
      lastAppliedWallTime: 2024-11-21T15:22:19.950Z,
      lastDurableWallTime: 2024-11-21T15:22:19.239Z,
      lastHeartbeat: 2024-11-21T15:22:18.956Z,
      lastHeartbeatRecv: 2024-11-21T15:22:19.103Z,
      pingMs: Long('119'),
      lastHeartbeatMessage: '',
      syncSourceHost: '<IP>:27017',
      syncSourceId: 10,
      infoMessage: '',
      configVersion: 29,
      configTerm: 28
    },
    {
      _id: 9,
      name: '<IP>:27017',
      health: 1,
      state: 2,
      stateStr: 'SECONDARY',
      uptime: 92273,
      optime: [Object],
      optimeDurable: [Object],
      optimeDate: 2024-11-21T13:50:23.000Z,
      optimeDurableDate: 2024-11-21T13:50:23.000Z,
      lastAppliedWallTime: 2024-11-21T13:50:24.485Z,
      lastDurableWallTime: 2024-11-21T13:50:24.485Z,
      lastHeartbeat: 2024-11-21T15:22:18.741Z,
      lastHeartbeatRecv: 2024-11-21T15:22:19.072Z,
      pingMs: Long('119'),
      lastHeartbeatMessage: '',
      syncSourceHost: '<IP>:27017',
      syncSourceId: 8,
      infoMessage: '',
      configVersion: 29,
      configTerm: 28
    },
    {
      _id: 10,
      name: '<IP>:27017',
      health: 1,
      state: 1,
      stateStr: 'PRIMARY',
      uptime: 149861,
      optime: [Object],
      optimeDate: 2024-11-21T15:22:20.000Z,
      lastAppliedWallTime: 2024-11-21T15:22:20.156Z,
      lastDurableWallTime: 2024-11-21T15:22:20.109Z,
      syncSourceHost: '',
      syncSourceId: -1,
      infoMessage: '',
      electionTime: Timestamp({ t: 1732186290, i: 1 }),
      electionDate: 2024-11-21T10:51:30.000Z,
      configVersion: 29,
      configTerm: 28,
      self: true,
      lastHeartbeatMessage: ''
    }
  ]
}
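
One thing that stands out in the output above: the member with _id: 9 reports an optimeDate roughly 1.5 hours behind the primary's, so per-member lag seems worth quantifying. A minimal mongosh sketch, derived from the same rs.status() fields:

// print each secondary's replication lag relative to the primary
const s = rs.status();
const primary = s.members.find(m => m.stateStr === 'PRIMARY');
s.members
  .filter(m => m.stateStr !== 'PRIMARY')
  .forEach(m => {
    const lagSeconds = (primary.optimeDate - m.optimeDate) / 1000;
    print(`${m.name}: ${lagSeconds}s behind primary`);
  });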

Three points to note:

  1. Since two nodes are in the EU and one is in the US, network latency might be causing issues, especially if an EU node is the one shut down, forcing acknowledgment from the distant US node. Check the ping between the US and EU nodes.
  2. Check the replication lag for the secondary nodes.
  3. Test with the w: 1 write concern as well.

In response to the points above:

  1. The problem shows up regardless of the Secondary's location.
  2. There is some replication lag, but not always.
  3. Will do (see the sketch below), but it's not ideal for our case; that's why we set it to w: 2.
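
For point 3, a per-operation override lets you test w: 1 without touching the cluster-wide default. A minimal sketch (the collection name is hypothetical):

// hypothetical test write with the write concern overridden to w: 1 for this operation only
db.getSiblingDB('test').writetest.insertOne(
  { probe: new Date() },
  { writeConcern: { w: 1, wtimeout: 5000 } }
)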

Examine the logs of both the primary and the remaining secondary when one secondary goes down. Look specifically for election changes, connection issues, or replication delays.
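
If you have a shell on each node, the recent in-memory log buffer can also be filtered directly from mongosh. A minimal sketch (the component filter is just an example):

// pull recent structured log lines and keep replication/election/network entries
const res = db.adminCommand({ getLog: 'global' });
res.log
  .filter(line => /"c":"(REPL|ELECTION|NETWORK)"/.test(line))
  .forEach(line => print(line));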

Also, high resource usage on the remaining nodes might affect acknowledgement.
It is possible that you are hitting resource limits that cause the failure.
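
A quick serverStatus snapshot on each remaining node can help rule that out. A minimal sketch (the selected fields are examples, not an exhaustive list):

// resource snapshot: connections, memory (MB), WiredTiger cache usage, and flow control state
const ss = db.serverStatus();
printjson({
  connections: ss.connections,
  mem: ss.mem,
  wtCacheBytesInUse: ss.wiredTiger.cache['bytes currently in the cache'],
  flowControl: ss.flowControl
});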