I am currently running a two-node replica set with a systemd configuration resembling the following:
[Service]
User=mongod
Group=mongod
Environment="OPTIONS=-f /etc/mongod.conf"
Environment="MONGODB_CONFIG_OVERRIDE_NOFORK=1"
Environment="LD_PRELOAD=/usr/lib64/libjemalloc.so.2"
...
MemoryHigh=501M
MemoryMax=499M
Restart=always
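(As an aside, I notice MemoryHigh is set above MemoryMax here. systemd documents MemoryHigh as the soft throttling threshold and MemoryMax as the hard limit at which the OOM killer is invoked, so with these values the process would be killed before throttling ever engages. A layout like the following, with illustrative values, would throttle before killing:)

```ini
# Hypothetical values: soft throttling threshold (MemoryHigh)
# placed below the hard OOM-kill limit (MemoryMax)
MemoryHigh=450M
MemoryMax=500M
```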
For our external users, this replica set serves queries with secondaryPreferred read preference. Both nodes occasionally crash due to OOM kills; sometimes the replica set detects this, marks the crashed node as '(not reachable/healthy)', and reroutes queries to the surviving node.
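My understanding is that server selection only checks liveness (heartbeats), not query responsiveness, so on the client side we have been considering bounding timeouts in the connection string to fail fast. A minimal sketch, where the replica-set name `rs0` and the timeout values are assumptions for illustration:

```javascript
// Sketch: client-side timeouts as a guard against a "hung but healthy" member.
// These are standard MongoDB connection-string options; the values are guesses.
const uri =
  'mongodb://venus.swiftinit.org:27017,juno.swiftinit.org:27017/' +
  '?replicaSet=rs0' +                  // assumed replica-set name
  '&readPreference=secondaryPreferred' +
  '&serverSelectionTimeoutMS=5000' +   // give up on server selection quickly
  '&socketTimeoutMS=10000';            // drop sockets that stop responding
// Per operation, cursors also accept maxTimeMS to bound execution server-side.
```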
However, we have recently begun noticing instances in which one of the nodes slows down dramatically (most likely due to memory pressure), yet the replica set continues to believe the throttled node is healthy and keeps routing queries to it. For example, earlier today the SECONDARY node was clearly hung (all user queries were timing out), yet it still remained part of the replica set when we inspected the two nodes from mongosh.
From the perspective of the SECONDARY:
members: [
  {
    _id: 0,
    name: 'venus.swiftinit.org:27017',
    health: 1,
    state: 2,
    stateStr: 'SECONDARY',
    uptime: 87173,
    optime: { ts: Timestamp({ t: 1705975729, i: 1 }), t: Long('82') },
    optimeDate: ISODate('2024-01-23T02:08:49.000Z'),
    lastAppliedWallTime: ISODate('2024-01-23T02:08:49.318Z'),
    lastDurableWallTime: ISODate('2024-01-23T02:08:49.318Z'),
    syncSourceHost: 'juno.swiftinit.org:27017',
    syncSourceId: 1,
    infoMessage: '',
    configVersion: 101786,
    configTerm: 82,
    self: true,
    lastHeartbeatMessage: ''
  },
  {
    _id: 1,
    name: 'juno.swiftinit.org:27017',
    health: 1,
    state: 1,
    stateStr: 'PRIMARY',
    uptime: 249,
    optime: { ts: Timestamp({ t: 1705975729, i: 2 }), t: Long('82') },
    optimeDurable: { ts: Timestamp({ t: 1705975729, i: 2 }), t: Long('82') },
    optimeDate: ISODate('2024-01-23T02:08:49.000Z'),
    optimeDurableDate: ISODate('2024-01-23T02:08:49.000Z'),
    lastAppliedWallTime: ISODate('2024-01-23T02:08:49.974Z'),
    lastDurableWallTime: ISODate('2024-01-23T02:08:49.974Z'),
    lastHeartbeat: ISODate('2024-01-23T02:10:58.732Z'),
    lastHeartbeatRecv: ISODate('2024-01-23T02:10:56.245Z'),
    pingMs: Long('249'),
    lastHeartbeatMessage: '',
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    electionTime: Timestamp({ t: 1705975643, i: 2 }),
    electionDate: ISODate('2024-01-23T02:07:23.000Z'),
    configVersion: 101786,
    configTerm: 82
  }
],
From the perspective of the PRIMARY:
members: [
  {
    _id: 0,
    name: 'venus.swiftinit.org:27017',
    health: 1,
    state: 2,
    stateStr: 'SECONDARY',
    uptime: 347,
    optime: { ts: Timestamp({ t: 1705975787, i: 653 }), t: Long('82') },
    optimeDurable: { ts: Timestamp({ t: 1705975787, i: 653 }), t: Long('82') },
    optimeDate: ISODate('2024-01-23T02:09:47.000Z'),
    optimeDurableDate: ISODate('2024-01-23T02:09:47.000Z'),
    lastAppliedWallTime: ISODate('2024-01-23T02:09:47.190Z'),
    lastDurableWallTime: ISODate('2024-01-23T02:09:47.190Z'),
    lastHeartbeat: ISODate('2024-01-23T02:10:45.922Z'),
    lastHeartbeatRecv: ISODate('2024-01-23T02:10:45.335Z'),
    pingMs: Long('1154'),
    lastHeartbeatMessage: '',
    syncSourceHost: 'juno.swiftinit.org:27017',
    syncSourceId: 1,
    infoMessage: '',
    configVersion: 101786,
    configTerm: 82
  },
  {
    _id: 1,
    name: 'juno.swiftinit.org:27017',
    health: 1,
    state: 1,
    stateStr: 'PRIMARY',
    uptime: 350,
    optime: { ts: Timestamp({ t: 1705975839, i: 1 }), t: Long('82') },
    optimeDate: ISODate('2024-01-23T02:10:39.000Z'),
    lastAppliedWallTime: ISODate('2024-01-23T02:10:39.320Z'),
    lastDurableWallTime: ISODate('2024-01-23T02:10:39.320Z'),
    syncSourceHost: '',
    syncSourceId: -1,
    infoMessage: '',
    electionTime: Timestamp({ t: 1705975643, i: 2 }),
    electionDate: ISODate('2024-01-23T02:07:23.000Z'),
    configVersion: 101786,
    configTerm: 82,
    self: true,
    lastHeartbeatMessage: ''
  }
],
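For reference, this is roughly how I have been eyeballing the heartbeat fields in outputs like the above: a small helper that flags members whose `lastHeartbeatRecv` is stale or whose `pingMs` is high. The thresholds are arbitrary guesses on my part, not anything MongoDB defines:

```javascript
// Sketch: flag replica-set members whose heartbeats look stale or slow,
// using fields from rs.status().members. Thresholds are arbitrary guesses.
function flagSuspectMembers(members, nowMs, staleMs = 10000, slowPingMs = 500) {
  return members
    .filter((m) => !m.self) // the local node has no heartbeat fields
    .map((m) => ({
      name: m.name,
      // how long since we last received a heartbeat from this member
      staleHeartbeat: nowMs - m.lastHeartbeatRecv.getTime() > staleMs,
      // round-trip time to this member, in milliseconds
      slowPing: Number(m.pingMs) > slowPingMs,
    }));
}
```

Notably, a member can pass both checks and still be unable to serve user queries, which seems to be exactly the situation below.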
What is going on here?