Receiving Invariant failure for "Impossible Optime Received" Mongo 3.6 Arbiter


We have a three-node PSA (Primary-Secondary-Arbiter) deployment currently running MongoDB 3.6. All three nodes are RHEL 7.9 virtual machines.

Historically, we have not had any issues with this setup but recently the mongod service on our Arbiter has been crashing and giving us this error:

2023-01-18T04:58:12.204-0500 F -        [replexec-200] Invariant failure opTime.getTimestamp().getInc() > 0 Impossible optime received: { ts: Timestamp(1674035892, 0), t: 252 } src/mongo/db/repl/replication_coordinator_impl.cpp 1213
2023-01-18T04:58:12.204-0500 F -        [replexec-200] \n\n***aborting after invariant() failure\n\n
2023-01-18T04:58:12.375-0500 F -        [replexec-200] Got signal: 6 (Aborted).
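For anyone cross-checking the error: the first field of the BSON Timestamp in that optime is seconds since the Unix epoch (and it matches the crash time in the log), while the invariant fires on the second field, the increment, which is 0 here. A quick sketch in Python:

```python
from datetime import datetime, timezone

# Fields from the optime in the log: { ts: Timestamp(1674035892, 0), t: 252 }
ts_seconds, ts_inc = 1674035892, 0

# The seconds component is a Unix epoch time; it lines up with the log line.
crash_time = datetime.fromtimestamp(ts_seconds, tz=timezone.utc)
print(crash_time.isoformat())  # 2023-01-18T09:58:12+00:00, i.e. 04:58:12 -0500

# The invariant that aborted the process checks the increment, not the seconds:
# "Invariant failure opTime.getTimestamp().getInc() > 0"
print(ts_inc > 0)  # False -- an increment of 0 is what makes this optime "impossible"
```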

The mongod service can be restarted immediately after the crash, but it will crash again, sometimes as soon as two days later.

I am not experienced with MongoDB and have been trying to do some research here and on the main MongoDB support site, but I have not seen any mention of this specific invariant failure.

Currently, we are having issues authenticating to the DB on the Arbiter, so we cannot pull rs.status() or rs.conf() output from it, but we can authenticate in our test environment and provide that information from there if needed.

I have read posts about the concerns around arbiters and the high-availability benefits that an arbiter-based deployment gives up, but unfortunately I was not with the company when the decision was made to add the Arbiter. Furthermore, the Primary and Secondary nodes are managed by a third-party company that maintains its bespoke application running on those nodes. Understandably this adds complexity, and it is part of why we are still on an older version of MongoDB.

Additional information in the Log File after crash/service restart:

2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] ** WARNING: This replica set node is running without journaling enabled but the
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          writeConcernMajorityJournalDefault option to the replica set config
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          is set to true. The writeConcernMajorityJournalDefault
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          option to the replica set config must be set to false
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          or w:majority write concerns will never complete.
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          In addition, this node's memory consumption may increase until all
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] **          available free RAM is exhausted.
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0]
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] New replica set config in use: { _id: "<rsName>", version: 489159, protocolVersion: 1, members: [ { _id: 1, host: "node2:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 2, host: "node1:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 2.0, tags: {}, slaveDelay: 0, votes: 1 }, { _id: 3, host: "node3:27017", arbiterOnly: true, buildIndexes: true, hidden: false, priority: 0.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatIntervalMillis: 2000, heartbeatTimeoutSecs: 10, electionTimeoutMillis: 10000, catchUpTimeoutMillis: 60000, catchUpTakeoverDelayMillis: 30000, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 }, replicaSetId: ObjectId('59c5a4175d7da122417038c8') } }
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] This node is in the config
2023-01-18T06:20:02.305-0500 I REPL     [replexec-0] transition to ARBITER from STARTUP
2023-01-18T06:20:02.320-0500 I REPL     [replexec-0] Member node2:27017 is now in state SECONDARY
2023-01-18T06:20:02.332-0500 I REPL     [replexec-1] Member node1:27017 is now in state PRIMARY

I also saw documentation noting that enabling majority read concern in a PSA architecture can lead to performance issues, but I am not sure whether that applies here. I can confirm that enableMajorityReadConcern is set to true in mongod.conf on Node1 and Node2, while on Node3 it is set to false. We have also not seen any evidence of RAM being exhausted leading up to the crash, as cautioned in the log above.
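For reference, this is the mongod.conf setting in question, shown as a YAML fragment (the value below mirrors what our Node3 currently has, not a recommendation):

```yaml
# mongod.conf fragment (MongoDB 3.6 YAML config format)
replication:
  enableMajorityReadConcern: false
```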

Any further guidance on where to troubleshoot this issue would be greatly appreciated. Thank you to anyone who takes time to read this.

Hey @ddonzella,

Welcome to the MongoDB Community Forums! :leaves:

The error message

Invariant failure opTime.getTimestamp().getInc() > 0 Impossible optime received: { ts: Timestamp(1674035892, 0), t: 252 }

suggests an issue in the replication process between the nodes in the replica set. The fact that the mongod service on the Arbiter node is the one crashing with this error points to a problem with how the Arbiter is configured or how it is communicating with the other nodes. It may be worth engaging the third-party company that manages the Primary and Secondary nodes to confirm that those nodes are properly configured and communicating with the Arbiter. They may also have more insight into the replica set configuration and any recent changes that could have caused the issue.

Also, the warning message

WARNING: This replica set node is running without journaling enabled but the writeConcernMajorityJournalDefault option to the replica set config is set to true

indicates a configuration mismatch: when journaling is disabled on a node but writeConcernMajorityJournalDefault is true, w: majority write concerns can never complete, and the node's memory use can grow until free RAM is exhausted. It is recommended that you either enable journaling (a MongoDB best practice for replica set members) or, if the node must run without journaling, set writeConcernMajorityJournalDefault to false in the replica set config; either change resolves the warning you're seeing. You can read more about setting up and managing journaling here: Journaling
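For reference, a sketch of how that replica set config change is typically made from the mongo shell, connected to the PRIMARY (assumes a user with sufficient privileges; verify afterwards with rs.conf(), and ideally do this in a maintenance window):

```javascript
// mongo shell (3.6), connected to the PRIMARY of the replica set
cfg = rs.conf();                                 // fetch the current replica set config
cfg.writeConcernMajorityJournalDefault = false;  // acknowledge w:majority without journaled writes
rs.reconfig(cfg);                                // apply; brief heartbeat/election churn is possible
```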

Please let us know if this helps or not. Feel free to reach out for anything else as well.



Hello @Satyam,

Thank you so much for replying. I will reach out to our third-party about the journaling and see what they say about any communication errors.

As of now, our mongod service on the Arbiter has been running for 7 days without crashing or seeing that error.

I will do my best to give an update within a reasonable time, but that is fully dependent on when I can schedule time with the other company.

Thanks again,


Hello @Satyam,

I was able to meet with our partners and discuss the issue. In doing so, I dug deeper than timedatectl reporting NTP enabled: yes and NTP synchronized: yes, and found that the systems are polling our local NTP appliance at very different intervals: 128 s, 512 s, and 1024 s.

Although they all report the same time when running the timedatectl command, would the different polling intervals be enough to throw off the opTime timestamps and produce this error?

Thanks for your help and patience,

Hey @ddonzella,

The intervals you mentioned should not be a cause for concern, since MongoDB can tolerate clock differences of up to a year between nodes. Is there anything else that your third party mentioned? Any recent changes to any nodes, any maintenance or upgrades? If not, I would suggest upgrading your MongoDB version. Support for the 3.6 series officially ended in January 2020, and if this is a known issue, a fix won't be backported to 3.6; the solution would be to upgrade to a supported version anyway. So it may be time to discuss upgrading your version if this failure continues and see if that helps.




Unfortunately, we could not find any major changes coinciding with when this started happening. I did find documentation from the individual who set up the Arbiter (no longer with the company) mentioning that the mongod service crashed during periods of high I/O. However, we cannot find any instances of abnormal spikes in reads or writes to the VM or to the database.

I will continue talking with our counterparts from the third party and see how they feel about upgrading.

Thanks again for your time, I greatly appreciate your help.

If we do find the root cause, I will be sure to come back and post it here.

