I am deploying a MongoDB 4.2.0 replica set, say a 3-node cluster, onto on-prem VMs, where occasional network glitches are quite possible. I ran into a situation where we had frequent failovers due to such glitches.
I checked the online docs to see how we can make MongoDB tolerant of network glitches.
I found that we have to disable “enableElectionHandoff”. Once we do that, MongoDB respects
“settings.electionTimeoutMillis” (default 10 seconds).
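
For reference, here is roughly how those two settings can be expressed. This is a minimal sketch, not a definitive procedure: the replica set name `rs0`, the host names, and the helper function names are assumptions made up for illustration. It only builds the command documents; actually sending them (for example with a driver's admin command call) is left out so the snippet runs without a server.

```python
import copy


def with_election_timeout(config, timeout_ms):
    """Return a copy of a replica set config with settings.electionTimeoutMillis set.

    The version is incremented because a reconfig needs a higher version number.
    """
    new_config = copy.deepcopy(config)
    new_config.setdefault("settings", {})["electionTimeoutMillis"] = timeout_ms
    new_config["version"] = config["version"] + 1
    return new_config


def election_handoff_command(enabled):
    """Build the setParameter command document for enableElectionHandoff.

    The command name must be the first key, which dict insertion order preserves.
    """
    return {"setParameter": 1, "enableElectionHandoff": enabled}


# Hypothetical current config, shaped like the output of replSetGetConfig.
example_config = {
    "_id": "rs0",
    "version": 1,
    "members": [
        {"_id": 0, "host": "nodeA:27017"},
        {"_id": 1, "host": "nodeB:27017"},
        {"_id": 2, "host": "nodeC:27017"},
    ],
}

new_config = with_election_timeout(example_config, 10000)
handoff_off = election_handoff_command(False)

print(new_config["settings"])
print(handoff_off)
```

The `new_config` document would be the argument to `replSetReconfig` on the primary, and `handoff_off` would be run as an admin command on each member, since `setParameter` applies per `mongod`.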
Let's say Node A goes down; it then takes 10 seconds to decide who must be the next primary. After those 10 seconds, say Node B calls an election and becomes primary. So disabling “enableElectionHandoff” works well in this case.
Now take a situation where Node A suffers a network glitch, is not visible to Node B and Node C, and comes back online after 5 seconds. I expect Node A to become primary again automatically. What actually happens is that Node A rejoins and all 3 nodes are in secondary state. At the end of the 10 seconds, Node B becomes primary, not Node A.
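
To make the observation above reproducible, the member states can be read from the output of `replSetGetStatus` (what `rs.status()` prints). The sketch below is only an illustration: the helper names and the sample document are hypothetical, and the sample keeps just the `name` and `stateStr` fields of each member.

```python
def summarize_members(status):
    """Map each member name to its stateStr from a replSetGetStatus-style document."""
    return {member["name"]: member["stateStr"] for member in status["members"]}


def current_primary(status):
    """Return the name of the member reporting PRIMARY, or None if there is none."""
    for member in status["members"]:
        if member["stateStr"] == "PRIMARY":
            return member["name"]
    return None


# Hypothetical snapshot taken right after Node A rejoins: no primary yet.
sample_status = {
    "set": "rs0",
    "members": [
        {"name": "nodeA:27017", "stateStr": "SECONDARY"},
        {"name": "nodeB:27017", "stateStr": "SECONDARY"},
        {"name": "nodeC:27017", "stateStr": "SECONDARY"},
    ],
}

print(summarize_members(sample_status))
print(current_primary(sample_status))
```

Polling this once a second during a glitch would show how long the set stays without a primary and which node wins the election.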
My initial assumption was that I had made MongoDB tolerant of network failures, but that does not seem to be the case. Either way, with or without “enableElectionHandoff”, Node B becomes primary and we have to fail back manually.
How do we deal with this? How do we make MongoDB tolerant of network errors?