Way to detect that the primary is down

Kim_Hakseon · July 9, 2021, 4:23am

We are testing failure and fail-over internally.
To simulate a failure like this, I killed the primary with the kill -9 command.

As a result, the fail-over took too long.
Voting and promotion took less than a second, but the primary being down was recognized too late, so fail-over took more than 30 seconds overall.

Is there a faster way to detect that the primary is down?

tapiocaPENGUIN · July 9, 2021, 6:29pm

From the MongoDB Docs

Lowering the electionTimeoutMillis replication configuration option from the default 10000 (10 seconds) can result in faster detection of primary failure. However, the cluster may call elections more frequently due to factors such as temporary network latency even if the primary is otherwise healthy. This can result in increased rollbacks for w : 1 write operations.

Source
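
As a rough illustration of the option quoted above, here is a minimal Python sketch of lowering settings.electionTimeoutMillis. The helper names (with_election_timeout, apply_election_timeout) are hypothetical, and the second function assumes the third-party pymongo driver is installed and that the URI points at the current primary. The 5000 ms value used in the note below is only an example, not a recommendation.

```python
import copy


def with_election_timeout(config, timeout_ms):
    """Return a copy of a replica set config document with a new election timeout.

    `config` is assumed to have the shape returned under the "config" key
    of the replSetGetConfig command.
    """
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    new_config = copy.deepcopy(config)
    # A reconfig is expected to carry a higher version than the current config.
    new_config["version"] = config["version"] + 1
    new_config.setdefault("settings", {})["electionTimeoutMillis"] = timeout_ms
    return new_config


def apply_election_timeout(uri, timeout_ms):
    """Fetch the current config and send the modified one back (needs a live primary)."""
    from pymongo import MongoClient  # third-party driver, assumed to be installed

    client = MongoClient(uri)
    current = client.admin.command("replSetGetConfig")["config"]
    client.admin.command("replSetReconfig", with_election_timeout(current, timeout_ms))
```

For example, apply_election_timeout("mongodb://localhost:27017", 5000) would halve the default timeout. As the quoted documentation warns, a lower value trades faster detection for more frequent elections, so it is worth testing under realistic network conditions first.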

Kim_Hakseon · July 10, 2021, 2:33am

According to the log, the primary’s health check is performed several times before the primary is recognized as down and the actual fail-over is performed.

I want to reduce the time it takes to recognize that the primary is down.
Are there any related options?

kevinadi · July 19, 2021, 1:28am

Hi @Kim_Hakseon

"Voting and promotion took less than a second,"

Do you mean that a new primary was elected less than a second after the kill?

"but the primary being down was recognized too late, so fail-over took more than 30 seconds overall."

Do you mean that the app recognized the new primary 30 seconds after the kill?

I would concur with the link posted by @tapiocaPENGUIN. The election timeout is the only relevant configuration parameter as far as I’m aware. The default setting was selected to tolerate network hiccups and unexpected transient hardware slowness. Otherwise, an election would be held every time there’s some temporary issue that resolves itself within seconds, which could be detrimental to the operation of the replica set in the long run.

If, however, you would like a quicker election when stepping down the current primary with the rs.stepDown() command (e.g. for maintenance), MongoDB 4.0.2 added the enableElectionHandoff parameter (enabled by default), which can reduce the downtime.
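
A minimal Python sketch of that planned step-down path might look like the following. The function names are hypothetical, the 60 and 10 second values are illustrative only, and the last function assumes the third-party pymongo driver and a URI that reaches the current primary. It only reads enableElectionHandoff, it does not change it.

```python
def step_down_command(step_down_secs=60, catch_up_secs=10):
    """Build a replSetStepDown command document.

    The step-down period has to be longer than the secondary catch-up period.
    """
    if catch_up_secs >= step_down_secs:
        raise ValueError("step_down_secs must be greater than catch_up_secs")
    # The command name must be the first key; dicts keep insertion order.
    return {
        "replSetStepDown": step_down_secs,
        "secondaryCatchUpPeriodSecs": catch_up_secs,
    }


def handoff_query():
    """Build a getParameter command that reads enableElectionHandoff."""
    return {"getParameter": 1, "enableElectionHandoff": 1}


def step_down_primary(uri):
    """Report the handoff setting, then ask the primary to step down (needs a live primary)."""
    from pymongo import MongoClient  # third-party driver, assumed to be installed
    from pymongo.errors import AutoReconnect

    client = MongoClient(uri)
    reply = client.admin.command(handoff_query())
    print("enableElectionHandoff:", reply.get("enableElectionHandoff"))
    try:
        client.admin.command(step_down_command())
    except AutoReconnect:
        # Some server versions close client connections when the primary steps down.
        pass
```

This covers only the planned case. A primary killed with kill -9 cannot hand off the election, so the election timeout still governs how quickly that kind of failure is detected.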

Best regards
Kevin