Replica set with 3 DB Nodes and 1 Arbiter

Hi,

I have created a replica set of 3 database nodes and one arbiter:
2 database nodes are in the production environment, and the DR site holds the arbiter and another secondary DB node. The arbiter also holds a vote in my configuration,
so that if one node fails,
there are still at least 3 votes available to promote a secondary to primary.
Is my configuration wrong?
Is having 4 voting members, including the arbiter, wrong or correct?
Please help.
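
For reference, the configuration I created looks roughly like the sketch below (hostnames are placeholders, not the real ones):

```python
# Rough sketch of the configuration described above; hostnames are placeholders.
from pymongo import MongoClient

config = {
    "_id": "rs0",
    "members": [
        {"_id": 0, "host": "prod-db-1:27017"},  # production, data-bearing
        {"_id": 1, "host": "prod-db-2:27017"},  # production, data-bearing
        {"_id": 2, "host": "dr-db-1:27017"},    # DR site, data-bearing secondary
        {"_id": 3, "host": "dr-arbiter:27017", "arbiterOnly": True},  # DR site, votes but holds no data
    ],
}

# Initiate the replica set by sending replSetInitiate directly to one member.
client = MongoClient("prod-db-1", 27017, directConnection=True)
client.admin.command("replSetInitiate", config)
```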

Hello @Kasun_Magedaragama :wave:,
and welcome to the community. We’re glad to have you join us and we look forward to your contributions.

An even number of voting members is always a bad choice for elections. You generally add a cheap (non-data-holding) arbiter to achieve an odd number of voting members - you can look at an arbiter as a tie-breaker. The MongoDB Documentation provides further detail.


Besides this, you should also keep the MongoDB write concern in mind.

  • Write concern allows your applications to request a certain number of acknowledgments for a write operation from the MongoDB cluster.
  • These acknowledgments represent increasing durability guarantees for a given write operation.
  • Write concern comes with a trade-off of write speed.
  • The higher the guarantee you need that a write is durable, the more time the overall write operation requires to complete.
  • Both replica sets and sharded clusters support write concern.

A write concern of zero means that the application doesn’t wait for any acknowledgments. The write might succeed or fail. The application doesn’t really care. It only checks that it can connect to the node successfully.

The default write concern is one. That means the application waits for an acknowledgment from a single member of the replica set, specifically, the primary. This is a baseline guarantee of success.

Write concerns greater than one increase the number of acknowledgments to include one or more secondary members. Higher levels of write concern correspond to a stronger guarantee of write durability.

Majority is a keyword that translates to a majority of replica set members. Divide the number of members by two and round up. So a three-member replica set has a majority of two, a five-member replica set has a majority of three, and so on. The nice thing with majority is that you don’t have to update your write concern if you increase the size of your replica set.
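
To make the write concern levels above concrete, here is a minimal PyMongo sketch (connection string, database and collection names are placeholders):

```python
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # placeholder hosts
db = client["test"]

# w=0: fire and forget, no acknowledgment is awaited.
unacked = db.get_collection("events", write_concern=WriteConcern(w=0))
unacked.insert_one({"type": "ping"})

# w=1 (the default): acknowledged by the primary only.
primary_only = db.get_collection("events", write_concern=WriteConcern(w=1))
primary_only.insert_one({"type": "ping"})

# w="majority": acknowledged by a majority of data-bearing voting members.
majority = db.get_collection("events", write_concern=WriteConcern(w="majority"))
majority.insert_one({"type": "ping"})
```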

Hope this helps
Michael


Dear Michael,

Thanks a lot for your response. What we thought is: if one DB node goes down,
then in order to elect a primary there should still be 3 remaining nodes, right?
That means 2 DB nodes + the arbiter?

Correct me if I’m wrong.

Hello @Kasun_Magedaragama

sorry for the late response, unfortunately work kept me busy.

I struggle with a clear answer as well. If your primary member goes down when running in a standard configuration (odd number of hosts), then you run the risk of a tie when voting. Maybe @Stennie_X can add more information on why an odd number is the recommended setup.

My understanding of the voting process is the following:

The primary node is the first point of contact for any client communicating with the database. Even if secondaries go down, the client will continue communicating with the node acting as primary until the primary is unavailable.

Elections take place whenever there’s a change in topology. Reconfiguring a replica set will always trigger an election that may or may not elect a new primary. But you will definitely see a new primary elected in two cases:

  • anytime the current primary node becomes unavailable, or
  • when the current primary node steps down to be a secondary.
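
For example, the second case can be triggered deliberately with the replSetStepDown command. A minimal PyMongo sketch (placeholder hosts, arbitrary step-down period):

```python
from pymongo import MongoClient
from pymongo.errors import ConnectionFailure

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # placeholder hosts

try:
    # Ask the current primary to step down and not seek re-election for 60 seconds.
    # An election among the remaining members follows.
    client.admin.command("replSetStepDown", 60)
except ConnectionFailure:
    # Depending on the server version, the stepping-down primary may close client
    # connections, so a connection error here is expected; the driver will then
    # rediscover the new primary.
    pass
```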

The method to figure out which secondary will run for election begins with priority and whichever node has the latest copy of the data. Let’s say every node in your set has the same priority, which is the default, and one node has the latest copy of the data. That node is going to run for election, and then automatically vote for itself. Then it’s going to ask the other nodes for support in the election. The response should be: you have a pretty recent copy of the data, you seem like a good candidate. Then they’ll pledge their support as well, and that node will be elected primary.

There is also the very slim possibility that two nodes run for election simultaneously. But in a replica set with an odd number of nodes, this doesn’t matter.

Those two nodes are both going to run, which means they’re both going to vote for themselves. The remaining node then essentially decides which of the two becomes primary by acting as a tiebreaker.
This becomes a problem when we have an even number of voting members in a set.

If two secondaries are running for election simultaneously and there are an even number of remaining nodes in the set, there’s a possibility that they split the vote and there’s a tie. Now a tie is not the end of the world, because the nodes will just start over and hold another election. The problem with repeating elections over and over is that any applications accessing the data will have to pause all activity and wait until a primary is elected. An even number of nodes increases the chances an election has to be repeated, so we generally try to keep an odd number in our replica sets.

Another important aspect of elections is the priority assigned to each node in a set. Priority is essentially the likelihood that a node will become the primary during an election. The default priority for a node is 1, and any node with priority 1 or higher can be elected primary. You can increase the priority of a node if you want it to be more likely that this node becomes primary, but changing this value alone does not guarantee that.
You can also set the priority of a node to 0 if you never want that node to become primary. A priority 0 node can still vote in elections, but it can’t run for election.
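
To illustrate (a sketch with hypothetical member indexes; adjust them to your own configuration), priorities are changed by reconfiguring the replica set:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # placeholder hosts

# Fetch the current replica set configuration from the primary.
config = client.admin.command("replSetGetConfig")["config"]

# Hypothetical changes: prefer member 0 as primary, never elect member 2.
config["members"][0]["priority"] = 2  # more likely (but not guaranteed) to become primary
config["members"][2]["priority"] = 0  # can still vote, but never runs for election

# Bump the version and apply the new configuration.
config["version"] += 1
client.admin.command("replSetReconfig", config)
```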

Michael

To elect a primary a majority of voting members must vote for the candidate.

I believe majority is expressed as floor(voting_members_in_cluster / 2) + 1. (I cannot find a reference for this in the documentation right now.)
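
Taking that formula at face value: floor(3 / 2) + 1 = 2 and floor(4 / 2) + 1 = 3, so going from 3 to 4 voting members raises the required majority from 2 votes to 3.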

In the case where an even-sized cluster is split (with 2 nodes partitioned on each side), it is impossible to gain a majority and hence no candidate is elected primary.


Hi

I am aware of that rule but in a slightly different context:

MongoDB write concern is an acknowledgment mechanism that developers can add to write operations.
Higher levels of acknowledgment produce a stronger durability guarantee. Durability means that the write has propagated to the number of replica set member nodes specified in the write concern.

Majority here is defined as a simple majority of replica set members. So divide by two, and round up.
Taken from M103: Basic Cluster Administration

I was under the same assumption as you, but I found no statement anywhere that an election is blocked with an even number of members, and I doubt that it would be.

Michael

@chris you are correct. If you have a four-node replica set with two nodes in each of two different data centers and there is a network partition so that you cannot get a majority vote count, then you will be left with no PRIMARY server and four SECONDARY servers. This is why it is recommended to have an odd number of voting members, with a majority of nodes in the PRIMARY data center, or even better, spread across three or more data centers if possible.

Welcome to the community @Kasun_Magedaragama!

As @michael_hoeller mentioned, an odd number of voting members is recommended. The addition of an arbiter to your 3 member deployment adds operational risk without providing any benefit.

Primaries are elected (and sustained) based on a consensus vote from a strict majority (>50%) of configured voting members. The strict majority requirement is to avoid situations that might otherwise allow more than one primary (for example, a network partition separating voting members equally between data centres).

If you add additional members to a replica set, there should be some motivation such as increasing data redundancy or improving fault tolerance. Adding an arbiter to a 3 member replica set does not contribute to either of those aspects.

With 3 voting members, the strict majority required to elect a primary is 2 votes which means there is a fault tolerance of 1 member that can be unavailable. If a majority of voting members aren’t available, a primary cannot be elected (or sustained) and all data-bearing members will transition to SECONDARY state.

What are the consequences of adding an arbiter?

With 4 voting members, the strict majority required to elect a primary is 3 votes which means there is still a fault tolerance of 1 despite the added member. There is also no improvement in data redundancy, since an arbiter only participates in elections.
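
A small sketch of that arithmetic, putting the required majority and the resulting fault tolerance side by side:

```python
# Strict majority and fault tolerance for different numbers of voting members.
for members in (3, 4, 5, 6, 7):
    majority = members // 2 + 1           # votes required to elect and sustain a primary
    fault_tolerance = members - majority  # members that can be lost while keeping a primary
    print(f"{members} voting members: majority={majority}, fault tolerance={fault_tolerance}")

# 3 voting members: majority=2, fault tolerance=1
# 4 voting members: majority=3, fault tolerance=1   <- the extra (arbiter) vote buys nothing
# 5 voting members: majority=3, fault tolerance=2
# 6 voting members: majority=4, fault tolerance=2
# 7 voting members: majority=4, fault tolerance=3
```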

However, if the 4th member is an arbiter this introduces some potential operational complications when the replica set is running in a degraded state (elected primary with one data-bearing member unavailable):

  • An arbiter contributes to the voting majority for a replica set election but cannot contribute to acknowledgement of write operations (since an arbiter doesn’t write any data).

  • If you want to avoid potential rollback of replica set writes, a majority write concern is recommended. However, a majority write concern cannot be acknowledged if your replica set currently only has a voting majority (using an arbiter) rather than a write majority. Operations with majority write concern will either block indefinitely (default behaviour) or time out (if you have specified the wtimeout option), as sketched in the example after this list.

  • Cache pressure will be increased because more data will be pinned in cache waiting for the majority commit point to advance. Depending on your workload, this can cause significant problems if your replica set is in degraded state. There is a startup warning for Primary-Secondary-Arbiter (PSA) deployments which also would apply to your PSSA scenario in MongoDB 4.4 and earlier: Disable Read Concern Majority. For MongoDB 5.0+, please see Mitigate Performance Issues with PSA Replica Set.
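
As a concrete illustration of the second point above (a hypothetical sketch; hosts, names and the timeout value are placeholders), a majority write against a replica set that currently only has a voting majority will fail once wtimeout expires:

```python
from pymongo import MongoClient, WriteConcern
from pymongo.errors import WTimeoutError

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # placeholder hosts
orders = client["test"].get_collection(
    "orders",
    # Wait for a majority of data-bearing members, but give up after 5 seconds.
    write_concern=WriteConcern(w="majority", wtimeout=5000),
)

try:
    orders.insert_one({"status": "new"})
except WTimeoutError:
    # In a degraded PSA/PSSA state (one data-bearing member down, arbiter still voting)
    # the write is applied on the primary but cannot be majority-acknowledged,
    # so it times out instead of blocking indefinitely.
    print("majority write concern could not be satisfied before wtimeout")
```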

Should you add an arbiter?

Typically, no.

An arbiter can still be useful if you understand the operational caveats and are willing to compromise robustness for cost savings.

Considering an arbiter as a tie-breaker is the best possible interpretation, but adding an arbiter does not have the same benefits as a secondary.

Where possible I would strongly favour using a data-bearing secondary over an arbiter for a more robust production deployment.

How is voting majority determined?

The required voting majority is based on the configured number of replica set members, not on the number that are currently healthy. The voting majority for a replica set with 4 voting members is always 3 votes.

Think of replication as analogous to RAID storage: your configuration determines the level of data redundancy, performance, and fault tolerance. If there is an issue with availability of one (or more) of your replica set members, the replica set will run in a degraded mode which allows continued write availability (assuming you still have a quorum of healthy voting members to sustain a primary) and read availability (as long as at least one data-bearing member is online).

What about even number of voting members?

Explanations around replica set configuration are often reductive to try to provide more straightforward understanding.

The election algorithm can handle an even number of voting members (for example, this is the scenario when you have a 3 member replica set with 1 member down). There are other factors that influence elections including member priority and freshness, so repeated tie-breaking scenarios should not be a concern. However, you generally want to remove any potential speed bumps for your normal deployment state (“all members healthy”). A configuration with an odd number of voting members is also easier to rationalise when considering scenarios for data redundancy and fault tolerance.

MongoDB’s historical v0 election protocol (default in MongoDB 3.0 and earlier) only supported one election at a time, so any potential ties had a more significant impact on elapsed time to reach consensus. The modern v1 election protocol (default in MongoDB 3.2+) supports multiple concurrent elections for faster consensus. If you want to learn more, there’s a relevant talk from MongoDB World 2015: Distributed Consensus in MongoDB.

How is majority calculated?

That is the correct determination for voting majority, but write majority is based on data-bearing members. Ideally those calculations should be the same for a deployment, but arbiters and members with special configuration (e.g. delayed secondaries) will have consequences for write acknowledgements.

In MongoDB 4.2.1+, the rs.status() output now has explicit majorityVoteCount and writeMajorityCount calculations to remove any uncertainty.
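
For example, those values can be read from any driver via the replSetGetStatus command that backs the rs.status() shell helper (a sketch with placeholder hosts):

```python
from pymongo import MongoClient

client = MongoClient("mongodb://host1,host2,host3/?replicaSet=rs0")  # placeholder hosts

status = client.admin.command("replSetGetStatus")

# Both fields are present in MongoDB 4.2.1+.
print("votes needed to elect a primary:  ", status["majorityVoteCount"])
print("members needed for w='majority':  ", status["writeMajorityCount"])
```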

Regards,
Stennie

