readConcern with more than 1 secondary and an arbiter

Tamar_Nirenberg · October 12, 2021, 9:35am

Hi,

I have a mongodb which includes 5 nodes - 1 Primary, 3 secondaries, and 1 arbiter.

When I start Mongo, I see this warning:
REPL [replexec-0] ** WARNING: This replica set uses arbiters, but readConcern:majority is enabled

The Mongo documentation only descibes a setup of 1 Primary, 1 secondary and 1 arbiter, saying that no najority will be avilable if Primary fails.

with my setup then, should readConcern:majority be disabled indeed?

Thanks,
Tamar

MaBeuLux88 · October 12, 2021, 5:04pm

Hi @Tamar_Nirenberg,

Arbiters are not recommended in production environments in general because they can create “weird” situations that can be dangerous for the stability of the RS.

Let’s take your example: 4 nodes + 1 arbiter. Your majority = 3. So this means that – in theory – you should still be up and running if 2 nodes are suddenly out of the picture (data center that contains 2 nodes has a connection failure for example).
Yet, it’s not the case here. If 2 nodes fails, you will end up with 2 nodes (P+S) and one arbiter. The arbiter participates in the majority for the votes, but not in the majority for the majority commit point. This will force the majority commit point to lag behind which will start to create some cache pressure on your storage engine that will need to keep all the changes that happen after the commit point on disk to retain a durable history.

See Performance Issues with PSA replica sets and Mitigate Performance Issues with PSA Replica Set.

The consequence is that if 2 nodes fail, your majority commit point cannot move forward and you cannot read anything with readConcern=majority + you are building cache pressure (==timed bomb).

Same problem if you try to write something with writeConcern=“majority” (==3). As only 2 nodes are really bearing data in your PSA cluster (S+S down) => You cannot write.

Conclusion: Don’t use an Arbiter in production environment to avoid issues. If you do, you shouldn’t use writeConcern=“majority”. You are stuck at maximum w=2 in this scenario and you cannot use readConcern=“majority”. Note that doing this doesn’t solve the problem of the cache pressure that will start to build up in this situation as soon as you are in this state.

Cheers,
Maxime.

Tamar_Nirenberg · October 13, 2021, 9:57am

Hi Maxime,

Thank you for the detailed reply.

Let me describe the situation I have, and then ask some questions:

My original setting for the Mongo was a Primary with two secondaries, all in the same site.

I was then asked to create a DR for this site for Mongo.
Hardware for the remote site was already bought and configured - 3 servers, only two with sufficient disk space to accomodate our Mongo
(which is ~400 GB).
I have then 6 avialable servers now, 3 on the main site, and 3 on the remote site, where one of them on the remote site is with small disk space.

To enable Mongo survival in a case of main site failure, I did the following:

Moved one secondary from the main site to the remote site
Created an additional secondary on the remote site
Created an arbiter on the remote site - on the server with little disk space.
Changed the priroity of the Primary and secondary on the main site to 2
Priority for the secondaries on the remote site set to 1

My logic was then:

If Primary on main site fails - still have majority for read and write (4/5), and the secondary on the main site will become a Primary
(as it has the higher priority).
If all main site nodes fail (Primary and one seconday) - remote site will have now Primary-seconday-arbiter:
still have a majority for read and write (3/5)

Do I miss something in my logic?

MaBeuLux88 · October 13, 2021, 2:27pm

DR = Disaster Recovery here, correct?

Usually when you plan for a DR plan, you try to see how long it would take for you to recreate the entire cluster from a backup which is different from what you are doing here.

So, at the beginning, you had 3 nodes (P+S+S) in DC1.
Now you have 2 nodes with priority=2 (P+S) in DC1 and 3 nodes (S+S+A) on DC2.

A better solution would be to keep DC1 untouched (P+S+S) and add 2 more (S+S) in DC2. If you prefer to have the P in DC1, you can indeed set p=2 in all the nodes in DC1 and p=1 (default value) in all the nodes in DC2. You can keep the machine with the small disk in DC2 for your next pet project !

That’s correct. And note that this would still be true with your setup (P+S in DC1 & S+S+A in DC2) or my proposed setup: (P+S+S in DC1 & S+S in DC2).
A difference though between my proposition and yours. Let’s imagine that there is a 1 sec latency between the 2 DCs (NY & Tokyo). In my config, I can answer read & write queries with readConcern majority or writeConcern majority by only reaching the 3 nodes in DC1 (which of course will be the first to replicate the data in 99.9% of the cases because of the extra 1sec latency with DC2). Which means that my DC1 can answer in a few milliseconds. With your setup, you always have to wait for the replication to reach DC2 to get the majority - meaning that all your queries will take 2 sec at least I think.

That’s almost correct - but actually completely wrong . Let me explain.
Now your cluster is configured with 5 nodes (true for your setup or my proposed one). Which means that majority = 3. You are correct that your PSA in DC2 will be able to elect a primary, but, majority is still =3. The fact that P+S are dead in DC1 doesn’t change that - unless you reconfigure your Replica Set (RS) to remove them completely from the equation - which is a bad idea if you want to restart them and recover from the oplog.
So now you have a PSA in DC2 with majority = 3 but one node (the A!) doesn’t have the data. So to be clear, the arbiter counts for the votes in the majority (that’s actually it’s sole purpose in life) but not in the majority for the data. So with your PSA in DC2, you cannot satisfy read & write queries asking for the majority. Worst that that, your P & S both have to keep in memory the entire history of the write operations that are still happening with w=1 or w=2 because the majority commit point (of WiredTiger) cannot move forward anymore => This builds up cache pressure + all read & write queries with majority are in timeout.

With my proposed solution, this problem cannot happen because I don’t have an arbiter. If I can elect a node (so I have at least 3 nodes up and running), as they are all data bearers, they all counts for both the majority in the votes and in the data commit points.

Coming back to the initial discussion about creating a Disaster Recovery Plan: adding 2 extra secondaries won’t help by definition because “disaster” means that the entire cluster is dead somehow and the only solution left to stay online is to recover from a backup.
Note that adding 2 secondaries well help to AVOID this situation entirely but when it’s dead, it’s dead.

So, to plan for a DRP with these 6 machines (including the crippled one that you can forget - no disk = useless in all the scenarios), I would just stick with a P+S with p=2 in DC1 and just one S with p=1 in DC2. I would then use the 2 other machines (1 in DC1 & 1 in DC2) to save backups.
If you ever need to recover from these backups, I would recreate the 2 nodes in DC1 using the data in the backup machine in DC1 (close so faster data transfer). Same for DC2.

Adding 2 extra secondaries will help prevent this from ever happening as you get 2 extra “chances” - but this won’t save you if you get hacked, someone gets access to a root user and deletes all your data - oplog included. Then your only option is to rebuild everything from your latest backup.

I hope this helps .

Cheers,
Maxime.

Tamar_Nirenberg · October 14, 2021, 9:36am

i Maxime,

Thank again for the detailed reply.

My main concern was that if the whole main site, with a P+S+S configuration, goes down - I will be left with 2 survivng nodes. Since this is not a majority (2/5) non will be elected to become a primary, and I will be left with no active system.

If I remeber correctly, this is indeed what happened when I added two secondaries to the cluster (P+S+S+S+S) and stopped the primary and two secondaries - none was elected…

Regarding recovering from backup - unfortuantally as the only tool I can use is mongodump, with these sizes (and expected to grow), it took many hours - not acceptable for my customer to have his system down for such a long period, so I canno count on it unless absolutely everything else fails, and I need to rescue the data somehow.

Regarding your first point - true, read conecern majority with my setup (P+S on main site, S+S+A on remote) can cause queries to become slower since they have to wait for a secondary on the remote to respond.
But maybe that can be fixed with changing the read concern to something else?

Thanks,
Tamar

system · October 19, 2021, 9:36am

This topic was automatically closed 5 days after the last reply. New replies are no longer allowed.