DR = Disaster Recovery here, correct?
Usually when you build a DR plan, you work out how long it would take you to recreate the entire cluster from a backup, which is different from what you are doing here.
So, at the beginning, you had 3 nodes (P+S+S) in DC1.
Now you have 2 nodes with priority=2 (P+S) in DC1 and 3 nodes (S+S+A) in DC2.
A better solution would be to keep DC1 untouched (P+S+S) and add 2 more secondaries (S+S) in DC2. If you prefer to have the P in DC1, you can indeed set p=2 on all the nodes in DC1 and p=1 (the default value) on all the nodes in DC2. You can keep the machine with the small disk in DC2 for your next pet project!
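As a sketch, that priority layout could be applied from mongosh roughly like this (the hostnames are made up - adjust to your environment):

```javascript
// Hypothetical hostnames - the point is the priority split:
// p=2 in DC1 (primary preference), p=1 (the default) in DC2.
cfg = rs.conf();
cfg.members = [
  { _id: 0, host: "dc1-a.example.net:27017", priority: 2 }, // P (preferred)
  { _id: 1, host: "dc1-b.example.net:27017", priority: 2 }, // S
  { _id: 2, host: "dc1-c.example.net:27017", priority: 2 }, // S
  { _id: 3, host: "dc2-a.example.net:27017", priority: 1 }, // S
  { _id: 4, host: "dc2-b.example.net:27017", priority: 1 }  // S
];
rs.reconfig(cfg);
```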
That’s correct. And note that this would still be true with your setup (P+S in DC1 & S+S+A in DC2) or my proposed setup: (P+S+S in DC1 & S+S in DC2).
A difference though between my proposal and yours. Let’s imagine that there is a 1 sec latency between the 2 DCs (NY & Tokyo). In my config, I can answer read & write queries with readConcern majority or writeConcern majority by only reaching the 3 nodes in DC1 (which of course will be the first to replicate the data in 99.9% of the cases because of the extra 1 sec latency with DC2). Which means that my DC1 can answer in a few milliseconds. With your setup, you always have to wait for the replication to reach DC2 to get the majority - meaning that all your majority queries will take at least 2 sec, I think.
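For illustration, here is what such a majority write and read look like in mongosh (the collection and documents are hypothetical); in my layout the 3 DC1 nodes alone can acknowledge them:

```javascript
// With 3 data-bearing voters in DC1 and majority = 3, this write
// can be acknowledged without waiting for the cross-DC hop:
db.orders.insertOne(
  { sku: "abc-123", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);

// Same idea for a majority read:
db.orders.find({ sku: "abc-123" }).readConcern("majority");
```

This requires a live replica set, so treat it as a session sketch rather than something to copy verbatim.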
That’s almost correct - but actually completely wrong. Let me explain.
Now your cluster is configured with 5 nodes (true for your setup or my proposed one), which means that majority = 3. You are correct that your PSA in DC2 will be able to elect a primary, but the majority is still 3. The fact that P+S are dead in DC1 doesn’t change that - unless you reconfigure your Replica Set (RS) to remove them completely from the equation, which is a bad idea if you want to restart them and recover from the oplog.
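The majority is derived from the configured voting members - floor(voters / 2) + 1 - not from how many members happen to be alive. A tiny sketch:

```javascript
// Majority of configured voting members: floor(n / 2) + 1.
// It comes from the replica set configuration, which is why
// losing P+S in DC1 does not shrink it from 3 to 2.
function votingMajority(configuredVoters) {
  return Math.floor(configuredVoters / 2) + 1;
}

console.log(votingMajority(5)); // 3 - your 5-node cluster
console.log(votingMajority(3)); // 2 - a classic PSS
```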
So now you have a PSA in DC2 with majority = 3, but one node (the A!) doesn’t have the data. To be clear, the arbiter counts towards the voting majority (that’s actually its sole purpose in life) but not towards the data majority. So with your PSA in DC2, you cannot satisfy read & write queries asking for the majority. Worse than that, your P & S both have to keep in memory the entire history of the write operations that are still happening with w=1 or w=2, because the majority commit point (of WiredTiger) cannot move forward anymore => this builds up cache pressure + all read & write queries with majority time out.
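Concretely, in that degraded PSA state a majority write just sits there until its timeout (a sketch - I’m assuming a wtimeout is set, because without one the write would wait indefinitely):

```javascript
// Only 2 data bearers are reachable but majority = 3:
// the write is applied on P and S, yet the majority
// acknowledgement never arrives and the wtimeout fires.
db.orders.insertOne(
  { sku: "abc-123", qty: 1 },
  { writeConcern: { w: "majority", wtimeout: 5000 } }
);
// Expect a WriteConcernError along the lines of
// "waiting for replication timed out".
```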
With my proposed solution, this problem cannot happen because I don’t have an arbiter. If I can elect a node (so I have at least 3 nodes up and running), then, as they are all data bearers, they all count towards both the voting majority and the data commit point.
Coming back to the initial discussion about creating a Disaster Recovery Plan: adding 2 extra secondaries won’t help by definition because “disaster” means that the entire cluster is dead somehow and the only solution left to stay online is to recover from a backup.
Note that adding 2 secondaries will help to AVOID this situation entirely, but when it’s dead, it’s dead.
So, to plan for a DRP with these 6 machines (including the crippled one that you can forget - no disk = useless in all the scenarios), I would stick with a P+S with p=2 in DC1 and just one S with p=1 in DC2. I would then use the 2 other machines (1 in DC1 & 1 in DC2) to store backups.
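A sketch of that 3-member DRP layout (again with made-up hostnames); the 2 leftover machines stay outside the replica set as backup targets:

```javascript
// 3 voting, data-bearing members - so majority = 2 -
// with p=2 keeping the primary in DC1.
cfg = rs.conf();
cfg.members = [
  { _id: 0, host: "dc1-a.example.net:27017", priority: 2 }, // P (preferred)
  { _id: 1, host: "dc1-b.example.net:27017", priority: 2 }, // S
  { _id: 2, host: "dc2-a.example.net:27017", priority: 1 }  // S
];
rs.reconfig(cfg);
```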
If you ever need to recover from these backups, I would recreate the 2 nodes in DC1 using the data on the backup machine in DC1 (it’s close, so the data transfer is faster). Same for DC2.
Adding 2 extra secondaries will help prevent this from ever happening, as you get 2 extra “chances” - but this won’t save you if you get hacked and someone with root access deletes all your data, oplog included. Then your only option is to rebuild everything from your latest backup.
I hope this helps.