Production Environment Faults: failed to get connections from primary

All the members of a replica set (RS) should be identical. It’s good practice because it ensures that you are actually highly available (HA) and that any node can handle the workload.

Also, remember that all the nodes perform the write operations (using and maintaining indexes, etc.). The secondaries work almost as much as the primary, especially if you are using readPreference options like nearest or secondaryPreferred.

If the secondaries start to lag behind the primary because they can’t keep up, this could be a big problem for the cluster: you won’t be able to use readConcern or writeConcern “majority”, and cache pressure will start building on the primary as the majority commit point falls behind.
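To make the lag concrete, here is a minimal Python sketch that computes how far each secondary is behind the primary. It assumes a document shaped like the output of replSetGetStatus (the `members`, `name`, `stateStr` and `optimeDate` fields); the host names and timestamps in `sample_status` are made up for illustration, and `secondary_lag_seconds` is a hypothetical helper, not a driver function.

```python
from datetime import datetime


def secondary_lag_seconds(status):
    """Return {member name: lag in seconds} for every SECONDARY member."""
    members = status["members"]
    # Assumes exactly one member currently reports itself as PRIMARY.
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    return {
        m["name"]: (primary["optimeDate"] - m["optimeDate"]).total_seconds()
        for m in members
        if m["stateStr"] == "SECONDARY"
    }


# Made-up sample in the shape of a replSetGetStatus result.
sample_status = {
    "members": [
        {"name": "mongo-0:27017", "stateStr": "PRIMARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 30)},
        {"name": "mongo-3:27017", "stateStr": "SECONDARY",
         "optimeDate": datetime(2024, 1, 1, 12, 0, 5)},
    ]
}

lag = secondary_lag_seconds(sample_status)
for name, seconds in lag.items():
    print(f"{name} is {seconds:.0f}s behind the primary")
```

With a real cluster you would feed it the status document from your driver or from rs.status() instead of the sample. A lag that keeps growing on the small nodes would support the theory that they can’t keep up.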

Do you have a single DB? It looks like the DB turnright already has about 39 to 42 GB of indexes. Given that indexes need to fit in memory, that’s already too much for the secondaries with less RAM to handle.

RAM = OS RAM + Connections + Indexes + Working Set + extra RAM for queries (aggregations, in-memory sorts, everything else). It looks like your “main” node has enough for the quantity of data I see here, but the secondaries are too small.
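As a rough illustration of that formula, here is a small Python sketch. Only the 42 GB index figure comes from this thread (treated as an approximation); every other number, including the 1 MiB per connection and the node sizes, is a placeholder assumption that you should replace with your own measurements.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def estimate_ram_bytes(os_bytes, connections, bytes_per_connection,
                       index_bytes, working_set_bytes, query_headroom_bytes):
    """RAM = OS + connections + indexes + working set + query headroom."""
    connection_bytes = connections * bytes_per_connection
    return (os_bytes + connection_bytes + index_bytes
            + working_set_bytes + query_headroom_bytes)


estimate = estimate_ram_bytes(
    os_bytes=2 * GIB,                # assumption: OS overhead
    connections=1000,                # assumption: open connections
    bytes_per_connection=1 * MIB,    # assumption: rough per-connection cost
    index_bytes=42 * GIB,            # upper index size mentioned in the thread
    working_set_bytes=10 * GIB,      # assumption: frequently accessed documents
    query_headroom_bytes=4 * GIB,    # assumption: aggregations, in-memory sorts
)

# Compare the estimate against two hypothetical node sizes.
for node_ram_gib in (32, 64):
    verdict = "fits" if node_ram_gib * GIB >= estimate else "too small"
    print(f"{node_ram_gib} GiB node: {verdict} "
          f"(estimate {estimate / GIB:.1f} GiB)")
```

The point is less the exact total than the comparison: every node that can become primary, or that serves reads, needs to clear the same estimate.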

I’m assuming that mongo-0, mongo-1 and mongo-2 are configured with priority = 2 so they have a greater chance of becoming primary than the other 2 nodes.
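If you want to confirm that assumption, a sketch like the following groups the members by priority. It expects a document shaped like the replica set configuration returned by rs.conf() (a `members` array with `host` and `priority`); `sample_config` is invented to mirror my guess about your setup, the names mongo-3 and mongo-4 are hypothetical, and `hosts_by_priority` is just an illustrative helper.

```python
def hosts_by_priority(config):
    """Group member hosts by election priority, highest priority first."""
    grouped = {}
    for member in config["members"]:
        # A member without an explicit priority uses the default of 1.
        priority = member.get("priority", 1)
        grouped.setdefault(priority, []).append(member["host"])
    return dict(sorted(grouped.items(), reverse=True))


# Invented sample mirroring the assumed setup.
sample_config = {
    "members": [
        {"host": "mongo-0:27017", "priority": 2},
        {"host": "mongo-1:27017", "priority": 2},
        {"host": "mongo-2:27017", "priority": 2},
        {"host": "mongo-3:27017", "priority": 1},
        {"host": "mongo-4:27017"},
    ]
}

by_priority = hosts_by_priority(sample_config)
for priority, hosts in by_priority.items():
    print(f"priority {priority}: {', '.join(hosts)}")
```

Nodes in the highest group are the ones most likely to be elected primary, so those are the ones that must be sized for the full workload at a minimum.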

I’m not sure whether it’s possible, but maybe the secondaries are so overloaded that they don’t have enough RAM left to accept new connections from the primary when something needs to be done (replication, etc.).

Cheers,
Maxime.
