Query about network failure and behaviour of replica sets

Joanne · May 29, 2020, 1:49pm

Hi Team,

Suppose there is a network partition and we transiently see two primaries in a replica set, as described in this link. Why would only one replica set experience the heartbeat issue or network failure? Wouldn't all replica sets go down if they are on the same VM?

Stennie_X · May 29, 2020, 2:22pm

Hi,

Can you describe your scenario in more detail:

Is this a replica set or sharded cluster? (since you mention “all replica sets”)
How have you distributed members of the deployment across host VMs?
What is the network partition scenario you are envisioning?

If you have multiple replica set members on the same host VM (which would not be advisable for members of the same replica set), a single host VM failing will cause multiple members of your deployment to be unavailable.
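
To make the co-location point concrete, here is a minimal sketch that works on a hand-written description of a deployment. It does not query MongoDB. The layout, the set names (`rsA`, `rsB`), the member labels, and the host names are all hypothetical placeholders:

```python
from collections import defaultdict

# Hypothetical layout: (replica set, member label, host VM).
# Every name here is a placeholder, not taken from a real deployment.
LAYOUT = [
    ("rsA", "primary", "host-1"),
    ("rsA", "secondary-1", "host-2"),
    ("rsA", "secondary-2", "host-3"),
    ("rsB", "primary", "host-1"),
    ("rsB", "secondary-1", "host-2"),
    ("rsB", "secondary-2", "host-2"),  # deliberately co-located with secondary-1
]


def members_lost(layout, failed_host):
    """Group the members that become unavailable when one host VM fails."""
    lost = defaultdict(list)
    for replica_set, member, host in layout:
        if host == failed_host:
            lost[replica_set].append(member)
    return dict(lost)


def colocated(layout):
    """List (replica set, host) pairs where one host carries 2+ members of the same set."""
    counts = defaultdict(int)
    for replica_set, _member, host in layout:
        counts[(replica_set, host)] += 1
    return sorted(pair for pair, n in counts.items() if n > 1)
```

In this made-up layout, `members_lost(LAYOUT, "host-2")` reports one lost member for `rsA` and two for `rsB`, and `colocated(LAYOUT)` flags the `rsB` pair on `host-2` as the placement to avoid.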

Regards,
Stennie

Joanne · May 29, 2020, 2:36pm

Hi Stennie,
We have a sharded cluster in which each replica set has 1 primary, 2 secondaries, and 1 arbiter (1P 2S 1A). All members of a given replica set are on different hosts, but there are 4 different replica sets, and each set's members are distributed across different VMs.

Secondly, suppose VM-1 hosts the primaries of replSet01 and replSet03, and the transient primary is logged in the replSet01 logs. If we conclude that the transient primaries were caused by a network failure, isn't it obvious that the replSet03 member on VM-1 would also log some network-related issue?

My other question: if there was a network issue, why would only the mongo process be affected? Why wouldn't other applications running on the same VM complain as well?

Stennie_X · May 29, 2020, 4:04pm

Hi,

Outcomes really depend on where the problem lies in your deployment. If there was a network connectivity problem on host VM-1, it would be reasonable to expect all instances on that host to be similarly affected.

However, each VM also has its own virtual network interfaces and resources. A perceived network issue from the point of view of replication could be the result of a specific VM being non-responsive to network heartbeat pings. The actual cause may be something administrative (for example, VM live migration or backup) or an issue with resource contention.

Is your question about an actual incident or a hypothetical scenario?

For an actual incident I suggest you try to create a timeline of activity based on the MongoDB and system log files from your deployment. Ideally you would have a monitoring/metrics system in place (for example, MongoDB Cloud Manager or Ops Manager) which would provide a starting point for your investigation.
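
As a rough illustration of the timeline idea, the sketch below merges lines from several log sources into one chronologically ordered list. It assumes every line starts with an ISO-8601 timestamp followed by a space, which is how the plain-text mongod log format (releases before 4.4) begins each line. System logs often use a different timestamp style and would need to be normalized first. The function names are my own, not part of any MongoDB tool:

```python
from datetime import datetime

# Assumed timestamp layout, e.g. 2020-05-29T13:49:01.123+0000
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%f%z"


def parse_line(source, line):
    """Split a log line into (timestamp, source, message); raises ValueError if no timestamp."""
    stamp, _, message = line.partition(" ")
    return datetime.strptime(stamp, TS_FORMAT), source, message.strip()


def build_timeline(logs):
    """Merge {source name: [lines]} into a single list sorted by timestamp.

    Lines that do not start with a parseable timestamp are skipped.
    """
    events = []
    for source, lines in logs.items():
        for line in lines:
            try:
                events.append(parse_line(source, line))
            except ValueError:
                continue
    return sorted(events, key=lambda event: event[0])
```

Passing something like `{"replSet01-vm1": lines_a, "replSet03-vm1": lines_b}` (hypothetical source labels) gives a side-by-side view of what each member logged around the time of the transient primary, which makes it easier to see whether both members on the same VM noticed a problem at the same moment.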

Regards,
Stennie

Joanne · June 2, 2020, 6:29pm

Just some edge cases to consider, thanks for the reply though. It helps!
