I am exploring sharded MongoDB clusters, and I have trouble understanding MongoS’s behavior during a network split.
My test environment works as expected until I simulate a network split.
In the “smallest” network partition, I have:
N MongoS servers
only 1 config server (out of 3 total), in SECONDARY status
and N shard servers
In this partition:
at first, all MongoDB servers + MongoS still accept connections and behave as expected
if I restart these MongoDB+MongoS servers => MongoS servers start but refuse connections (MongoNetworkError: connect ECONNREFUSED).
Do you know why? I must be missing something obvious.
As far as I can tell from startup logs, MongoS sends queries like {"find":"shards", "readConcern":{"level":"majority"}} to the config server, which seems guaranteed to fail during a network split (Command timed out waiting for read concern to be satisfied).
create and initialize a 3-member configuration server replica set => looks healthy, 1 PRIMARY + 2 SECONDARY
skip the shard replica set creation
start a MongoS server using this configuration replica set => MongoS accepts connections as expected
stop all servers (MongoDB + MongoS)
start only one configuration server MongoDB to simulate a network split => it accepts connections, and is in SECONDARY state as expected
start MongoS => it does not accept connections (ECONNREFUSED)
I expected MongoS to start and accept connections, even when the configuration replica set is in degraded mode (and only one SECONDARY configuration server is reachable by MongoS).
create and initialize a 3-member configuration server replica set => looks healthy, 1 PRIMARY + 2 SECONDARY
skip the shard replica set creation
Can I ask why you’re skipping the shard replica set creation in the above step? It sounds like you’re deliberately creating a mangled sharded cluster. In this state, I don’t think we can guarantee certain expected behaviour.
If you’re intending to simulate how a sharded cluster behave in a degraded state (which I think is a valid goal), I believe you need to have a working sharded cluster first, then use a tool like toxiproxy to simulate certain network conditions. I’m not sure if turning off part of the cluster will give you accurate information.
My initial test had a functioning sharded cluster (2 shards, each one on a 3-server replica set). That’s when I noticed this MongoS behavior when the cluster is in a degraded state.
I did a second test just to try and isolate the issue (reproducing the same behavior with fewer servers, while following the tutorial as much as possible). That’s why I skipped the shard replica set creation: even without it, I observed the same behavior. When my config replica set is healthy, MongoS accepts connections. But when the config replica set is in degraded state and the remaining servers are rebooted, MongoS refuses connections until the config replica set becomes healthy again.
Thanks for the toxiproxy recommendation, I’ll look into it. But I feel like turning off servers is also a valid test case: in case of datacenter emergencies, we may have to power off some servers quickly - and even in this state, I expect the remaining servers to still accept connections.