MongoS refusing connections during network split

( question also asked at cluster - MongoS refusing connections during network split - Server Fault )

I am exploring sharded MongoDB clusters, and I have trouble understanding MongoS’s behavior during a network split.

My test environment works as expected until I simulate a network split.
In the “smallest” network partition, I have:

  • N MongoS servers
  • only 1 config server (out of 3 total), in SECONDARY status
  • and N shard servers

In this partition:

  • at first, all MongoDB servers + MongoS still accept connections and behave as expected
  • if I restart these MongoDB+MongoS servers => MongoS servers start but refuse connections (MongoNetworkError: connect ECONNREFUSED).

Do you know why? I must be missing something obvious.
As far as I can tell from startup logs, MongoS sends queries like {"find":"shards", "readConcern":{"level":"majority"}} to the config server, which seems guaranteed to fail during a network split (Command timed out waiting for read concern to be satisfied).

Thanks for any help

I reproduced this behavior when following https://www.mongodb.com/docs/manual/tutorial/deploy-shard-cluster/ (using MongoDB 6.0.1) :

  • create and initialize a 3-member configuration server replica set => looks healthy, 1 PRIMARY + 2 SECONDARY
  • skip the shard replica set creation
  • start a MongoS server using this configuration replica set => MongoS accepts connections as expected
  • stop all servers (MongoDB + MongoS)
  • start only one configuration server MongoDB to simulate a network split => it accepts connections, and is in SECONDARY state as expected
  • start MongoS => it does not accept connections (ECONNREFUSED)

I expected MongoS to start and accept connections, even when the configuration replica set is in degraded mode (and only one SECONDARY configuration server is reachable by MongoS).

Is that expectation wrong?

Hey @Nick_Turner,

  • create and initialize a 3-member configuration server replica set => looks healthy, 1 PRIMARY + 2 SECONDARY
  • skip the shard replica set creation

Can I ask why you’re skipping the shard replica set creation in the above step? It sounds like you’re deliberately creating a mangled sharded cluster. In this state, I don’t think we can guarantee certain expected behaviour.

If you’re intending to simulate how a sharded cluster behave in a degraded state (which I think is a valid goal), I believe you need to have a working sharded cluster first, then use a tool like toxiproxy to simulate certain network conditions. I’m not sure if turning off part of the cluster will give you accurate information.

Regards,
Satyam

2 Likes

My initial test had a functioning sharded cluster (2 shards, each one on a 3-server replica set). That’s when I noticed this MongoS behavior when the cluster is in a degraded state.

I did a second test just to try and isolate the issue (reproducing the same behavior with fewer servers, while following the tutorial as much as possible). That’s why I skipped the shard replica set creation: even without it, I observed the same behavior. When my config replica set is healthy, MongoS accepts connections. But when the config replica set is in degraded state and the remaining servers are rebooted, MongoS refuses connections until the config replica set becomes healthy again.

Thanks for the toxiproxy recommendation, I’ll look into it. But I feel like turning off servers is also a valid test case: in case of datacenter emergencies, we may have to power off some servers quickly - and even in this state, I expect the remaining servers to still accept connections.