I have a sharded MongoDB cluster with tens of shards. Sometimes a hardware or hypervisor failure causes a random replica to start lagging. Queries to the problematic replica begin to queue up. At some point they saturate the connection pools on the mongoses, which leads to a cluster-wide denial of service.
A single shard failure shouldn’t lead to failure of the entire cluster. After some consideration I came up with the following options:
- Set operation timeouts as low as possible. This will limit connection pool usage by a single shard. The problem with this approach is that I have lots of different requests with different typical latencies, so tuning the timeouts won’t be easy.
- Use ShardingTaskExecutorPoolMaxSize to limit the number of connections to a particular mongod instance.
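
To make the first option concrete, here is a minimal sketch of what I have in mind: a per-request-class time budget passed to the server as maxTimeMS. The class names, the millisecond values, and the helper functions are all hypothetical; the only real API assumed is PyMongo's `max_time_ms` keyword on `Collection.find`.

```python
# Hypothetical per-request-class time budgets in milliseconds.
# The class names and values are illustrative assumptions, not recommendations.
TIMEOUT_MS = {
    "point_read": 50,
    "range_scan": 500,
    "report": 5000,
}

# Fallback budget for request classes that have not been tuned yet (assumed value).
DEFAULT_TIMEOUT_MS = 200


def timeout_for(op_class):
    """Return the time budget (ms) for a request class, or the default."""
    return TIMEOUT_MS.get(op_class, DEFAULT_TIMEOUT_MS)


def find_with_budget(collection, op_class, query):
    """Run a find whose maxTimeMS comes from the request class.

    `collection` is expected to be a PyMongo Collection; `max_time_ms`
    is the PyMongo keyword that sets maxTimeMS on the query.
    """
    return collection.find(query, max_time_ms=timeout_for(op_class))
```

A call would look like `find_with_budget(db.orders, "point_read", {"_id": order_id})`, where `db.orders` and `order_id` are placeholders. Keeping the budgets in one table at least puts the tuning in a single place, although it does not remove the need to measure typical latency for each request class.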
Are these good solutions for the single shard partial failure problem? Does anybody know other solutions?