Single shard partial failure handling

Hi!

I have a sharded MongoDB cluster with tens of shards. Sometimes a hardware or hypervisor failure causes some random replica to start lagging. Queries to the problematic replica start to queue up, and at some point they saturate the connection pools on the mongos instances, which leads to cluster-wide denial of service.

A single shard failure shouldn’t lead to an entire cluster failure. After some consideration I came up with the following options:

  1. Set operation timeouts as low as possible. This limits how much of the connection pool a single shard can consume. The problem with this approach is that I have lots of different requests with different typical latencies, so tuning the timeouts won’t be easy (see the sketch after this list).
  2. Use ShardingTaskExecutorPoolMaxSize to limit the number of connections mongos opens to a particular mongod instance.
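
For option 1, the rough idea with the Go driver we use would be something like the sketch below; the category names, collection, and timeout values are just illustrative, not real settings from our app:

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
)

// Illustrative per-category budgets; real values would come from the observed
// latency profile of each request type.
var opTimeouts = map[string]time.Duration{
	"user_lookup":  200 * time.Millisecond,
	"report_query": 5 * time.Second,
}

// findOneWithTimeout bounds a single read with a deadline so the application
// stops waiting on a slow shard instead of letting requests pile up.
func findOneWithTimeout(coll *mongo.Collection, filter bson.D, category string) (bson.M, error) {
	ctx, cancel := context.WithTimeout(context.Background(), opTimeouts[category])
	defer cancel()

	var doc bson.M
	err := coll.FindOne(ctx, filter).Decode(&doc)
	return doc, err
}
```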

Are these good solutions for the single-shard partial failure problem? Does anybody know of other solutions?

Thanks!

Hello @Sergey_Zagursky,

Welcome to The MongoDB Community Forums! :wave:

Could you please share the details below so I can learn more about your use case?

  • MongoDB Server version?
  • What is the replica set configuration of your shards (e.g. PSA, PSS, etc.)?
  • If the primary node of a replica set is down but a secondary is available, the shard without a primary will still be available for reads. Do you want all operations to time out, or only write operations?

Regards,
Tarun


We are currently on MongoDB 4.4.10 with a PSS configuration.

The problem is that if a single primary suddenly becomes 100x slower, connections to it immediately saturate the mongos connection pools and the client connection pools. From the client’s perspective it looks like the whole MongoDB cluster is not responding. What I want is for the cluster to remain operational, with the only exception being requests served by the degraded shard.

  • Do you see a pattern in these failures, and what was the root cause (swapping, hardware issues, network issues, or something else)?

  • Could it be that your cluster is running at full hardware capacity, so that a small failure cascades into a much larger one? Have you considered upgrading the hardware just to see if the failure still occurs? Alternatively, depending on the use case, is it possible to add more shards?

  • 4.4.17 is the latest release in the 4.4 series. There are improvements between 4.4.10 and 4.4.17 that may help, so upgrading to the newest version would rule out issues that have already been fixed.

Lastly, what driver version are you using?

To be clear, the failure itself is not what I want to address; it’s the cluster-wide impact of a single shard failing.

The main causes of such failures are hardware faults and human error during manual maintenance procedures.

No, it is not running at full capacity. Adding more shards would just increase the probability of a single shard failing.

No, we haven’t tried newer versions, and the problem I’m describing is independent of the MongoDB version. Hardware failures are equally deadly to any MongoDB version, and what I’m asking here is how to continue serving requests to the shards that are still alive.

We are using the latest version of the official Go MongoDB driver. But again, this is hardly relevant.

Without knowing your full deployment details and use case, and since the original question was about a single shard failure, I was thinking that more horizontal scaling might alleviate the issue. But again, this depends on details we are not familiar with.

Newer MongoDB versions have bugfixes and new features that might alleviate certain issues. Upgrading to a newer version ensures that you are not hitting problems that have already been fixed.

Hardware failures are just as deadly to any other database and/or application, so this is not limited to MongoDB :slight_smile:

If some queries depend on an unavailable shard, it may be that the application floods the database with requests that don’t time out or have long timeouts. One possible solution is to limit query timeouts, for example by using wtimeout for write operations and/or maxTimeMS() for read operations. This needs to be balanced against possible network or disk latencies so the app doesn’t give up too quickly when the hardware just needs a moment to answer the query.
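
For illustration only, with the Go driver that could look roughly like the following; the database name, collection name, and the timeout values are placeholders to show where wtimeout and maxTimeMS plug in:

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
	"go.mongodb.org/mongo-driver/mongo/writeconcern"
)

func example(client *mongo.Client) error {
	// Write concern with wtimeout: stop waiting for replication acknowledgement
	// after 2 seconds (the write itself may still be applied later).
	wc := writeconcern.New(writeconcern.WMajority(), writeconcern.WTimeout(2*time.Second))
	coll := client.Database("app").Collection("orders", options.Collection().SetWriteConcern(wc))

	ctx := context.Background()

	// maxTimeMS on a read: the server aborts the query after 500 ms.
	cur, err := coll.Find(ctx, bson.D{{Key: "status", Value: "open"}},
		options.Find().SetMaxTime(500*time.Millisecond))
	if err != nil {
		return err
	}
	defer cur.Close(ctx)

	// A write that returns a wtimeout error if replication lags too long.
	_, err = coll.InsertOne(ctx, bson.D{{Key: "status", Value: "open"}})
	return err
}
```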

PSS replica sets are usually reasonably resilient to failure, but if you’re struggling with operational issues and are open to using a hosted service, you might want to consider MongoDB Atlas, which takes care of these operational concerns for you.

Thanks! As I said previously, we’re already considering tightening our timeouts to mitigate the extent of the problem. Are there any pool settings that would prevent a single shard from saturating the entire mongos pool? I found the ShardingTaskExecutorPoolMaxSize setting, but it only limits outgoing connections to mongod, and the incoming pool on mongos still saturates.
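
For context, the client-side knobs available to us look roughly like this sketch (the connection string and numbers are placeholders, and SetTimeout is only available in newer releases of the Go driver):

```go
package main

import (
	"context"
	"time"

	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func connect() (*mongo.Client, error) {
	opts := options.Client().
		ApplyURI("mongodb://mongos1:27017,mongos2:27017"). // placeholder hosts
		SetMaxPoolSize(100).        // cap connections this client opens per mongos
		SetTimeout(2 * time.Second) // default per-operation timeout (newer driver versions)
	return mongo.Connect(context.Background(), opts)
}
```

This bounds how long the application waits and how many connections it holds, but it doesn’t solve the mongos-side incoming pool saturation, which is exactly the part I’m asking about.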