No stepdown on EBS Volume failure

Joao_Santos · March 24, 2021, 7:37pm

Hi,

We recently had a problem where, after performing a resize operation on an EBS volume, the volume completely stopped responding to MongoDB queries for some minutes. We recovered from this state by forcing a restart of the MongoDB primary host, which triggered a failover to a secondary. We did try to execute a stepdown on the primary, but it had no effect, which left us with restarting the server as the only option.

There was no automatic failover (i.e. the primary stepping down on its own) because even though the data volume was not responding, the mongo process was still up and running and responding to health checks from the secondaries.

So, to summarise: the volume was not responding, no query was being executed successfully, the CPU on the host was showing more than 50% in io-wait, and the manual stepdown had no effect. Only the host restart worked.

While this is of course a failure in the underlying hardware, is there a way to configure Mongo to fail over when the data volume shows this type of behaviour or failure?

Thanks

Stennie_X · March 28, 2021, 1:29am

Welcome to the MongoDB Community @Joao_Santos!

There is a Storage Node Watchdog feature you can enable to detect filesystem unresponsiveness and terminate the mongod process if a critical directory path is unresponsive:

By default, the Storage Node Watchdog is disabled. You can only enable the Storage Node Watchdog on a mongod at startup time by setting the watchdogPeriodSeconds parameter to an integer greater than or equal to 60. However, once enabled, you can pause the Storage Node Watchdog and restart during runtime. See watchdogPeriodSeconds parameter for details.

I would only use this with a replica set member. If mongod is terminated by the watchdog process due to unresponsive I/O, mongod may not be able to cleanly restart. The documentation page (linked above) has more details.
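
For reference, here is a minimal sketch of enabling the watchdog through the setParameter section of a YAML-format mongod configuration file. The value 60 is simply the minimum period allowed per the quote above, not a tuned recommendation. The runtime commands in the comments are how I understand the watchdogPeriodSeconds documentation, so please verify them against the docs for your server version before relying on them.

```yaml
# mongod.conf fragment (sketch): enable the Storage Node Watchdog at startup.
setParameter:
  watchdogPeriodSeconds: 60   # must be an integer >= 60; 60 is only an example value

# Equivalent command-line form:
#   mongod --setParameter watchdogPeriodSeconds=60
#
# Once enabled at startup, the watchdog can be paused or resumed at runtime
# from the shell (assumption: check the docs for your version):
#   db.adminCommand({ setParameter: 1, watchdogPeriodSeconds: -1 })   // pause
#   db.adminCommand({ setParameter: 1, watchdogPeriodSeconds: 60 })   // resume
```

Because the watchdog terminates mongod rather than stepping it down, the failover then happens through the normal replica set election once the secondaries stop receiving heartbeats from the terminated primary.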

Regards,
Stennie
