Failover not working as advertised

The primary node is having an issue and is not failing over to the secondary!

The result is that the entire service is down for us, and there is zero support

Welcome to the MongoDB Community Forums @Frik_N_A!

Please provide some more information about your deployment (example commands follow the list):

  • Replica set configuration: rs.conf() with any hostnames anonymised.
  • Specific version of the MongoDB server as reported by db.version() in the MongoDB shell.
  • Description of the current state of the replica set: rs.status() with any hostnames anonymised.
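
For example, running something like the following in mongosh against one of your members will capture all three details (anonymise any hostnames in the output before posting):

    // Connect to any member of the replica set, then:
    db.version()   // exact server version
    rs.conf()      // replica set configuration
    rs.status()    // current state of each member (PRIMARY, SECONDARY, ...)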

You only mention a primary and a single secondary, which is a scenario that does not provide any fault tolerance: elections require a majority of voting members, so if either member of a two-member replica set is down, the surviving member cannot form a majority and will not step up as primary.

However, I assume you have at least three voting members in your replica set since you are expecting failover. Hopefully we’ll be able to figure out the issue given the additional requested details.
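
For context, here is a minimal sketch of a replica set with three voting members (the hostnames and set name are placeholders, not from your deployment). With three voting members, any single member can fail while the remaining two still form the majority needed to elect a primary:

    // Minimal three-member replica set, run once from mongosh.
    // Hostnames and the set name are placeholders.
    rs.initiate({
      _id: "rs0",
      members: [
        { _id: 0, host: "mongo1.example.net:27017" },
        { _id: 1, host: "mongo2.example.net:27017" },
        { _id: 2, host: "mongo3.example.net:27017" }
      ]
    })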

Regards,
Stennie

Your support team forced a reboot on the primary node and it is good now. I don’t understand how a primary node can fail with no failover from the MongoDB service. Is there something that needs to be configured? I would assume this should be the standard behavior.

Hi @Frik_N_A,

Automatic failover is definitely the expected behaviour: there is nothing to configure on the server side to enable failover in a replica set. From the client/driver side you need to be using a Replica Set Connection String URI so that any changes in deployment topology (election of a new primary, addition/removal of members, …) are detected and handled.
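
For example, a replica set connection string lists several members along with the replicaSet option, so the driver monitors all of them and follows elections; the hostnames, set name, and database name below are placeholders:

    // Replica set connection string: multiple seed hosts plus the set name.
    mongodb://mongo1.example.net:27017,mongo2.example.net:27017/mydb?replicaSet=rs0

    // Atlas-style SRV connection string: members are discovered via DNS.
    mongodb+srv://cluster0.abcde.mongodb.net/mydb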

Since you have a managed service, it would be best to ask the support team for context on what happened with your specific deployment and why they restarted the primary (which is not a typical requirement for failover).

One possible scenario is that your driver was not detecting a replica set configuration change, so restarting the primary would have dropped any open connections and perhaps forced the driver to rediscover the current topology.

From a client/driver point of view, I would:

  • Make sure a replica set connection string URI (or mongodb+srv:// for an Atlas cluster) is used.

  • Check the driver/ODM versions for any known issues.

  • If this is a dedicated Atlas cluster (M10+), use the Test Failover functionality to make sure your application handles failover events (a rough driver-side sketch follows this list).
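
As a rough sketch of that last point, the Node.js driver emits topology monitoring events that let you confirm a failover was detected; the connection string here is a placeholder, and this assumes a 4.x-or-later driver:

    // Sketch: log topology changes while triggering Test Failover.
    const { MongoClient } = require("mongodb");

    const client = new MongoClient(
      "mongodb+srv://cluster0.abcde.mongodb.net/?retryWrites=true"
    );

    // Fires whenever the driver's view of the replica set changes,
    // e.g. when a new primary is elected during a failover.
    client.on("topologyDescriptionChanged", (event) => {
      console.log("Topology is now:", event.newDescription.type);
    });

    async function main() {
      await client.connect();
      // Run your normal workload here while the failover test executes.
    }

    main().catch(console.error);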

If you’d like to share more details for investigating the client/driver side, the community may have some suggestions. I also recommend testing failover to make sure the process works as expected.

However, I would start by trying to get a root cause analysis (or at least a theory about what happened) from the deployment perspective, as that may explain why failover did not occur. I will flag this discussion with our support team to see if they can find the relevant case or conversation; if you already have an active conversation, replying there directly will engage the team members who previously investigated your issue.

Regards,
Stennie

Thanks. Sounds like I did all the right things. The Test Failover would not work: “The failover test cannot begin, as not all servers in the cluster are in a healthy state”.

Chat support (Ayush) asked for more info but then went silent for hours… so I was forced to pay for support to get this resolved. Then, a day later, I got this reply from Ayush: "Apologies for delay! Could you please confirm if you still facing the issue?"

Your support team had to do a manual reboot to get the cluster back online, and paying for support was the only way to make that happen. I’m not sure that is reasonable when part of the motivation for going with a MongoDB service is reliability.

They are still investigating the root cause… I hope to gain better confidence that this won’t happen again and impact our entire service offering, as we are thinking about migrating our services to MongoDB.
