Cannot reconnect to cluster during autoscaling

We are finding that our application can maintain a connection to our Atlas cluster while it autoscales, but if we try to restart our application during that window, it fails to connect. We get the errors MongoServerSelectionError: The server is in quiesce mode and will shut down and MongoServerSelectionError: Server selection timed out after 30000 ms. Here is the error report we are getting.

We are having this occur quite often in production when we see spikes in traffic. In these cases, our Atlas cluster will scale up, as well as our Kubernetes pods. When the new pods are created, our application fails to connect.
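As a stopgap, we are considering wrapping the initial connect in a retry loop so a pod that starts during the rollover keeps trying instead of crashing. A minimal sketch (the helper name and the simulated connect below are illustrative, not our production code):

```javascript
// Minimal sketch of a startup retry loop (illustrative only).
// `connect` is any async function that attempts the connection,
// e.g. a wrapper around mongoose.connect in our real code.
async function connectWithRetry(connect, { retries = 5, delayMs = 2000 } = {}) {
  for (let attempt = 1; attempt <= retries; attempt++) {
    try {
      return await connect();
    } catch (err) {
      if (attempt === retries) throw err; // out of retries, surface the error
      console.log(`Attempt ${attempt} failed (${err.message}), retrying in ${delayMs} ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}

// Simulated connect that fails twice (as if a node were in quiesce mode)
// and then succeeds, standing in for the real driver call.
let attempts = 0;
async function simulatedConnect() {
  attempts += 1;
  if (attempts < 3) throw new Error('The server is in quiesce mode and will shut down');
  return 'connected';
}

connectWithRetry(simulatedConnect, { retries: 5, delayMs: 10 })
  .then((result) => console.log(result)); // prints "connected" after two failed attempts
```

In production the simulated connect would be replaced by a call to mongoose.connect with our URI. This only masks the window, though, so we would still like to understand the root cause.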

According to the Cluster autoscaling docs, “Auto-scaling works on a rolling basis, meaning the process doesn’t incur any downtime,” but this does not seem to be the case for us.

I noticed that the connection string suggested by MongoDB Atlas is of the format:

mongodb+srv://<username>:<password>@<cluster name>-pri.<random code>.gcp.mongodb.net/<db name>

Is it possible that by appending -pri to the cluster name, we are only connecting to the primary node, which will then see downtime when it's spun down? I noticed that we can remove the -pri and still connect to the DB.

Hi @Francesco_Virga - Welcome to the community!

We are finding that our application can maintain a connection to our Atlas cluster while it autoscales, but if we try to restart our application then it fails to connect.

According to the Cluster autoscaling docs, “Auto-scaling works on a rolling basis, meaning the process doesn’t incur any downtime,” but this does not seem to be the case for us.

Can you confirm whether your application can still connect as normal while only the Atlas cluster is auto-scaling, and that the disconnection and server selection errors occur only when the cluster is auto-scaling and new pods are being created at the same time?

After this clarification, could you also provide the following information:

  1. Connection string, including any options being used (please redact any credentials or other sensitive information)
  2. Driver being used
  3. Driver version being used
  4. Whether you have tried connecting from a non-containerised environment during auto-scaling, to see if the disconnection errors occur there as well

Is it possible that by appending -pri to the cluster name, we are only connecting to the primary node, which will then see downtime when it's spun down? I noticed that we can remove the -pri and still connect to the DB.

I would recommend going over the details in this post and letting me know if you have any further questions regarding the additional -pri in the connection string. In short, the -pri that you see in the connection string is not directly associated with the primary node; it indicates the private connection string used with network peering. You may find the information on the Private Connection Strings page useful as well.
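For illustration, the two SRV forms differ only in the hostname the driver resolves; as I understand it, the SRV lookup for either form returns the full list of replica set members, not just the primary (the placeholders below match the format from your original post):

```
# Standard SRV connection string
mongodb+srv://<username>:<password>@<cluster name>.<random code>.gcp.mongodb.net/<db name>

# Private (peering) SRV connection string; the -pri suffix is part of the hostname only
mongodb+srv://<username>:<password>@<cluster name>-pri.<random code>.gcp.mongodb.net/<db name>
```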

Regards,
Jason

Hi @Jason_Tran, thanks for getting back to me!

I revisited the issue and realized it also happens with already-running pods. I've been able to reproduce this locally by running our application and connecting to the DB. I created an issue in the Mongoose GitHub repository which outlines all the information and logs; see here.

  1. mongodb+srv://<user>:<pwd>@<cluster-name>-pri.<random-code>.gcp.mongodb.net/<db name>?retryWrites=true&w=majority (taken directly from the Atlas UI)
  2. mongoose v6.1.6, which uses the mongodb driver v4.2.2 (Node.js)
  3. See 2.
  4. Yes, I’ve seen it occur even without autoscaling, simply by running the server locally.