We recently moved from the shared cluster (where we had no downtime but degraded performance) to a dedicated cluster (M20 to M40) with autoscaling enabled. On at least two occasions we hit 95%+ CPU utilization and had about an hour of downtime while the cluster scaled up or down, even though the operation is described as a rolling update. The connections went stale, and we even had to redeploy the app.
Is this normal? What are we doing wrong?
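For the stale connections mentioned above, one common application-side mitigation is to retry transient failures with backoff instead of redeploying. The following is only a minimal sketch, not anything Atlas prescribes: `call_with_retry` and its parameters are hypothetical names, and the default exception types are Python built-ins. With PyMongo you would likely pass `pymongo.errors.AutoReconnect` as `retryable` instead.

```python
import time


def call_with_retry(operation, retryable=(ConnectionError, TimeoutError),
                    attempts=5, base_delay=0.5, sleep=time.sleep):
    """Run operation(), retrying transient failures with exponential backoff.

    `retryable` is an assumption: swap in your driver's transient error types.
    """
    for attempt in range(attempts):
        try:
            return operation()
        except retryable:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # Wait 0.5 s, 1 s, 2 s, ... before the next try.
            sleep(base_delay * 2 ** attempt)
```

Wrapping each database call this way gives the driver time to rediscover the replica set while a node is being replaced. It does not help if the remaining nodes are already saturated.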
We just faced the same issue, and I believe the documentation should be adjusted so that it does not mislead customers. Unlike Kubernetes, the system seems to restart the nodes one by one. That means that for clusters set up with 3 nodes (2 readers), applications configured to prefer secondaries may shift all load to one node, bringing the whole system down with an overloaded DB (timeouts) until the cluster is fully scaled and the queues are back to normal.
It would be very helpful if the documentation described these limitations instead of promising that there is no downtime, and even more helpful if 100% of the nodes were guaranteed to be available during the scaling process, meaning all new nodes are fully started before the switch happens.
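The load shift described above is simple arithmetic, sketched here for illustration. The function name and the figure of 1000 reads per second are hypothetical, and the model assumes reads are spread evenly across the available secondaries, which is only an approximation of how drivers select servers.

```python
def read_load(total_reads, secondaries_up):
    """Reads per second landing on each node when clients prefer secondaries.

    Assumes an even split across available secondaries (a simplification).
    """
    if secondaries_up == 0:
        # No secondary available: reads fall back to the primary.
        return {"primary": float(total_reads), "per_secondary": 0.0}
    return {"primary": 0.0, "per_secondary": total_reads / secondaries_up}


normal = read_load(1000, secondaries_up=2)   # both secondaries serving reads
rolling = read_load(1000, secondaries_up=1)  # one secondary being replaced

print(normal["per_secondary"])   # 500.0
print(rolling["per_secondary"])  # 1000.0
```

Under these assumptions the surviving secondary sees its read load double during the rolling replacement, which is a problem if it was already close to its CPU limit.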
Hello, how are you? Welcome to the MongoDB community.
I have some experience with this and wanted to share my thoughts. The first point is that if your cluster is at 100% CPU, Atlas has difficulty autoscaling, and this can end up impacting your application. Atlas autoscaling is being adjusted to become more sensitive (you can see what has changed in the documentation). Previously, it would happen (rarely) that a cluster took a huge hit and was unable to scale; we would then need to reduce the workload so that Atlas could autoscale and handle the load.
Another point, which Felipe mentioned, is that the cluster autoscales with a rolling strategy: the nodes are replaced one by one, starting with the secondaries, and if you use readPreference=secondaryPreferred, the connections will be pointed at the only remaining secondary. However, if the load on that node is already at 100% when the rolling update starts, you may need to intervene in the application, as I mentioned above.
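For reference, here is a sketch of how the read preference is usually set in the connection string. The host name is a placeholder, credentials are omitted, and the timeout value is illustrative rather than recommended. `readPreference`, `retryReads`, `retryWrites` and `serverSelectionTimeoutMS` are standard MongoDB connection string options; `build_uri` is a hypothetical helper.

```python
from urllib.parse import urlencode


def build_uri(host, **options):
    """Assemble a mongodb+srv connection string from keyword options."""
    return f"mongodb+srv://{host}/?{urlencode(options)}"


uri = build_uri(
    "cluster0.example.mongodb.net",       # placeholder host, not a real cluster
    readPreference="secondaryPreferred",  # note the spelling: "Preferred"
    retryReads="true",                    # retry a failed read once
    retryWrites="true",                   # retry a failed write once
    serverSelectionTimeoutMS=30000,       # illustrative value only
)
print(uri)
```

Whether `secondaryPreferred` is the right choice for a 3-node cluster depends on how much headroom each secondary has, as discussed above.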
Atlas autoscaling works very well when the load increases gradually, but when it increases all at once, some problems can occur. However, MongoDB has been working more and more on autoscaling and, as I mentioned, has recently adjusted some things.
In any case, I would advise you to open a support ticket to ask these questions.