There is a yellow warning icon on the left of the primary node of my Atlas cluster and I can't connect to it

mba_cat · May 18, 2021, 10:51am

@MaBeuLux88_xxx thank you for your comment. I also believed that any update to a cluster would be applied in a rolling manner, but that is not what happened in this case.

Thank you for all your assistance and my apologies for the long post that follows:

TLDR:
I need to fully understand the impact for an upcoming project, as we cannot have a production cluster become unavailable due to circumstances beyond our control.

Detail:
As per the case above, I saw my cluster (the primary and both secondaries) with the warning icon, and the entire cluster was unavailable for approximately 20 minutes. This corresponds to the timings shown in the activity feed after the incident (the activity feed did not show this at the time).

In addition, other users experienced the same issue, as also recorded in post 107411:

@Jurn_Ho - same issue with M0 cluster in AWS eu-central-1 but other M0 clusters also in AWS eu-central-1 were not impacted
@Juan_Diaz_1 - unknown cluster unavailable at same time for ~30 minutes

Given that this does not appear to be a one-off issue affecting only me, can you investigate and advise what the issue with the update / cluster was?

Some questions:

As per the logs above, why did the update take so long? (The logs show a 19:44 start and a 20:01 completion.)
Why was the cluster unavailable - was the update not applied in a rolling manner?
Was the update applied to all nodes simultaneously because this is an M0 cluster?
Was the non-rolling update related to this cluster being an M0 or being in the AWS eu-central-1 region?
Would a paid or dedicated cluster have experienced this downtime - do you have any evidence of this?

I appreciate that a lot of these questions are overkill, especially considering this is an M0 cluster. However, I am in the middle of validating and pricing Mongo Atlas and Mongo Realm for a large project that is looking to deploy multiple large multi-region clusters (M60+) to support our global app and 200k users. I need to understand this impact, as we cannot maintain our SLA with our end users if production clusters become randomly unavailable due to circumstances we cannot control.