Atlas Multi-Cloud Global Cluster: Always Available, Even in the Apocalypse!
In recent years, "high availability" has been a buzzword in IT. In practice, it means keeping your application and services as resilient to disruptions as possible.
As vendors, we have to guarantee certain levels of uptime via SLA contracts, because maintaining high availability is crucial to our customers. These days, downtime, even for a short period, is widely considered unacceptable.
To improve read locality and reduce cross-region network overhead, Atlas provides a "Local reads in all Zones" option. It directs Atlas to automatically place at least one secondary from each shard in each of the other regions. With an appropriate read preference, our application can now read data in every region without issuing cross-region queries. See the read preference documentation to better understand how to target local nodes or specific cloud nodes.
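As an illustration, here is a minimal sketch of building an Atlas SRV connection string that asks the driver for latency-based reads. The hostname is a placeholder, not a real cluster:

```python
from urllib.parse import urlencode

# Placeholder cluster host; substitute your own Atlas SRV hostname.
host = "cluster0.example.mongodb.net"

# readPreference=nearest tells the driver to read from the lowest-latency
# member, which may be a local secondary placed by "Local reads in all Zones".
options = {
    "readPreference": "nearest",
    "retryWrites": "true",
    "w": "majority",
}
uri = f"mongodb+srv://{host}/?{urlencode(options)}"
print(uri)
```

Any official MongoDB driver accepts these options either in the URI or as client constructor arguments.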
MongoDB 4.4 introduced another interesting feature around read preferences for sharded clusters, called hedged reads. A hedged read is dispatched to two replica set members per shard, and the fastest response wins. This can give us a fast response even if it is served from a member in a different cloud. Since this feature is only allowed for non-primary read preferences (like `nearest`), it should be considered eventually consistent, and that should be taken into account in your application's consistency requirements.
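Drivers express hedged reads through the read preference document they send to `mongos`. A sketch of its shape, written here as a plain Python dict for illustration:

```python
# The read preference document a driver sends for a hedged "nearest" read.
# With hedging enabled, mongos routes the read to two members per shard
# and returns the first response it receives.
read_pref = {
    "mode": "nearest",          # hedged reads require a non-primary mode
    "hedge": {"enabled": True},
}
```

In real code you would pass the equivalent option through your driver's read preference API rather than build this document by hand.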
One of the latest breakthroughs in the Atlas service is the ability to run a deployment across cloud vendors (AWS, Azure, and GCP). This feature is now also available in Global Cluster configurations.
We can now have shards spanning multiple clouds and regions in one cluster, with one unified connection string. Thanks to the smart tagging of replica set members and hosts, we can have services work in isolation within a single cloud, or benefit from being cloud agnostic.
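Atlas pre-tags each member with metadata such as its cloud provider, so a service can pin its reads to one cloud via `readPreferenceTags`. A stdlib-only sketch (the hostname is a placeholder, and the `provider` tag is the Atlas-defined replica set tag):

```python
from urllib.parse import urlencode

host = "cluster0.example.mongodb.net"  # placeholder Atlas SRV hostname

# Atlas tags each member with its cloud provider; readPreferenceTags lets a
# service read only from that provider's nodes. The trailing empty tag set
# means "fall back to any member" if no tagged node is available.
options = [
    ("readPreference", "nearest"),
    ("readPreferenceTags", "provider:AWS"),
    ("readPreferenceTags", ""),
]
uri = f"mongodb+srv://{host}/?{urlencode(options)}"
print(uri)
```

Dropping the empty fallback tag set would make reads fail rather than leave the chosen cloud, which is the behavior you'd want for strictly isolated services.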
When you set up a Global Cluster, the way you configure it changes its availability characteristics. As you configure your cluster, you can immediately see how your configuration covers your resiliency, HA, and performance requirements. It's an awesome feature! Let's dive into the full set:
| Ability | Description | Feature that covers it |
| --- | --- | --- |
| Low-latency reads and writes in <SHARD REGION> | Having a Primary in each region allows us to query and write data within the region. | Defining a zone in a region covers this ability. |
| Local reads in all zones | If we want a local node to serve another zone's data (e.g., in America, query for Europe documents), each other zone needs to place at least one secondary in the local region (e.g., the Europe shard will have one secondary in the America region). This requires our reads to use a latency-based read preference. | Pressing "Allow local reads in all zones" will place one secondary from each zone in every other zone. |
| Available during partial region outage | If a cloud "availability zone" fails within a specific region, regions with more than one availability zone can continue to function as normal. | Have the zone's preferred region spread its electable nodes across two or more of the cloud provider's availability zones. Those regions are marked with a star in the UI. For example: two nodes in AWS N. Virginia, where each one is, by design, deployed over three potential availability zones. |
| Available during full region outage | If an entire cloud region fails, a majority of nodes must sit outside that region to maintain a primary within the zone. | Have a majority of "Electable" nodes outside the zone's preferred region. For example: two nodes in N. Virginia, two nodes in N. California, and one node in Ireland. |
| Available during full cloud provider outage | If a whole cloud provider becomes unavailable, the zones still have a majority of electable nodes on other cloud providers, so no zone depends on a single provider. | Spreading nodes across all three clouds lets you withstand a full cloud provider failure. For example: two nodes on AWS N. Virginia, two nodes on GCP Frankfurt, and one node on Azure London. |
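The fault-tolerance rules above boil down to one check: after losing a region or a provider, does the zone keep a strict majority of its electable nodes, so it can still elect a primary? A small illustrative helper (not an Atlas API) that applies that rule:

```python
def survives_outage(nodes, failed):
    """Return True if a strict majority of electable nodes remains after
    every node matching `failed` (a provider or region name) goes down.
    `nodes` is a list of (provider, region) tuples, one per electable node."""
    total = len(nodes)
    remaining = [n for n in nodes if failed not in n]
    # A replica set can elect a primary only with a strict majority of votes.
    return len(remaining) > total // 2

# The multi-cloud example from the table: two nodes on AWS N. Virginia,
# two on GCP Frankfurt, one on Azure London.
deployment = (
    [("AWS", "N. Virginia")] * 2
    + [("GCP", "Frankfurt")] * 2
    + [("Azure", "London")]
)

print(survives_outage(deployment, "AWS"))          # full AWS outage: 3 of 5 remain
print(survives_outage(deployment, "N. Virginia"))  # full region outage: 3 of 5 remain
```

The same deployment loses at most two of its five electable nodes in any single provider or region outage, which is exactly why the table recommends that layout.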
After deploying our cluster, we have a fully global, cross-region, cross-cloud, fault-tolerant cluster with low read and write latencies across the globe. All of this is accessed via a single unified SRV connection string.
I don't think that our application has anything to fear, other than its own bugs.
To show how easy it is to manage this complex deployment, I YouTubed it:
To learn more about deploying a cross-region global cluster that follows all of our fault tolerance best practices, check out the video.
Meeting our global application's demand and scale has never been easier, while keeping the highest possible availability and resiliency. Global multi-cloud clusters allow IT to sleep well at night, knowing that their data is always available, even in the apocalypse!