Cloud Data Strategies: Preventing Data Black Holes in the Cloud

Leo Zheng


Black holes are regions in spacetime with such strong gravitational pull that nothing can escape. Not entirely destructive as you might have been led to believe, their gravitational effects help drive the formation and evolution of galaxies. In fact, our own Milky Way galaxy orbits a supermassive black hole with 4.1 million times the mass of the Sun. Some theorize that none of us would be here were it not for a black hole.

On the flip side, black holes can also be found hurtling through the cosmos — often at millions of miles per hour — tearing apart everything in their path. It’s said that anything that makes it into their event horizons, the “point of no return”, will never be seen or heard from again, making black holes some of the most interesting and terrifying objects in space.

Why are we going on about black holes, gravitational effects, and points of no return? Because something analogous is happening right now in computing.

First coined in 2010 by Dave McCrory, the concept of “data gravity” treats data as if it were a planet or celestial object with mass. As data accumulates in an environment, applications and services that rely on that data will naturally be pulled into the same environment. The larger the “mass” of data there is, the stronger the “gravitational pull” and the faster this happens. Applications and services each have their own gravity but data gravity is by far the strongest, especially as:

  • The farther away data is, the more drastic the impacts on application performance and user experience. Keeping applications and services physically nearby reduces latency, maximizes throughput, and makes it easier for teams to build performant applications.
  • Moving data around has a cost. In most cases, it makes sense to centralize data to reduce that cost, which is why data tends to amass in one location or environment. Yes, distributed systems do allow organizations to partition data in different ways for specific purposes — for example, fencing sets of data by geographic borders to comply with regulations — but within those partitions, minimal data movement is still desirable.
  • And finally, efforts to digitize business and organizational activities, processes, and models (dubbed by many as “digital transformation” initiatives) succeed or fail based on how effectively data is utilized. If software is the engine by which digital transformation happens, then data is its fuel.

As in the real world, the larger the mass of an object, the harder it is to move, so data gravity also means that once your mass of data gets large enough, it is also harder (and in some cases, near impossible) to move. What makes this relevant now more than ever is the shift to cloud computing. As companies move to the cloud, they need to make a decision that will have massive implications down the line — where and how are they going to store their data? And how do they not let data gravity in the cloud turn into a data black hole?

There are several options for organizations moving from building their own IT to consuming it as a service in the cloud.

Proprietary Tabular (Relational) Databases

The companies behind proprietary tabular databases often penalize their customers for running these technologies on any cloud platform other than their own. This should not surprise any of us. These are the same vendors that for decades have been relying on selling heavy proprietary software with multi-year contracts and annual maintenance fees. Vendor lock-in is nothing new to them.

Organizations choosing to use proprietary tabular databases in the cloud also carry over all the baggage of those technologies and realize few cloud benefits. These databases scale vertically and often cannot take advantage of cloud-native architectures for scale-out and elasticity without massive compromises. If horizontal scale-out of data across multiple instances is available, it isn’t native to the database and requires complex configurations, app-side changes, and additional software.

Lifting and shifting these databases to the cloud does not change the fact that they’re not designed to take advantage of cloud architectures.

Open Source Tabular Databases

Things are a little better with open source tabular databases insofar as there is no vendor enforcing punitive pricing to keep you on their cloud. However, similar to proprietary tabular databases, most of these technologies are designed to scale vertically; scaling out to fully realize cloud elasticity is often managed with fragile configurations or additional software.

Many companies running these databases in the cloud rely on a managed service to reduce their operational overhead. However, feature parity across cloud platforms is nonexistent, making migrations complicated and expensive. For example, databases running on Amazon Aurora leverage Aurora-specific features not found on other clouds.

Proprietary Cloud Databases

With proprietary cloud databases, it’s very easy to get into a situation where data goes in and nothing ever comes out. These database services run only in their parent cloud and often provide very limited database functionality, requiring customers to integrate additional cloud services for anything beyond very simple use cases.

For example, many of the proprietary cloud NoSQL services offer little more than key-value functionality; users often need to pipe data into a cloud data warehouse for more complex queries and analytics. They also tend to be operationally immature, requiring additional integrations and services to address data protection and provide adequate performance visibility. And it doesn’t stop there. New features are often introduced in the form of new services, and before users know it, instead of relying on a single cloud database, they’re dependent on an ever-growing network of cloud services. This makes it all the more difficult to ever get data out.

The major cloud providers know that if they’re able to get your data in one of their proprietary database services, they’ve got you right where they want you. And while some may argue that organizations should actually embrace this new, ultimate form of vendor lock-in to get the most out of the cloud, that doesn’t leave customers with many options if their requirements, or if data regulations, change. What if the cloud provider you’re not using releases a game-changing service you need to edge out your competition? What if they open up a data center in a new geographic region you’ve prioritized and yours doesn’t have it on their roadmap? What if your main customer dictates that you should sever ties with your cloud provider? It’s happened before.

These are all scenarios where you could benefit from using a database that runs the same, everywhere.

The database that runs the same ... everywhere

As you move into the cloud, how you prevent data gravity from turning against you and limiting your flexibility is simple — use a database that runs the same in any environment.

One option to consider is MongoDB. As a database, it combines the flexibility of the document data model with sophisticated querying and indexing required by a wide range of use cases, from simple key-value to real-time aggregations powering analytics.

MongoDB is a distributed database designed for the cloud at its core. Redundancy for resilience, horizontal scaling, and geographic distribution are native to the database and easy to use.

And finally, MongoDB delivers a consistent experience regardless of where it is deployed:

  • For organizations not quite ready to migrate to the cloud, they can deploy MongoDB on premises behind their own firewalls and manage their databases using advanced operational tooling.
  • For those that are ready to migrate to the cloud, MongoDB Atlas delivers the database as a fully managed service across more than 50 regions on AWS, Azure, and Google Cloud Platform. Built-in automation of proven practices helps reduce the number of time-consuming database administration tasks that teams are responsible for, and prevents organizations from migrating their operational overhead into the cloud as well. Of course, if you want to self-manage MongoDB in the cloud, you can do so.
  • And finally, for teams that are well-versed in cloud services, MongoDB Atlas delivers a consistent experience across AWS, Azure, and Google, allowing the development of multi-cloud strategies on a single, unified data platform.

Data gravity will no doubt have a tremendous impact on how your IT resources coalesce and evolve in the cloud. But that doesn’t mean you have to get trapped. Choose a database that delivers a consistent experience across different environments and avoid going past the point of no return.


To learn more about MongoDB, check out our architecture guide.

You can also get started with a free 512 MB database managed by MongoDB Atlas here.

Header image via Paramount