How Much is Your Data Model Costing Your Business?

Daniel Coupal and Rick Houlihan

Economic volatility is creating an unpredictable business climate, forcing organizations to stretch their dollars further and do more with less. Investments are under the microscope, and managers are looking to wring every ounce of productivity out of existing resources. IT spend is a concern, and many IT decision-makers aren't sure what's driving costs. Is it overprovisioning? Cloud sprawl? Shadow IT? One area that doesn't get much attention is how data is modeled in the database. That's unfortunate, because data modeling can have a major impact on the cost of database operations, the instance size needed to handle workloads, and the effort required to develop and maintain applications.


Pareto patterns

Data access patterns are often an illustration of the Pareto Principle at work, where the majority of effects are driven by a minority of causes. Modern OLTP applications tend to work with data in small chunks. The vast majority of data access patterns (the ways applications access and use data) work with either a single row of data or a range of rows from a single table. At least, that's what we found at Amazon when we looked at 10,000 RDBMS-based services we had deployed. Normalized data models are quite efficient for these simple single-table queries, but the less frequent, complex patterns require the database to join tables to produce a result, exposing RDBMS inefficiencies. The high time complexity of these queries meant significantly more infrastructure was required to support them.

The relational database hides much of this overhead behind the scenes. When you send a query to a relational database, you don't actually see all the connections opening up on all the tables, or all the objects merging. Even though 90% of the access patterns at Amazon were for simple things, the 10% that were doing more complex things were burning through CPU to the point that my team estimated they were driving ~50% of infrastructure cost. This is where NoSQL data modeling can be a game-changer. NoSQL data models are designed to eliminate expensive joins, reduce CPU utilization, and save on compute costs.

Modeling for efficiency in NoSQL

There are two fundamental approaches to modeling relational data in NoSQL databases:

  1. Embedded Document - All related data is stored in a single rich document which can be efficiently retrieved when needed.

  2. Single Collection - Related data is split out into multiple documents to efficiently support access patterns that require subsets of a larger relational structure. Related documents are stored in a common collection and contain attributes that can be indexed to support queries for various groupings of related documents.

The key to building an efficient NoSQL data model and reducing compute costs is letting the workload drive the choice of data model. For example, a read-heavy workload like a product catalog that runs queries such as "get all the data for a product" or "get all the products in a category" will benefit from an embedded document model because it avoids the overhead of reading multiple documents.
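As a rough illustration of the embedded approach, here is a minimal PyMongo sketch of a product-catalog document and the single-read queries it supports. The connection string, database, collection, and field names are all hypothetical.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # adjust for your deployment
catalog = client["shop"]["products"]  # hypothetical database and collection

# One rich document holds the product and everything related to it.
catalog.insert_one({
    "_id": "sku-1234",
    "name": "Trail Running Shoe",
    "category": "footwear",
    "price": 129.99,
    "specs": {"weight_g": 280, "drop_mm": 6},
    "reviews": [
        {"user": "alice", "rating": 5, "text": "Great grip."},
        {"user": "bob", "rating": 4, "text": "Runs a bit small."},
    ],
})

# "Get all the data for a product" is a single-document read -- no joins.
product = catalog.find_one({"_id": "sku-1234"})

# "Get all the products in a category" is a range read on one indexed field.
catalog.create_index("category")
footwear = list(catalog.find({"category": "footwear"}))
```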

On the other hand, a write-heavy workload where writes update bits and pieces of a larger relational structure runs more efficiently with smaller documents, stored together in a single collection, that can be accessed independently and indexed for efficient retrieval when all the data is needed. The final choice depends on the frequency and nature of the write patterns and whether there's a high-velocity read pattern operating concurrently. If your workload is read-intensive, you want to get as much as you can in one read. If it's write-intensive, you don't want to rewrite the full document every time a small piece of it changes.
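By contrast, here is a minimal sketch of the single-collection approach for a write-heavy workload, again with made-up names: an orders collection where the order header and each line item are separate small documents that share an order_id.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]  # hypothetical single collection for related documents

# The order header and each line item are separate, small documents sharing order_id.
orders.insert_many([
    {"order_id": "o-789", "type": "header", "customer": "alice", "status": "open"},
    {"order_id": "o-789", "type": "item", "sku": "sku-1234", "qty": 1, "price": 129.99},
    {"order_id": "o-789", "type": "item", "sku": "sku-5678", "qty": 2, "price": 19.99},
])

# A high-velocity write touches only the small document it cares about,
# instead of rewriting one large embedded order document.
orders.update_one(
    {"order_id": "o-789", "type": "item", "sku": "sku-5678"},
    {"$set": {"qty": 3}},
)
```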

Joins increase time complexity. In NoSQL databases, depending on the access pattern mix, all the rows from the relational tables are stored either in a single embedded document or as multiple documents in one collection that are linked together by indexes. Storing multiple related documents in a common collection means there is no need for joins. As long as you're indexing on a common dimension across documents, you can query for related documents very efficiently.
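Continuing the hypothetical orders sketch above, indexing the shared order_id (the common dimension) lets a single indexed query pull back every related document without a join:

```python
# Index the common dimension shared by all related documents.
orders.create_index([("order_id", 1), ("type", 1)])

# One indexed query returns the header plus all of its line items -- no join required.
order_docs = list(orders.find({"order_id": "o-789"}))
```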

Now imagine a query that joins three tables in a relational database, and your machine needs to run 1,000 of them. You would need to read at least 3,000 objects from multiple tables to satisfy those 1,000 queries. With the document model, because all the related data is embedded in one document, the same work reads only 1,000 objects from a single collection. Merging 3,000 objects from three tables rather than reading 1,000 from one collection requires a more powerful, and more expensive, instance. With relational databases, you don't have as much control: some queries involve many joins, and the resulting higher time complexity translates directly into more infrastructure required to support the workload.

Mitigate what matters

In a NoSQL database, you want to model data for the highest efficiency where it hurts the most in terms of cost. Analytical queries tend to be low frequency. It doesn't matter as much whether they come back in 100 ms or 10 ms; you just want an answer. For things that run once an hour, once a day, or once a week, it's okay if they're not as efficient as they might be in a normalized relational database. Transactional workloads running thousands of transactions a second, however, need to process as efficiently as possible, because that's where the potential savings are far greater.

Some users try to apply these data modeling techniques to increase efficiency in RDBMS platforms, since most now support document structures similar to MongoDB's. This might work for a small subset of workloads. But the row-based storage used by relational databases is designed for relatively small rows of similar size. It works well for small documents, but as the row grows larger, the data spills into off-row storage. In Postgres this is called TOAST (The Oversized-Attribute Storage Technique). It circumvents the size limit by putting the data in two places, but it also decreases performance in the process. The row-based storage engines used by modern RDBMS platforms were not designed for large documents, and there is no way to configure them to store large documents efficiently.

Drawing out the relationship

The first step we recommend when modeling data is to characterize the workload by asking a few key questions:

  • What is the nature of the workload?

  • What is the entity relationship diagram (ERD)?

  • What are the access patterns?

  • What is the velocity of each pattern?

  • Which are the most important queries that we need to optimize?

Identifying the entities and their relationships to each other forms the basis of our data model. Once this is done, we can begin to distill the access patterns. If it's a read-heavy workload like the product catalog, you'll most likely be working with large objects, which is fine; there are plenty of use cases for that. However, if you're working with more complex access patterns, where you're accessing or updating small pieces of a larger relational structure independently, you will want the data separated into smaller documents so you can execute those high-velocity updates efficiently. We teach many of these techniques in our MongoDB University course, M320: MongoDB Data Modeling.

Working with indexes

Using indexes for high-frequency patterns will give you the best performance. Without an index, the database has to read every document in the collection and examine it to determine which documents match the query conditions. An index is a B-tree structure that can be traversed quickly to identify the documents that match conditions on the indexed attributes specified by the query.
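As a quick, hypothetical example using the product-catalog sketch from earlier, a compound index turns a high-frequency query from a full collection scan into a B-tree lookup, and explain() lets you verify the index is actually being used.

```python
# Without an index, this query scans every document in the collection.
# With a compound index on the queried fields, it becomes a B-tree walk.
catalog.create_index([("category", 1), ("price", 1)])

# The query planner can now satisfy this high-frequency pattern from the index.
cheap_footwear = list(
    catalog.find({"category": "footwear", "price": {"$lt": 100}})
)

# explain() shows whether the index was used (look for an IXSCAN stage in the plan).
plan = catalog.find({"category": "footwear", "price": {"$lt": 100}}).explain()
```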

You may choose not to index uncommon patterns for various reasons. All indexes incur cost, since they must be updated whenever a document changes. You might have a high-velocity write pattern that runs consistently and a low-velocity read that happens at the end of the day, in which case you'll accept the higher cost of a full collection scan for the read query rather than incur the cost of updating the index on every write. If you are writing to a collection 1,000 times a second and reading once a day, the last thing you want to do is add an index update to every single write just to make the read efficient. Again, it depends on the workload. In general, indexes should be created for high-velocity patterns, and your most frequent access patterns should be covered by indexes to some extent, either partially or fully.

Remember that an index still incurs cost even if you rarely or never read it. Whenever you define an index, make sure there is a good reason for it, and that reason should be a high-frequency access pattern that needs the index to read data efficiently.
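One way to check that each index is earning its keep is MongoDB's $indexStats aggregation stage, which reports how often every index on a collection has been used since the server last restarted. A minimal sketch, reusing the hypothetical catalog collection from the earlier examples:

```python
# List indexes that have never been used since the last server restart.
# ($indexStats reports per-index access counters.)
for stats in catalog.aggregate([{"$indexStats": {}}]):
    if stats["accesses"]["ops"] == 0:
        print("candidate for removal:", stats["name"])
```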

Data modeling and developer productivity

Even after you've optimized your data model, cost savings will continue to accrue downstream as developers find that they can develop, iterate on, and maintain systems far more efficiently than in a relational database. Specific document design patterns and characteristics of NoSQL can reduce maintenance overhead and, in many cases, eliminate maintenance tasks altogether. For example, document databases like MongoDB support flexible schemas, which eliminates the need for the maintenance windows that schema migrations and catalog refactoring require with an RDBMS. A schema change in a relational database almost always impacts ORM data adapters, which then need to be refactored to accommodate the change. That's a significant amount of code maintenance for developers.

With a NoSQL database like MongoDB, there's no need for cumbersome, fragile ORM abstraction layers. Developers can store object data in its native form instead of having to normalize it for a tabular model. Updating data objects in MongoDB requires almost zero maintenance: the application just needs to be aware that documents may have new properties, and to know how to bring older documents up to the current schema version when it encounters them.
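One common way to handle this is a schema-versioning pattern: each document carries a schema_version field, and the application upgrades documents lazily as it reads them. The sketch below assumes a hypothetical customers collection and an illustrative v1-to-v2 change; it is just one way to structure the code, not a prescribed MongoDB mechanism.

```python
CURRENT_SCHEMA_VERSION = 2

def read_customer(customers, customer_id):
    """Fetch a document and lazily upgrade it to the current schema version."""
    doc = customers.find_one({"_id": customer_id})
    if doc is None:
        return None

    if doc.get("schema_version", 1) < CURRENT_SCHEMA_VERSION:
        # Hypothetical v1 -> v2 change: replace a single "phone" string
        # with a list of phone numbers.
        updates = {
            "schema_version": CURRENT_SCHEMA_VERSION,
            "phones": [doc["phone"]] if "phone" in doc else [],
        }
        customers.update_one(
            {"_id": customer_id},
            {"$set": updates, "$unset": {"phone": ""}},
        )
        doc.update(updates)
        doc.pop("phone", None)

    return doc
```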

MongoDB will lower license fees and infrastructure costs significantly, but possibly the biggest savings organizations experience from moving away from RDBMS will come from reduced development costs. Not only is there less code overall to maintain, but the application will also be easier to understand for someone who didn't write the code. MongoDB makes migrations far simpler and less prone to failure and downtime. Applications can be updated more frequently, in an easier fashion, and without stressing about whether a schema update will fail and require a rollback. Overall, maintaining applications over their lifetime is far easier with NoSQL databases like MongoDB. These efficiencies add up to significant savings over time.

It's also worth mentioning that a lot of up-and-coming developers see relational databases as legacy technology and not technology they prefer to use. With MongoDB it is easier to attract top talent, a critical factor in any organization's ability to develop best-of-breed products and accelerate time-to-value.

Uplevel your NoSQL data modeling skills

If you want to start reining in the hidden costs in your software development lifecycle by learning how to model data, MongoDB University offers a special course, M320: MongoDB Data Modeling. There are also dozens of other free courses, self-paced video lessons, on-demand labs, and certifications with digital badges to help you master all aspects of developing with MongoDB.