How Much is Your Data Model Costing Your Business?
Economic volatility is creating an unpredictable business climate, forcing organizations to stretch their dollars further and do more with less. Investments are under the microscope, and managers are looking to wring every ounce of productivity out of existing resources. IT spend is a concern and many IT decision-makers aren't sure what's driving costs. Is it overprovisioning? Cloud sprawl? Shadow IT? One area that doesn't get a lot of attention is how the data is modeled in the database. That's unfortunate because data modeling can have a major impact in terms of the cost of database operations, the instance size necessary to handle workloads, and the work required to develop and maintain applications. Pareto patterns Data access patterns are often an illustration of the Pareto Principle at work, where the majority of effects are driven by a minority of causes. Modern OLTP applications tend to work with data in small chunks. The vast majority of data access patterns (the way applications access and use data) work with either a single row of data or a range of rows from a single table. At least that's what we found at Amazon , looking at 10,000 services across all the various RDBMS based services we deployed. Normalized data models are quite efficient for these simple single table queries, but the less frequent complex patterns require the database to join tables to produce a result, exposing RDBMS inefficiencies. The high time complexity associated with these queries meant significantly more infrastructure was required to support them. The relational database hides much of this overhead behind the scenes. When you send a query to a relational database, you don't actually see all the connections opening up on all the tables, or all the objects merging. Even though 90% of the access patterns at Amazon were for simple things, the 10% that were doing more complex things were burning through CPU to the point that my team estimated they were driving ~50% of infrastructure cost. This is where NoSQL data modeling can be a game-changer. NoSQL data models are designed to eliminate expensive joins, reduce CPU utilization, and save on compute costs. Modeling for efficiency in NoSQL There are two fundamental approaches to modeling relational data in NoSQL databases: Embedded Document - All related data is stored in a single rich document which can be efficiently retrieved when needed. Single Collection - Related data is split out into multiple documents to efficiently support access patterns that require subsets of a larger relational structure. Related documents are stored in a common collection and contain attributes that can be indexed to support queries for various groupings of related documents. The key to building an efficient NoSQL data model and reducing compute costs is using the workload to influence the choice of data model. For example, a read-heavy workload like a product catalog that runs queries like, "get all the data for a product" or "get all the products in a category," will benefit from an embedded document model because it avoids overhead of reading multiple documents. On the other hand, a write-heavy workload where writes are updating bits and pieces of a larger relational structure would run more efficiently with smaller documents stored in a single collection which can be accessed independently and indexed to support efficient retrieval when all the data is needed. The final choice depends on the frequency and nature of the write patterns and whether or not there's a high velocity read pattern that's operating concurrently. If your workload is read-intensive, you want to get as much as you can in one read. For a write-intensive workload, you don't want to have to rewrite the full document every time it changes. Joins increase time complexity. In NoSQL databases, depending on the access pattern mix, all the rows from the relational tables are stored either in a single embedded document or as multiple documents in one collection that are linked together by indexes. Storing multiple related documents in a common collection means there is no need for joins. As long as you're indexing on a common dimension across documents, you can query for related documents very efficiently. Now imagine a query that joins three tables in a relational database and your machine needs to do 1,000 of them. You would need to read at least 3,000 objects from multiple tables in order to satisfy the 1,000 queries. With the document model, by embedding all the related data in one document, the query would read only 1,000 objects from a single collection. Machine wise, having to merge 3,000 objects from three tables versus reading 1,000 from one collection will require a more powerful and expensive instance. With relational databases, you don't have as much control. Some queries may result in a lot of joins, resulting in higher time complexity which translates directly into more infrastructure required to support the workload. Mitigate what matters In a NoSQL database, you want to model data for the highest efficiency where it hurts the most in terms of cost. Analytical queries tend to be low frequency. It doesn't matter as much if they come back in 100 ms or 10 ms. You just want to get an answer. For things that run once an hour, once a day, or once a week, it's okay if they're not as efficient as they might be in a normalized relational database. Transactional workloads that are running thousands of transactions a second need to process as efficiently as possible because the potential savings are far greater. Some users try to practice these data modeling techniques to increase efficiency in RDBMS platforms since most now support document structures similar to MongoDB. This might work for a small subset of workloads. But columnar storage is designed for relatively small rows that are the same size. They do work well for small documents, but when you start to increase the size of the row in a relational database, it requires off-row storage. In Postgres this is called TOAST (The Oversized-Attribute Storage Technique). This circumvents the size limit by putting the data in two places, but it also decreases performance in the process. The row based storage engines used by modern RDBMS platforms were not designed for large documents, and there is no way to configure them to store large documents efficiently. Drawing out the relationship The first step we recommend when modeling data is to characterize the workload by asking a few key questions: What is the nature of the workload? What is the entity relationship diagram (ERD)? What are the access patterns? What is the velocity of each pattern? Where are the most important queries that we need to optimize? Identifying the entities and their relationships to each other is going to form the basis of our data model. Once this is done we can begin to distill the access patterns. If it's a read heavy workload like the product catalog you'll most likely be working with large objects, which is fine. There are plenty of use cases for that. However, if you're working with more complex access patterns where you're accessing or updating small pieces of a larger relational structure independently, you will want the data separated into smaller documents so you can efficiently execute those high velocity updates. We teach many of these techniques in our MongoDB University course, M320: MongoDB Data Modeling . Working with indexes Using indexes for high-frequency patterns will give you the best performance. Without an index, you have to read every document in the collection and examine it to determine which documents match the query conditions. An index is a B-tree structure that can be parsed quickly to identify documents that match conditions on the indexed attributes specified by the query. You may choose to not index uncommon patterns for various reasons. All indexes incur cost as they must be updated whenever a document is changed. You might have a high velocity write pattern that runs consistently and a low velocity read that happens at the end of the day, in which case you'll accept the higher cost of the full collection scan for the read query rather than incur the cost of updating the index on every write. If you are writing to a collection 1,000 times a second and reading once a day, the last thing you want to do is add an index update for every single write just to make the read efficient. Again, it depends on the workload. Indexes in general should be created for high-velocity patterns, and your most frequent access patterns should be covered by indexes to some extent, either partially or fully. Remember that an index still incurs cost even if you don't read it very much or at all. Always make sure when you define an index that there is a good reason for it, and that good reason should be that you have a high frequency access pattern that needs to use it to be able to read the data efficiently. Data modeling and developer productivity Even after you've optimized your data model, cost savings will continue to accrue downstream as developers find that they can develop, iterate, and maintain systems far more efficiently than in a relational database. Specific document design patterns and characteristics of NoSQL can reduce maintenance overhead and in many cases eliminate maintenance tasks altogether. For example, document databases like MongoDB support flexible schema which eliminates the need for maintenance windows related to schema migrations and refactoring of a catalog as with RDBMS. A schema change in a relational database almost always impacts ORM data adapters that would need to be refactored to accommodate the change. That's a significant amount of code maintenance for developers. With a NoSQL database like MongoDB, there's no need for cumbersome and fragile ORM abstraction layers. Developers can store object data in its native form instead of having to normalize it for a tabular model. Updating data objects in MongoDB requires almost zero maintenance. The application just needs to be aware documents may have new properties, and how to update them to the current schema version if they don’t. MongoDB will lower license fees and infrastructure costs significantly, but possibly the biggest savings organizations experience from moving away from RDBMS will come from reduced development costs. Not only is there less code overall to maintain, but the application will also be easier to understand for someone who didn't write the code. MongoDB makes migrations far simpler and less prone to failure and downtime. Applications can be updated more frequently, in an easier fashion, and without stressing about whether a schema update will fail and require a rollback. Overall, maintaining applications over their lifetime is far easier with NoSQL databases like MongoDB. These efficiencies add up to significant savings over time. It's also worth mentioning that a lot of up-and-coming developers see relational databases as legacy technology and not technology they prefer to use. With MongoDB it is easier to attract top talent, a critical factor in any organization's ability to develop best-of-breed products and accelerate time-to-value. Uplevel your NoSQL data modeling skills If you want to start reining in the hidden costs in your software development lifecycle by learning how to model data, MongoDB University offers a special course, M320: MongoDB Data Modeling . There are also dozens of other free courses, self-paced video lessons, on-demand labs, and certifications with digital badges to help you master all aspects of developing with MongoDB.
Building with Patterns: A Summary
As we wrap up the Building with Patterns series, it’s a good opportunity to recap the problems the patterns that have been covered solve and highlight some of the benefits and trade-offs each pattern has. The most frequent question that is asked about schema design patterns, is “I’m designing an application to do X, how do I model the data?” As we hope you have discovered over the course of this blog series, there are a lot of things to take into consideration to answer that.
Building with Patterns: The Schema Versioning Pattern
It has been said that the only thing constant in life is change. This holds true to database schemas as well. Information we once thought wouldn’t be needed, we now want to capture. Or new services become available and need to be included in a database record. Regardless of the reason behind the change, after a while, we inevitably need to make changes to the underlying schema design in our application. While this often poses challenges, and perhaps at least a few headaches in a legacy tabular database system, in MongoDB we can use the Schema Versioning pattern to make the changes easier.
Building with Patterns: The Document Versioning Pattern
Databases, such as MongoDB, are very good at querying lots of data and updating that data frequently. In most cases, however, we are only performing queries on the latest state of the data. What about situations in which we need to query previous states of the data? What if we need to have some functionality of version control of our documents? This is where we can use the Document Versioning Pattern.
Building with Patterns: The Preallocation Pattern
One of the great things about MongoDB is the document data model. It provides for a lot of flexibility not only in schema design but in the development cycle as well. Not knowing what fields will be required down the road is easily handled with MongoDB documents. However, there are times when the structure is known and being able to fill or grow the structure makes the design much simpler. This is where we can use the Preallocation Pattern.
Building with Patterns: The Tree Pattern
Many of the schema design patterns we've covered so far have stressed that saving time on JOIN operations is a benefit. Data that's accessed together should be stored together and some data duplication is okay. A schema design pattern like Extended Reference is a good example. However, what if the data to be joined is hierarchical? For example, you would like to identify the reporting chain from an employee to the CEO? MongoDB provides the $graphLookup operator to navigate the data as graphs, and that could be one solution. However, if you need to do a lot of queries of this hierarchical data structure, you may want to apply the same rule of storing together data that is accessed together. This is where we can use the Tree Pattern.
Building with Patterns: The Approximation Pattern
Imagine a fairly decent sized city of approximately 39,000 people. The exact number is pretty fluid as people move in and out of the city, babies are born, and people die. We could spend our days trying to get an exact number of residents each day. But most of the time that 39,000 number is "good enough." Similarly, in many applications we develop, knowing a "good enough" number is sufficient. If a "good enough" number is good enough then this is a great opportunity to put the Approximation Pattern to work in your schema design.
Building with Patterns: The Extended Reference Pattern
Throughout this Building With Patterns series, I hope you've discovered that a driving force in what your schema should look like, is what the data access patterns for that data are. If we have a number of similar fields, the Attribute Pattern may be a great choice. Does accommodating access to a small portion of our data vastly alter our application? Perhaps the Outlier Pattern is something to consider. Some patterns, such as the Subset Pattern, reference additional collections and rely on JOIN operations to bring every piece of data back together. What about instances when there are lots of JOIN operations needed to bring together frequently accessed data? This is where we can use the Extended Reference pattern.
Building with Patterns: The Subset Pattern
Some years ago, the first PCs had a whopping 256KB of RAM and dual 5.25" floppy drives. No hard drives as they were incredibly expensive at the time. These limitations resulted in having to physically swap floppy disks due to a lack of memory when working with large (for the time) amounts of data. If only there was a way back then to only bring into memory the data I frequently used, as in a subset of the overall data. Modern applications aren't immune from exhausting resources. MongoDB keeps frequently accessed data, referred to as the working set , in RAM. When the working set of data and indexes grows beyond the physical RAM allotted, performance is reduced as disk accesses starts to occur and data rolls out of RAM.
Building With Patterns: The Computed Pattern
We've looked at various ways of optimally storing data in the Building with Patterns series. Now, we're going to look at a different aspect of schema design. Just storing data and having it available isn't, typically, all that useful. The usefulness of data becomes much more apparent when we can compute values from it. What's the total sales revenue of the latest Amazon Alexa? How many viewers watched the latest blockbuster movie? These types of questions can be answered from data stored in a database but must be computed.
Building With Patterns: The Outlier Pattern
So far in this Building with Patterns series, we've looked at the Polymorphic , Attribute , and Bucket patterns. While the document schema in these patterns has slight variations, from an application and query standpoint, the document structures are fairly consistent. What happens, however, when this isn't the case? What happens when there is data that falls outside the "normal" pattern? What if there's an outlier?
Building with Patterns: The Bucket Pattern
In this edition of the Building with Patterns series, we're going to cover the Bucket Pattern. This pattern is particularly effective when working with Internet of Things (IoT), Real-Time Analytics, or Time-Series data in general. By bucketing data together we make it easier to organize specific groups of data, increasing the ability to discover historical trends or provide future forecasting and optimize our use of storage.