Normalization is the process of organizing data to minimize redundancy in a database. Data redundancy means having the same piece of data in multiple places. Normalization often splits a large collection (table) into multiple smaller collections.
The first database was invented 50 years ago, around 1970. Data storage was extremely expensive back then. Optimizing database storage was the primary goal of application developers.
Applications back then also weren’t storing the variety of data that we have today. Because of these costs, getting creative with data storage often wasn’t possible: storing terabytes or petabytes of data wasn’t financially feasible for most use cases.
Take a look at the graph below: storage costs have decreased dramatically over the years.
Though normalization helps us reduce redundancy in our database, accessing data then requires joins across multiple tables, and updates require transactions to ensure they are not applied halfway through.
Suppose we are building a social networking app, and we want to show the count of followers for an account on their profile.
Consider a famous account with more than 400K followers. In a naïve normalized design, every time a user searches for this profile we have to scan the followers table and count the entries for this particular user. This operation is costly and slow, and as a result can cause very high latency.
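To make the cost concrete, here is a minimal sketch of the naive normalized design using SQLite. The table and column names (`users`, `followers`, `follower_id`, `followee_id`) are illustrative assumptions, not a prescribed schema:

```python
import sqlite3

# Normalized design: each follow relationship is one row in a
# followers table, and the count must be computed on every read.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("CREATE TABLE followers (follower_id INTEGER, followee_id INTEGER)")

conn.execute("INSERT INTO users VALUES (1, 'celebrity')")
# Simulate many follow edges pointing at user 1.
conn.executemany(
    "INSERT INTO followers VALUES (?, ?)",
    [(i, 1) for i in range(2, 1002)],  # 1000 followers
)

# Every profile view pays for an aggregation over the followers table.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM followers WHERE followee_id = ?", (1,)
).fetchone()
print(count)  # 1000
```

With a handful of rows this is instant, but for an account with hundreds of thousands of followers the aggregation runs on every single profile view.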
Denormalization is an optimization technique that keeps redundant copies of data in multiple collections (or tables) to avoid joins. It lets us keep a single entity’s data in one document instead of spreading it across multiple documents and collections. This way, we don’t have to fetch information from other collections when it’s required, which makes the query process a lot more efficient.
In the example above, as we saw, to display a user’s profile we have to fetch the count of that user’s followers from the followers table.
Now, using denormalization, we can simply add a new field called followersCount to the users table that holds the total number of followers a user has. We are storing a calculated field so that we don’t have to count the followers every time the profile is fetched: we simply return the value of followersCount, and we update it every time a new user follows, or an existing follower unfollows, the account.
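The idea can be sketched with plain in-memory structures. This is a hypothetical illustration, assuming a user record keyed by id and a set of (follower, followee) edges; the field name `followersCount` follows the text:

```python
# Denormalized design: the user record carries a precomputed
# followersCount, updated whenever a follow or unfollow happens.
users = {1: {"name": "celebrity", "followersCount": 0}}
followers = set()  # (follower_id, followee_id) edges

def follow(follower_id, followee_id):
    if (follower_id, followee_id) not in followers:
        followers.add((follower_id, followee_id))
        users[followee_id]["followersCount"] += 1

def unfollow(follower_id, followee_id):
    if (follower_id, followee_id) in followers:
        followers.remove((follower_id, followee_id))
        users[followee_id]["followersCount"] -= 1

follow(2, 1)
follow(3, 1)
unfollow(2, 1)
# A profile view is now a single field read; no counting required.
print(users[1]["followersCount"])  # 1
```

Reading the profile no longer touches the followers data at all; the cost has moved from the read path to the write path.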
But we need to be careful when introducing denormalization into our data model: this design can significantly increase the write load if not implemented properly. Consider the same example, where we are essentially caching the follower count in the celebrity’s own document. Every time the celebrity gains a follower, we have to update the count in that document. Updating a document comes with a cost, and we need to determine the right tradeoff between these costs and our application’s requirements.
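One way to keep the cached count from drifting is to apply the follow edge and the counter update together in a single transaction. A minimal sketch using SQLite (schema names are illustrative assumptions):

```python
import sqlite3

# Each follow now costs two writes instead of one -- the extra write
# load the text describes -- but wrapping them in one transaction
# means readers never observe a half-applied follow.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, followers_count INTEGER)")
conn.execute("CREATE TABLE followers (follower_id INTEGER, followee_id INTEGER)")
conn.execute("INSERT INTO users VALUES (1, 0)")

def follow(conn, follower_id, followee_id):
    with conn:  # implicit BEGIN ... COMMIT, or ROLLBACK on error
        conn.execute(
            "INSERT INTO followers VALUES (?, ?)", (follower_id, followee_id)
        )
        conn.execute(
            "UPDATE users SET followers_count = followers_count + 1 WHERE id = ?",
            (followee_id,),
        )

follow(conn, 2, 1)
follow(conn, 3, 1)
(count,) = conn.execute(
    "SELECT followers_count FROM users WHERE id = 1"
).fetchone()
print(count)  # 2
```

For a celebrity account, that hot counter row is updated on every follow, which is exactly the kind of write amplification to weigh against the cheaper reads.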
We have learned what denormalization is, what its benefits are, and what its tradeoffs are. In the next post, we’ll dive deeper into its applications, tradeoffs, and things to consider when implementing denormalization.