Slowly Changing Dimensions and Their Application in MongoDB
Rate this article
The concept of “slowly changing dimensions” (usually abbreviated as SCD) has been around for a long time and is a staple in SQL-based data warehousing. The fundamental idea is to track all changes to data in the data warehouse over time. The “slowly changing” part of the name refers to the assumption that the data that is covered by this data model changes with a low frequency, but without any apparent pattern in time. This data model is used when the requirements for the data warehouse cover functionality to track and reproduce outputs based on historical states of data.
One common case of this is for reporting purposes, where the data warehouse must explain the difference of a report produced last month, and why the aggregated values are different in the current version of the report. Requirements such as these are often encountered in financial reporting systems.
There are many ways to implement slowly changing dimensions in SQL, referred to as the “types.” Types 0 and 1 are the most basic ones that only keep track of the current state of data (in Type 1) or in the original state (Type 0). The most commonly applied one is Type 2. SCD Type 2 implements three new fields, “validFrom,” “validTo,” and an optional flag on the latest set of data, which is usually called “isValid” or “isEffective.”
Table of SCD types:
|SCD Type 0||Only keep original state, data can not be changed|
|SCD Type 1||Only keep updated state, history can not be stored|
|SCD Type 2||Keep history in new row/document|
|SCD Type 3||Keep history in new fields in same row/document|
|SCD Type 4||Keep history in separate collection|
|SCD Types >4||Combinations of previous types — e.g., Type 5 is Type 1 plus Type 4|
In this simplest implementation of SCD, every record contains the information on the validity period for this set of data and all different validities are kept in the same collection or table.
In applying this same concept to MongoDB’s document data model, the approach is exactly the same as in a relational database. In the comparison of data models, the normalization that is the staple of relational databases is not the recommended approach in the document model, but the details of this have been covered in many blog posts — for example, the . The concept of slowly changing dimensions applies on a per document basis in the chosen and optimized data model for the specific use case. The best way to illustrate this is in a small example.
Consider the following use case: Your MongoDB stores the prices of a set of items, and you need to keep track of the changes of the price of an item over time, in order to be able to process returns of an item, as the money refunded needs to be the price of the item at the time of purchase. You have a simple collection called “prices” and each document has an itemID and a price.
Now, the price of “pants” changes from 5 to 7. This can be done and tracked by assuming default values for the necessary data fields for SCD Type 2. The default value for “validFrom” is 01.01.1900, “validTo” is 01.01.9999, and isValid is “true.”
The change to the price of the “pants” item is then executed as an insert of the new document, and an update to the previously valid one.
As it is essential that the chain of validity is unbroken, the two database operations should happen with the same timestamp. Depending on the requirements of the application, it might make sense to wrap these two commands into a transaction to ensure both changes are always applied together. There are also ways to push this process to the background, but as per the initial assumption in the slowly changing dimensions, changes like this are infrequent and data consistency is the highest priority. Therefore, the performance impact of a transaction is acceptable for this use case.
If you then want to query the latest price for an item, it’s as simple as specifying:
And if you want to query for the state at a specific point in time:
This example shows that the flexibility of the document model allows us to take a relational concept and directly apply it to data inside MongoDB. But it also opens up other methods that are not possible in relational databases. Consider the following: What if you only need to track changes to very few fields in a document? Then you could simply embed the history of a field as an array in the first document. This implements SCD Type 3, storing the history in new fields, but without the limitation and overhead of creating new columns in a relational database. SCD Type 3 in RDMBS is usually limited to storing only the last few changes, as adding new columns on the fly is not possible.
The following aggregation pipeline does exactly that. It changes the price to 7, and stores the previous value of the price with a timestamp of when the old price became invalid in an array called “priceHistory”:
There are some caveats to that solution which cover large array sizes, but there are known solutions to deal with these kinds of data modeling challenges. In order to avoid large arrays, you could apply the “Outlier” or “Bucketing” patterns of the many possibilities in and many useful explanations on .
In this way, you could store the most recent history of data changes in the documents themselves, and if any analysis gets deeper into past changes, it would have to load the older change history from a separate collection. This approach might sound similar to the stated issue of adding new fields in a relational database, but there are two differences: Firstly, MongoDB does not encounter this problem until more than 100 changes are done on a single document. And secondly, MongoDB has tools to dynamically deal with large arrays, whereas in relational DBs, the solution would be to choose a different approach, as even pre-allocating more than 10 columns for changes is not a good idea in SQL.
But in both worlds, dealing with many changes in SCD Type 3 requires an extension to a different SCD type, as having a separate collection for the history is SCD Type 4.
The shown example focuses on a strict and accurate representation of changes. Sometimes, there are less strict requirements on the necessity to show historical changes in data. It might be that 95% of the time, the applications using the MongoDB database are only interested in the current state of the data, but some (analytical) queries still need to be run on the full history of data.
If the requirement for tracking the changes is different in a way that not every single change needs to be tracked, but rather a series of checkpoints is required to show the state of data at specific times, then Atlas Data Lake might be the correct solution. With , you are able to extract a snapshot of the data at specific points in time, giving you a similar level of traceability, albeit at fixed time intervals. Initially the concept of SCD was developed to avoid data duplication in such a case, as it does not store an additional document if nothing changes. In today's world where cold storage has become much more affordable, Data Lake offers the possibility to analyze data from your productive system, using regular snapshots, without doing any changes to the system or even increasing the load on the core database.
All in all, the concept of slowly changing dimensions enables you to cover part of the core requirements for a data warehouse by giving you the necessary tools to keep track of all changes.
While the fundamental concept of slowly changing dimensions was developed with data warehouses in mind, another area where derivatives of the techniques developed there can be useful is in event-driven applications. Given the case that you have infrequent events, in different types of categories, it’s oftentimes an expensive database action to find the latest event per category. The process for that might require grouping and/or sorting your data in order to find the current state.
In this case, it might make sense to amend the data model by a flag similar to the “isValid'' flag of the SCD Type 2 example above, or even go one step further and not only store the event time per document, but adding the time of the next event in a similar fashion to the SCD Type 2 implementation. The flag enables very fast queries for the latest set of data per event type, and the date ensures that if you execute a search for a specific point in time, it’s easy and efficient to get the respective event that you are looking for.
In such a case, it might make sense to separate the “events” and their processed versions that include the isValid flag and the validity end date into separate collections, utilizing more of the methodologies of the different types of SCD implementations.