Paginations 1.0: Time Series Collections in five minutes
Rate this article
As someone who loves to constantly measure myself and everything around me, I was excited to see MongoDB add dedicated time-series collections in MongoDB 5.0. Previously, MongoDB had been great for handling time-series data, but only if you were prepared to write some fairly complicated insert and update code and use a complex schema. In 5.0, all the hard work is done for you, including lots of behind-the-scenes optimization.
Working with time-series data brings some interesting technical challenges for databases. Let me explain.
Time-series data is where we have multiple related data points that have a time, a source, and one or more values. For example, I might be recording my speed on my bike and the gradient of the road, so I have the time, the source (me on that bike), and two data values (speed and gradient). The source would change if it was a different bike or another person riding it.
Time-series data is not simply any data that has a date component, but specifically data where we want to look at how values change over a period of time and so need to compare data for a given time window or windows. On my bike, am I slowing down over time on a ride? Or does my speed vary with the road gradient?
This means when we store time-series data, we usually want to retrieve or work with all data points for a time period, or all data points for a time period for one or more specific sources.
These data points tend to be small. A time is usually eight bytes, an identifier is normally only (at most) a dozen bytes, and a data point is more often than not one or more eight-byte floating point numbers. So, each "record" we need to store and access is perhaps 50 or 100 bytes in length.
This is where dealing with time-series data gets interesting—at least, I think it's interesting. Most databases, MongoDB included, store data on disks, and those are read and written by the underlying hardware in blocks of typically 4, 8, or 32 KB at a time. Because of these disk blocks, the layers on top of the physical disks—virtual memory, file systems, operating systems, and databases—work in blocks of data too. MongoDB, like all databases, uses blocks of records when reading,writing, and caching. Unfortunately, this can make reading and writing these tiny little time-series records much less efficient.
This animation shows what happens when these records are simply inserted into a general purpose database such as MongoDB or an RDBMS.
As each record is received, it is stored sequentially in a block on the disk. To allow us to access them, we use two indexes: one with the unique record identifier, which is required for replication, and the other with the source and timestamp to let us find everything for a specific device over a time period.
This is fine for writing data. We have quick sequential writing and we can amortise disk flushes of blocks to get a very high write speed.
The issue arises when we read. In order to find the data about one device over a time period, we need to fetch many of these small records. Due to the way they were stored, the records we want are spread over multiple database blocks and disk blocks. For each block we have to read, we pay a penalty of having to read and process the whole block, using database cache space equivalent to the block size. This is a lot of wasted compute resources.
MongoDB 5.0 has specialized time-series collections optimized for this type of data, which we can use simply by adding two parameters when creating a collection.
We don't need to change the code we use for reading or writing at all. MongoDB takes care of everything for us behind the scenes. This second animation shows how.
With a time-series collection, MongoDB organizes the writes so that data for the same source is stored in the same block, alongside other data points from a similar point in time. The blocks are limited in size (because so are disk blocks) and once we have enough data in a block, we will automatically create another one. The important point is that each block will cover one source and one span of time, and we have an index for each block to help us find that span.
Doing this means we can have much smaller indexes as we only have one unique identifier per block. We also only have one index per block, typically for the source and time range. This results in an overall reduction in index size of hundreds of times.
Not only that but by storing data like this, MongoDB is better able to apply compression. Over time, data for a source will not change randomly, so we can compress the changes in values that are co-located. This makes for a data size improvement of at least three to five times.
And when we come to read it, we can read it several times faster as we no longer need to read data, which is not relevant to our query just to get to the data we want.
And that, in a nutshell, is MongoDB time-series collections. I can just specify the time and source fields when creating a collection and MongoDB will reorganise my cycling data to make it three to five times smaller, as well as faster, to read and analyze.