MongoDB's New Time Series Collections
Time-series data are measurements taken at time intervals. Sometimes time-series data will come into your database at high frequency - use-cases like financial transactions, stock market data, readings from smart meters, or metrics from services you're hosting over hundreds or even thousands of servers. In other cases, each measurement may only come in every few minutes. Maybe you're tracking the number of servers that you're running every few minutes to estimate your server costs for the month. Perhaps you're measuring the soil moisture of your favourite plant once a day.
|Regular||Service metrics||Number of sensors providing weather metrics|
|Irregular||Financial transactions, Stock prices?||LPWAN data|
However, when it comes to time-series data, it isn’t all about frequency, the only thing that truly matters is the presence of time so whether your data comes every second, every 5 minutes, or every hour isn’t important for using MongoDB for storing and working with time-series data.
From the very beginning, developers have been using MongoDB to store time-series data. MongoDB can be an extremely efficient engine for storing and processing time-series data, but you'd have to know how to correctly model it to have a performant solution, but that wasn't as straightforward as it could have been.
Starting in MongoDB 5.0 there is a new collection type, time-series collections, which are specifically designed for storing and working with time-series data without the hassle or need to worry about low-level model optimization.
Time series collections are a new collection type introduced in MongoDB 5.0. On the surface, these collections look and feel like every other collection in MongoDB. You can read and write to them just like you do regular collections and even create secondary indexes with the createIndex command. However, internally, they are natively supported and optimized for storing and working with time-series data.
Under the hood, the creation of a time series collection results in a collection and an automatically created writable non-materialized view which serves as an abstraction layer. This abstraction layer allows you to always work with their data as single documents in their raw form without worry of performance implications as the actual time series collection implements a form of the bucket pattern you may already know when persisting data to disk, but these details are something you no longer need to care about when designing your schema or reading and writing your data. Users will always be able to work with the abstraction layer and not with a complicated compressed bucketed document.
Well because you have time-series data, right?
Of course that may be true, but there are so many more reasons to use the new time series collections over regular collections for time-series data.
Ease of use, performance, and storage efficiency were paramount goals when creating time series collections. Time series collections allow you to work with your data model like any other collection as single documents with rich data types and structures. They eliminate the need to model your time-series data in a way that it can be performant ahead of time - they take care of all this for you!
You can design your document models more intuitively, the way you would with other types of MongoDB collections. The database then optimizes the storage schema for ingestion, retrieval, and storage by providing native compression to allow you to efficiently store your time-series data without worry about duplicated fields alongside your measurements.
Despite being implemented in a different way from the collections you've used before, to optimize for time-stamped documents, it's important to remember that you can still use the MongoDB features you know and love, including things like nesting data within documents, secondary indexes, and the full breadth of analytics and data transformation functions within the aggregation framework, including joining data from other collections, using the
$lookupoperator, and creating materialized views using
Creating a time series collection is straightforward, all it takes is a field in your data that corresponds to time, just pass the new "timeseries'' field to the createCollection command and you’re off and running. However, before we get too far ahead, let’s walk through just how to do this and all of the options that allow you to optimize time series collections.
Throughout this post, we'll show you how to create a time series collection to store documents that look like the following:
As mentioned before, a time series collection can be created with just a simple time field. In order to store documents like this in a time series collection, we can pass the following to the createCollection command:
Pretty fast right? While timeseries collections only require a timeField, there are other optional parameters that can be specified at creation or in some cases at modification time which will allow you to get the most from your data and time series collections. Those optional parameters are metaField, granularity, and expireAfterSeconds.
While not a required parameter, metaField allows for better optimization when specified, including the ability to create secondary indexes.
In the example above, the metaField would be the "source" field:
This is an object consisting of key-value pairs which describe our time-series data. In this example, an identifying ID and location for a sensor collecting weather data.
The metaField field can be a complicated document with nested fields, an object, or even simply a single GUID or string. The important point here is that the metaField is really just metadata which serves as a label or tag which allows you to uniquely identify the source of a time-series, and this field should never or rarely change over time.
It is recommended to always specify a metaField, but you would especially want to use this when you have multiple sources of data such as sensors or devices that share common measurements.
The metaField, if present, should partition the time-series data, so that measurements with the same metadata relate over time. Measurements with a common metaField for periods of time will be grouped together internally to eliminate the duplication of this field at the storage layer. The order of metadata fields is ignored in order to accommodate drivers and applications representing objects as unordered maps. Two metadata fields with the same contents but different order are considered to be identical.
As with the timeField, the metaField is specified as the top-level field name when creating a collection. However, the metaField can be of any BSON data type except array and cannot match the timeField required by timeseries collections. When specifying the metaField, specify the top level field name as a string no matter its underlying structure or data type.
Data in the same time period and with the same metaField will be colocated on disk/SSD, so choice of metaField field can affect query performance.
The granularity parameter represents a string with the following options:
Granularity should be set to the unit that is closest to rate of ingestion for a unique metaField value. So, for example, if the collection described above is expected to receive a measurement every 5 minutes from a single source, you should use the "minutes" granularity, because source has been specified as the metaField.
In the first example, where only the timeField was specified and no metaField was identified (try to avoid this!), the granularity would need to be set relative to the total rate of ingestion, across all sources.
The granularity should be thought about in relation to your metadata ingestion rate, not just your overall ingestion rate. Specifying an appropriate value allows the time series collection to be optimized for your usage.
By default, MongoDB defines the granularity to be "seconds", indicative of a high-frequency ingestion rate or where no metaField is specified.
Time series data often grows at very high rates and becomes less useful as it ages. Much like last week leftovers or milk you will want to manage your data lifecycle and often that takes the form of expiring old data.
Just like TTL indexes, time series collections allow you to manage your data lifecycle with the ability to automatically delete old data at a specified interval in the background. However, unlike TTL indexes on regular collections, time series collections do not require you to create an index to do this.
Simply specify your retention rate in seconds during creation time, as seen below, or modify it at any point in time after creation with collMod.
The expiry of data is only one way MongoDB natively offers you to manage your data lifecycle. In a future post we will discuss ways to automatically archive your data and efficiently read data stored in multiple locations for long periods of time using MongoDB Online Archive.
Putting it all together, we’ve walked you through how to create a timeseries collection and the different options you can and should specify to get the most out of your data.
The above document can now be efficiently stored and accessed from a time series collection using the below createCollection command.
While this is just an example, your document can look like nearly anything. Your schema is your choice to make with the freedom that you need not worry about how that data is compressed and persisted to disk. Optimizations will be made automatically and natively for you.
In the initial MongoDB 5.0 release of time series collection there are some limitations that exist. The most notable of these limitations is that the timeseries collections are considered append only, so we do not have support on the abstraction level for update and/or delete operations. Update and/delete operations can still be performed on time series collections, but they must go directly to the collection stored on disk using the optimized storage format and a user must have the proper permissions to perform these operations.
In addition to the append only nature, in the initial release, time series collections will not work with Change Streams, Realm Sync, or Atlas Search. Lastly, time series collections allow for the creation of secondary indexes as discussed above. However, these secondary indexes can only be defined on the metaField and/or timeField.
For a full list of limitations, please consult the official MongoDB documentation page.
While we know some of these limitations may be impactful to your current use case, we promise we're working on this right now and would love for you to provide your feedback!
Now that you know what time series data is, when and how you should create a timeseries collection and some details of how to set parameters when creating a collection. Why don't you go create a timeseries collection now? Our next blog post will go into more detail on how to optimize your time series collection for specific use-cases.
You may be interested in migrating to a time series collection from an existing collection! We'll be covering this in a later post, but in the meantime, you should check out the official documentation for a list of migration tools and examples.
At the Intersection of AI/ML and HCI with Douglas Eck of Google (MongoDB Podcast)
May 16, 2022