Window Functions & Time Series Collections
Rate this article
Window functions and time series collections are both features that were added to MongoDB 5.0. Window functions allow you to run a window across a sorted set of documents, producing calculations over each step of the window, like rolling average or correlation scores. Time-series collections dramatically reduce the storage cost and increase the performance of MongoDB when working with time-series data. Window functions can be run on any type of collection in MongoDB, not just time-series collections, but the two go together like ... two things that go together really well. I'm not a fan of peanut butter and jelly sandwiches, but you get the idea!
In this article, I'll help you get set up with a data project I've created, and then I'll show you how to run some window functions across the data. These kinds of operations were possible in earlier versions of MongoDB, but window functions, with the
$setWindowFieldsstage, make these operations relatively straightforward.
You'll also need the following software installed on your development machine to follow along with the code in the sample project:
Once you have your time series collections correctly set up, and you're filling them with lots of time series data, you'll be ready to start analyzing the data you're collecting. Because Time Series collections are all about, well, time, you're probably going to run temporal operations on the collection to get the most recent or oldest data in the collection. You will also probably want to run calculations across measurements taken over time. That's where MongoDB's new window functions are especially useful.
Temporal operators and window functions can be used with any type of collection, but they're especially useful with time series data, and time series collections will be increasingly optimized for use with these kinds of operations.
I've written some scripts to automate the process of creating a time series collection and importing the data into the collection. I've also automated running some of the operations described below on the data, so you can see the results. You can find the scripts on
, along with information on how to run them if you want to do that while you're following along with this blog post.
At the time of writing, time series collections have only just been released with the release of MongoDB 5.0. As such, integration with the Aggregation tab of the Atlas Data Explorer interface isn't complete, and neither is integration with MongoDB Charts.
I've glued all the scripts together using a task runner called Just. It's a bit like Make, if you've used that, but easier to install and use. You don't have to use it, but it has some neat features like reading config from a dotenv file automatically. I highly recommend you try it out!
First create a file called ".env", and add a configuration variable called
MDB_URI, like this:
Your URI and the credentials in it will be different, and you can get it from the Atlas user interface, by logging in to Atlas and clicking on the "Connect" button next to your cluster details. Make sure you've spun up a MongoDB 5.0 cluster, or higher.
Once you've saved the .env file, open your command-line to the correct directory and run
just connectto test the configuration - it'll instruct
mongoshto open up an interactive shell connected to your cluster.
You can run
db.ping()just to check that everything's okay, and then type exit followed by the "Enter" key to quit mongosh.
You can run
just initto create the collection, but if you're not using Just, then the command to run inside mongosh to create your collection is:
This will create a time-series collection called "stock_exchange_data", with a time field of "ts", a metaField of "source" (specifying the stock exchange each set of measurements is relevant to), and because there is one record per source per day, I've chosen the closest granularity, which is "hours".
If you run
just importit'll import the data into the collection you just created, via the following CLI command:
Note: When you're importing data into a time-series collection, it's very important that your data is in chronological order, otherwise the import will be very slow!
A single sample document looks like this:
In a way that matches the collection's time-series parameters, "ts" contains the timestamp for the measurements in the document, and "source" contains metadata describing the source of the measurements - in this case, the Hong Kong Stock Exchange.
You can read about the meaning of each of the measurements in the documentation for the dataset. I'll mainly be working with "closeUSD", which is the closing value for the exchange, in dollars at the end of the specified day.
Window functions allow you to apply a calculation to values in a series of ordered documents, either over a specified window of time, or a specified number of documents.
I want to visualise the results of these operations in Atlas Charts. You can attach an Aggregation Pipeline to a Charts data source, so you can use
$setWindowFunctiondirectly in data source aggregations. In this case, though, I'll show you how to run the window functions with a
$mergestage, writing to a new collection, and then the new collection can be used as a Charts data source. This technique of writing pre-calculated results to a new collection is often referred to as a materialized view, or colloquially with time-series data, a rollup.
First, I charted the "stock_exchange_data" in MongoDB Charts, with "ts" (the timestamp) on the x-axis, and "closeUSD" on the y axis, separated into series by "source.exchange." I've specifically filtered the data to the year of 2008, so I could investigate the stock market values during the credit crunch at the end of the year.
You'll notice that the data above is quite spiky. A common way to smooth out spiky data is by running a rolling average on the data, where each day's data is averaged with the previous 5 days, for example.
The following aggregation pipeline will create a smoothed chart:
The first step applies the $avg window function to the closeUSD value. The data is partitioned by "$source" because the different stock exchanges are discrete series, and should be averaged separately. I've chosen to create a window over 6 documents at a time, rather than 6 days, because there are no values over the weekend, and this means each value will be created as an average of an equal number of documents, whereas otherwise the first day of each week would only include values from the last 3 days from the previous week.
The second $merge stage stores the result of the aggregation in the "stock_exchange_data_processed" collection. Each document will be identical to the equivalent document in the "stock_exchange_data" collection, but with an extra field, "window.rollingCloseUSD".
Plotting this data shows a much smoother chart, and the drop in various exchanges in September can more clearly be seen.
It's possible to run more than one window function over the same collection in a single $setWindowFields stage, providing they all operate on the same sequence of documents (although the window specification can be different).
The file window_functions.js contains the following stage, that executes two window functions on the collection:
Notice that although the sort order of the collection must be shared across both window functions, they can specify the window individually - the $avg function operates on a window of 6 documents, whereas the $derivative executes over pairs of documents.
The derivative plot, filtered for just the New York Stock Exchange is below:
This shows the daily difference in the market value at the end of each day. I'm going to admit that I've cheated slightly here, to demonstrate the
$derivativewindow function here. It would probably have been more appropriate to just subtract
$last. But that's a blog post for a different day.
The chart above is quite spiky, so I added another window function in the next stage, to average out the values over 10 days:
Those two big troughs at the end of the year really highlight when the credit crunch properly started to hit. Remember that just because you've calculated a value with a window function in one stage, there's nothing to stop you feeding that value into a later
$setWindowFieldsstage, like I have here.
Window functions are a super-powerful new feature of MongoDB, and I know I'm going to be using them with lots of different types of data - but especially with time-series data. I hope you found this article useful!