Schema Design for Time Series Data in MongoDB

Before you read on: this article is from 2013, and a lot has happened with MongoDB and time series data in the intervening years as we continue to innovate, including a three-part series in 2018 and, most recently in 2021, MongoDB's New Time Series Collections.

This is a post by Sandeep Parikh, Solutions Architect at MongoDB and Kelly Stirman, Director of Products at MongoDB.

Data as Ticker Tape

New York is famous for a lot of things, including ticker tape parades.

For decades the most popular way to track the price of stocks on Wall Street was through ticker tape, the earliest digital communication medium. Stocks and their values were transmitted via telegraph to a small device called a “ticker” that printed onto a thin roll of paper called “ticker tape.” While out of use for over 50 years, the idea of the ticker lives on in scrolling electronic tickers at brokerage walls and at the bottom of most news networks, sometimes two, three and four levels deep.

Today there are many sources of data that, like ticker tape, represent observations ordered over time. For example:

  • Financial markets generate prices (we still call them “stock ticks”).
  • Sensors measure temperature, barometric pressure, humidity and other environmental variables.
  • Industrial fleets such as ships, aircraft and trucks produce location, velocity, and operational metrics.
  • Social networks produce status updates.
  • Calls, SMS messages and other signals from mobile devices.
  • Systems themselves write information to logs.

This data tends to be immutable, large in volume, ordered by time, and is primarily aggregated for access. It represents a history of what happened, and there are a number of use cases that involve analyzing this history to better predict what may happen in the future or to establish operational thresholds for the system.

Time Series Data and MongoDB

Time series data is a great fit for MongoDB. There are many examples of organizations using MongoDB to store and analyze time series data. Here are just a few:

  • Silver Spring Networks, the leading provider of smart grid infrastructure, analyzes utility meter data in MongoDB.
  • EnerNOC analyzes billions of energy data points per month to help utilities and private companies optimize their systems, ensure availability and reduce costs.
  • Square maintains a MongoDB-based open source tool called Cube for collecting timestamped events and deriving metrics.
  • Server Density uses MongoDB to collect server monitoring statistics.
  • Appboy, the leading platform for mobile relationship management, uses MongoDB to track and analyze billions of data points on user behavior.
  • Skyline Innovations, a solar energy company, stores and organizes meteorological data from commercial scale solar projects in MongoDB.
  • One of the world’s largest industrial equipment manufacturers stores sensor data from fleet vehicles to optimize fleet performance and minimize downtime.

In this post, we will take a closer look at how to model time series data in MongoDB by exploring the schema of a tool that has become very popular in the community: MongoDB Management Service (MMS). MMS helps users manage their MongoDB systems by providing monitoring, visualization and alerts on over 100 database metrics. Today the system monitors over 25k MongoDB servers across thousands of deployments. Every minute, thousands of local MMS agents collect system metrics and ship the data back to MMS. The system processes over 5B events per day, and over 75,000 writes per second, all on fewer than 10 physical servers for the MongoDB tier.

Schema Design and Evolution

How do you store time series data in a database? In relational databases the answer is somewhat straightforward; you store each event as a row within a table. Let’s say you were monitoring the amount of system memory used per second. In that example you would have a table and rows that looked like the following:

timestamp                   memory_used
2013-10-10T23:06:37.000Z    1000000
2013-10-10T23:06:38.000Z    2000000


If we map that storage approach to MongoDB, we would end up with one document per event:

{
  timestamp: ISODate("2013-10-10T23:06:37.000Z"),
  type: "memory_used",
  value: 1000000
},
{
  timestamp: ISODate("2013-10-10T23:06:38.000Z"),
  type: "memory_used",
  value: 2000000
}

While this approach is valid in MongoDB, it doesn’t take advantage of the expressive nature of the document model. Let’s take a closer look at how we can refine the model to provide better performance for reads and to improve storage efficiency.

The Document-Oriented Design

A better schema approach looks like the following. It is not the same as the MMS schema, but it helps illustrate the key concepts. Let’s call it the document-oriented design:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    …  
    37: 1000000,
    38: 1500000,
    … 
    59: 2000000
  }
}

We store multiple readings in a single document: one document per minute. To further improve the efficiency of the schema, we can isolate repeating data structures. In the `timestamp_minute` field we capture the minute that identifies the document, and for each memory reading we store a new value in the `values` sub-document. Because we are storing one value per second, we can simply represent each second as a field named 0 through 59.

More Updates than Inserts

In any system there may be tradeoffs regarding the efficiency of different operations, such as inserts and updates. For example, in some systems updates are implemented as copies of the original record written out to a new location, which requires updating of indexes as well. One of MongoDB’s core capabilities is the in-place update mechanism: field-level updates are managed in place as long as the size of the document does not grow significantly. By avoiding rewriting the entire document and index entries unnecessarily, far less disk I/O is performed. Because field-level updates are efficient, we can design for this advantage in our application: with the document-oriented design there are many more updates (one per second) than inserts (one per minute).

For example, if you wanted to maintain a count in your application, MongoDB provides a handy operator that increments or decrements a field. Instead of reading a value into your application, incrementing, then writing the value back to the database, you can simply increase the field using $inc:

```{ $inc: { pageviews: 1 } }```

This approach has a number of advantages. First, the increment operation is atomic: multiple threads can safely increment a field concurrently using $inc. Furthermore, this approach is more efficient for disk operations, requires less data to be sent over the network and requires fewer round trips by omitting the need for any reads. Those are three big wins that result in a simpler, more efficient and more scalable system. The same advantages apply to the use of the $set operator.
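To make the atomicity point concrete, here is a minimal sketch in plain JavaScript. It uses a toy in-memory object rather than MongoDB itself, and the helper `applyInc` is hypothetical: it only imitates what a server-side increment does so the two write patterns can be compared.

```javascript
// Toy in-memory stand-in for a stored document; this is NOT MongoDB.
const doc = { pageviews: 0 };

// Pattern 1: read-modify-write. Both clients read before either writes.
const readA = doc.pageviews;
const readB = doc.pageviews;
doc.pageviews = readA + 1; // client A writes its result
doc.pageviews = readB + 1; // client B overwrites it with a stale result
const afterReadModifyWrite = doc.pageviews; // one increment was lost

// Pattern 2: each request carries only the delta and is applied where the data lives.
// applyInc is a hypothetical helper that imitates the effect of $inc.
function applyInc(target, update) {
  for (const [field, delta] of Object.entries(update.$inc)) {
    target[field] = (target[field] || 0) + delta;
  }
}

doc.pageviews = 0;
applyInc(doc, { $inc: { pageviews: 1 } });
applyInc(doc, { $inc: { pageviews: 1 } });
const afterInc = doc.pageviews; // both increments are counted

console.log({ afterReadModifyWrite, afterInc });
```

In the first pattern the application has to coordinate readers and writers itself; in the second there is nothing to coordinate because no client ever holds a stale copy of the value.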

The document-oriented design has several benefits for writing and reading. As previously stated, writes can be much faster as field-level updates because instead of writing a full document we’re sending a much smaller delta update that can be modeled like so:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59": 2000000 } }
)
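The mapping from a reading's timestamp to the document and field it touches can be sketched in plain JavaScript. The helper names `minuteBucket` and `buildMinuteUpdate` are hypothetical, and the sketch assumes timestamps are handled in UTC; it only builds the filter and update documents and does not talk to a database.

```javascript
// Truncate a Date to the start of its minute (UTC). Hypothetical helper.
function minuteBucket(date) {
  const bucket = new Date(date.getTime());
  bucket.setUTCSeconds(0, 0); // zero out seconds and milliseconds
  return bucket;
}

// Build the filter and update documents for a single reading. Hypothetical helper.
function buildMinuteUpdate(type, date, value) {
  const second = date.getUTCSeconds(); // 0..59, used as the field name
  return {
    filter: { timestamp_minute: minuteBucket(date), type: type },
    update: { $set: { ["values." + second]: value } },
  };
}

const op = buildMinuteUpdate(
  "memory_used",
  new Date("2013-10-10T23:06:37.000Z"),
  1000000
);
console.log(JSON.stringify(op, null, 2));
```

The two returned objects are what would be passed as the first and second arguments of an update call like the one above.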

With the document-oriented design, reads are also much faster. If you needed an hour’s worth of measurements using the first approach you would need to read 3600 documents, whereas with this approach you would only need to read 60 documents. Reading fewer documents has the benefit of fewer disk seeks, and with any system fewer disk seeks usually result in significantly better performance.
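The arithmetic behind those numbers, together with the shape of a one-hour range query, can be sketched as follows. The helpers `documentsRead` and `hourRangeFilter` are hypothetical, and the count assumes the requested range is aligned to document boundaries.

```javascript
// Number of documents covering a time range, assuming the range is aligned
// to document boundaries. Hypothetical helper.
function documentsRead(rangeSeconds, secondsPerDocument) {
  return Math.ceil(rangeSeconds / secondsPerDocument);
}

const perEventDocs = documentsRead(3600, 1);   // one document per reading
const perMinuteDocs = documentsRead(3600, 60); // one document per minute

// Filter document selecting every per-minute document within one hour.
// Hypothetical helper; $gte and $lt are standard query operators.
function hourRangeFilter(type, hourStart) {
  const hourEnd = new Date(hourStart.getTime() + 60 * 60 * 1000);
  return { type: type, timestamp_minute: { $gte: hourStart, $lt: hourEnd } };
}

const filter = hourRangeFilter("memory_used", new Date("2013-10-10T23:00:00.000Z"));
console.log(perEventDocs, perMinuteDocs, JSON.stringify(filter));
```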

A natural extension to this approach would be to have documents that span an entire hour, while still keeping the data resolution per second:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: 999999,
    1: 1000000, 
    …,
    3598: 1500000,
    3599: 2000000
  }
}

One benefit to this approach is that we can now access an hour’s worth of data using a single read. However, there is one significant downside: to update the last second of any given hour MongoDB would have to walk the entire length of the “values” object, taking 3600 steps to reach the end. We can further refine the model a bit to make this operation more efficient:

{
  timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
  type: "memory_used",
  values: {
    0: { 0: 999999, 1: 999999, …, 59: 1000000 },
    1: { 0: 2000000, 1: 2000000, …, 59: 1000000 },
    …,
    58: { 0: 1600000, 1: 1200000, …, 59: 1100000 },
    59: { 0: 1300000, 1: 1400000, …, 59: 1500000 }
  }
}

db.metrics.update(
  {
    timestamp_hour: ISODate("2013-10-10T23:00:00.000Z"),
    type: "memory_used"
  },
  { $set: { "values.59.59": 2000000 } }
)
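A rough sketch of why the nested layout helps: the field path is derived from the minute and second of the reading, and the worst-case walk shrinks from thousands of fields to roughly a hundred. The helpers below are hypothetical, and the step counts assume a simple linear scan over the fields at each level, which is a simplification of how the storage layer actually behaves.

```javascript
// Dotted field path for a reading inside an hourly, minute-nested document (UTC).
// Hypothetical helper.
function nestedPath(date) {
  return "values." + date.getUTCMinutes() + "." + date.getUTCSeconds();
}

// Fields visited to reach a given second in the flat hourly layout,
// assuming a linear scan from the first field.
function flatSteps(secondOfHour) {
  return secondOfHour + 1;
}

// Fields visited in the nested layout: scan to the minute, then to the second.
function nestedSteps(minute, second) {
  return (minute + 1) + (second + 1);
}

const lastReading = new Date("2013-10-10T23:59:59.000Z");
const path = nestedPath(lastReading);
const flatWorstCase = flatSteps(3599);
const nestedWorstCase = nestedSteps(59, 59);
console.log(path, flatWorstCase, nestedWorstCase);
```

Under these assumptions the last second of the hour costs 3600 steps in the flat layout and 120 in the nested one.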

MMS Implementation

In MMS users have flexibility to view monitoring data at varying levels of granularity, using controls at the top of the monitoring page.

These controls inform the schema design for MMS, and how the data needs to be displayed. In MMS, different resolutions have corresponding range requirements: for example, if you specify that you want to analyze monitoring data at the granularity of “1 hr” instead of “1 min”, then the ranges also become less granular, changing from hours to days, weeks and months.

To satisfy this approach in a scalable manner and keep data retention easy to manage, MMS organizes monitoring data to be very efficient for reads by maintaining copies at varying degrees of granularity. The document model allows for efficient use of space, so the tradeoff is very reasonable, even for a system as large as MMS. As data ages out, collections that are associated with ranges of time are simply dropped, which is a very efficient operation. Collections are created to represent future ranges of time, and these will eventually be dropped as well. This cycle maintains a rolling window of history associated with the functionality provided by MMS.
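The rolling-window idea can be sketched as follows. The article does not describe how MMS names or sizes its collections, so everything here is an assumption for illustration: one collection per UTC month, a hypothetical `metrics_YYYY_MM` naming scheme, and hypothetical helpers `collectionNameFor` and `collectionsToDrop`.

```javascript
// Hypothetical naming scheme: one collection per UTC month, e.g. "metrics_2013_10".
function collectionNameFor(date) {
  const month = String(date.getUTCMonth() + 1).padStart(2, "0");
  return "metrics_" + date.getUTCFullYear() + "_" + month;
}

// Return the existing collections that fall outside the retention window.
// monthsToKeep counts the current month as one of the months kept.
function collectionsToDrop(existing, now, monthsToKeep) {
  const cutoff = new Date(
    Date.UTC(now.getUTCFullYear(), now.getUTCMonth() - (monthsToKeep - 1), 1)
  );
  const oldestKept = collectionNameFor(cutoff);
  // Zero-padded names sort chronologically, so string comparison is enough.
  return existing.filter((name) => name < oldestKept);
}

const existing = [
  "metrics_2013_06",
  "metrics_2013_07",
  "metrics_2013_08",
  "metrics_2013_09",
  "metrics_2013_10",
];
const toDrop = collectionsToDrop(existing, new Date("2013-10-10T23:06:37.000Z"), 3);
console.log(toDrop);
```

Each returned name would then be dropped as a whole collection, which avoids deleting aged-out documents one by one.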

In addition, to support the “avg/sec” display option the system also tracks the number of samples collected and the sum of all readings for each metric similar to the following example:

{
  timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
  num_samples: 58,
  total_samples: 108000000,
  type: "memory_used",
  values: {
    0: 999999,
    …  
    37: 1000000,
    38: 1500000,
    … 
    59: 1800000
  }
}

The fields “num_samples” and “total_samples” are updated as new readings are applied to the document:

db.metrics.update(
  {
    timestamp_minute: ISODate("2013-10-10T23:06:00.000Z"),
    type: "memory_used"
  },
  {
    $set: { "values.59": 2000000 },
    $inc: { num_samples: 1, total_samples: 2000000 }
  }
)

Computing the average/sec is straightforward and requires no counting or processing, just a single read to retrieve the data and a simple application-level operation to compute the average. Note that with this model we assume a consistent cadence of measurements - one per second - that we can simply aggregate at the top of the document to report a rolled-up average for the whole minute. Other models are possible that would support inconsistent measurements and flexible averages over different time frames.
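The application-level step is a single division over the two rolled-up fields. A minimal sketch, using the field names from the example document; the helper `averagePerSecond` is hypothetical, and returning `null` for an empty document is an assumption rather than documented MMS behavior.

```javascript
// Average reading for one document, from the rolled-up fields only.
// Hypothetical helper; returns null when no samples have been recorded.
function averagePerSecond(doc) {
  if (!doc.num_samples) {
    return null;
  }
  return doc.total_samples / doc.num_samples;
}

// Only the two summary fields are needed; the per-second values are not read.
const minuteDoc = { num_samples: 58, total_samples: 108000000 };
const average = averagePerSecond(minuteDoc);
console.log(average);
```

Because the sum and count are maintained on every write, the read path never has to iterate over the individual readings.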

Another optimization used in MMS is preallocating all documents for the upcoming time period; MMS never causes an existing document to grow or be moved on disk. A background task within the MMS application performs inserts of empty “shell” documents including the subdocument schema but with all zeroes for the upcoming time periods before they are recorded. With this approach fields are always incremented or set without ever growing the document in size, which eliminates the possibility of moving the document and the associated overhead. This is a major performance win and another example of ensuring in-place updates within the document-oriented design.
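A shell document of this kind could be built as in the sketch below. The article does not show the exact MMS shell layout, so this assumes the hourly, minute-nested structure from the earlier example, and the helper `buildShellDocument` is hypothetical.

```javascript
// Build an empty "shell" document for one hour with every reading set to zero.
// Hypothetical helper; assumes the hourly, minute-nested layout shown earlier.
function buildShellDocument(type, hourStart) {
  const values = {};
  for (let minute = 0; minute < 60; minute++) {
    values[minute] = {};
    for (let second = 0; second < 60; second++) {
      values[minute][second] = 0; // placeholder to be overwritten by $set later
    }
  }
  return { timestamp_hour: hourStart, type: type, values: values };
}

const shell = buildShellDocument("memory_used", new Date("2013-10-11T00:00:00.000Z"));

// Count the preallocated slots across all minutes.
const slotCount = Object.values(shell.values).reduce(
  (total, minute) => total + Object.keys(minute).length,
  0
);
console.log(slotCount);
```

Inserting such a document ahead of time means every later write replaces an existing number with another number, so the document keeps its size.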

Conclusion

MongoDB offers many advantages for storing and analyzing time series data, whether it’s stock ticks, tweets or MongoDB metrics. If you are using MongoDB for time series data analysis, we want to hear about your use case. Please continue the conversation by commenting on this post with your story.

More Information

This post was updated in September 2022 to include additional resources and updated links.
