Time Series Data and MongoDB: Part 3 – Querying, Analyzing, and Presenting Time-Series Data


In Time Series Data and MongoDB: Part 1 – An Introduction we reviewed the key questions you need to ask to understand query access patterns to your database. In Time Series Data and MongoDB: Part 2 – Schema Design Best Practices we explored various schema design options for time-series data and how they affect MongoDB resources. In this blog post we will cover how to query, analyze, and present time-series data stored in MongoDB. Knowing how clients will connect to query your database will help guide you in designing both the data model and the optimal database configuration. There are multiple ways to query MongoDB. You can use native tools such as the MongoDB Shell command line and MongoDB Compass, a GUI-based query tool. Programmatically, MongoDB data is accessed via an extensive list of MongoDB drivers. There are drivers available for practically all the major programming languages, including C#, Java, NodeJS, Go, R, Python, Ruby, and many others.

MongoDB also provides third-party BI reporting tool integration through the use of the MongoDB BI Connector. Popular SQL-based reporting tools like Tableau, Microsoft PowerBI, QlikView, and TIBCO Spotfire can leverage data directly in MongoDB without the need to ETL data into another platform for querying. MongoDB Charts, currently in beta, provides the fastest way to visualize your MongoDB data without the need for third-party products or flattening your data so that it can be read by a SQL-based BI tool.

In this blog we will cover querying, analyzing and presenting time-series data using the tools described above.

Querying with Aggregation Framework

The MongoDB Aggregation Framework allows developers to express functional pipelines that perform preparation, transformations, and analysis of data. This is accomplished through the use of stages which perform specific actions like grouping, matching, sorting, or shaping the data. Data flowing through the stages and its corresponding processing is referred to as the Aggregation Pipeline. Conceptually, it is similar to the data flow through a Unix shell command line pipeline: each stage takes input from the previous stage, performs its work, and passes its output to the next stage until the pipeline ends. Figure 1 shows how data flows through a pipeline that consists of a match and a group stage.


Figure 1: Sample data flows through the Aggregation Pipeline

$match is the first stage in this two-stage pipeline. $match takes the entire orders collection as input and outputs the filtered list of documents whose "status" field contains the value "A". The second stage takes these filtered documents as input and groups the data to produce the desired query result as output. While this is a simple example, keep in mind that you can build out extremely sophisticated processing pipelines leveraging 100+ operators over 25 different stage classes, allowing you to do things like transformations, redacting, sorting, grouping, matching, faceted searches, graph traversals, and joins between different collections, to name a few. You can work with data in ways that are just impossible with other distributed databases.
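
As a concrete sketch of the Figure 1 pipeline in shell code (assuming, for illustration, that each order document carries a "cust_id" and an "amount" field), the two stages would look like this:

db.orders.aggregate([
    // Stage 1: keep only the documents whose status field contains "A"
    { $match: { status: "A" } },
    // Stage 2: group the filtered documents, here by customer, summing order amounts
    { $group: { _id: "$cust_id", total: { $sum: "$amount" } } }
])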

With our time-series data, we are going to use MongoDB Compass to issue an ad-hoc query that finds the day high price for a given stock. Compass is the GUI tool that allows you to easily explore your data. A useful feature is the ability to visually construct an aggregation pipeline by assembling stages onto a canvas, and then export the resultant pipeline as code for copying and pasting into your app.

Finding day high for a given stock

Before diving into the query itself, recall that our StockGen application, described in part 2 of this blog series, produced one month of stock price data for the 5 stocks we wanted to track. One of the two collections created is called "StockDocPerMinute" (PerMinute), and each of its documents represents one minute of data for a specific stock symbol, as shown in Figure 2.


{
    "_id" : ObjectId("5b57a8fae303d36d6df69cd3"),
    "p" : {
        "0" : 58.75,
        "1" : 58.75,
        "2" : 59.45,
        ...up to…
        "58" : 58.57,
        "59" : 59.01
    },
    "symbol" : "FB",
    "d" : ISODate("2018-07-14T00:00:00Z")
}

Figure 2: Sample StockDocPerMinute document

Consider the scenario where the application requests the day high price for a given stock ticker over time. Without the Aggregation Framework, this query would have to be accomplished either by retrieving all the data back into the app and using client-side code to compute the result, or by defining a map-reduce function in JavaScript. Neither option is optimal from a performance or developer productivity perspective.

Notice the sample document has a subdocument which contains data for an entire minute interval. Using the Aggregation Framework, we can easily process this subdocument by transforming it into an array with the $objectToArray expression, calculating the maximum value, and projecting the desired result.
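
As a quick sketch of what $objectToArray does before we build the full pipeline, here it is applied to a single document (output abbreviated, values taken from Figure 2):

db.StockDocPerMinute.aggregate([
    { $match: { "symbol": "FB" } },
    // Convert the "p" subdocument into an array of { k, v } pairs
    { $project: { "p": { $objectToArray: "$p" } } },
    { $limit: 1 }
])
// Each second now appears as a key/value pair that $max can iterate over:
// { "_id" : ..., "p" : [ { "k" : "0", "v" : 58.75 }, { "k" : "1", "v" : 58.75 }, ... ] }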

Using MongoDB Compass we can build out the query using the Aggregation Pipeline Builder as follows:


Figure 3: First stage is the $match stage

Figure 4: Second stage is the $project stage

Figure 5: Third stage is the $addFields stage

Figure 6: Fourth stage is the $group stage

Figure 7: Fifth stage is the $sort stage

We can see that the output of the last stage shows the maximum value for each day. Using the Aggregation Pipeline Builder, we didn't need to write a single line of code. For reference, the complete query built by MongoDB Compass in the previous figures is as follows:


db.getCollection('StockDocPerMinute').aggregate([
    { "$match": { "symbol": "FB" } },
    { "$project": {
        "year": { "$year": "$d" },
        "month": { "$month": "$d" },
        "dayOfMonth": { "$dayOfMonth": "$d" },
        "p": { "$objectToArray": "$p" },
        "symbol": 1
    } },
    { "$addFields": { "tmax": { "$max": "$p.v" } } },
    { "$group": {
        "_id": {
            "year": "$year",
            "month": "$month",
            "dayOfMonth": "$dayOfMonth"
        },
        "DayMax": { "$max": "$tmax" }
    } },
    { "$sort": { "_id": -1 } }
])
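
Using the sample document from Figure 2, and assuming 59.45 is that day's high, the July 14 entry in the result would take this shape:

{ "_id" : { "year" : 2018, "month" : 7, "dayOfMonth" : 14 }, "DayMax" : 59.45 }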

Leveraging Views

MongoDB read-only views can be created from existing collections or from other views. These views act as read-only collections and are computed on demand during read operations. Since a view appears as just another collection, you can add a layer of security by restricting access to the underlying collections of the view and giving the client read access only to the view. If you want to learn more about access control with views, read the blog post, “Providing Least Privilege Access to MongoDB data”.
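
As a minimal sketch of that security layer (assuming a database named Stock, the ViewStock view we create below, and hypothetical role and user names), you could grant a client find access to the view only:

db.createRole({
    role: "viewStockReader",   // hypothetical role name
    privileges: [
        // Allow reads on the view, but not on the underlying collection
        { resource: { db: "Stock", collection: "ViewStock" }, actions: [ "find" ] }
    ],
    roles: []
})
db.createUser({
    user: "reportingClient",   // hypothetical reporting user
    pwd: "changeThisPassword",
    roles: [ { role: "viewStockReader", db: "Stock" } ]
})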

To see how views are created, consider the scenario where the user wants to query stock price history. We can create a view on the StockDocPerMinute collection using the createView syntax as follows:


db.createView("ViewStock","StockDocPerMinute", 
[
{"$project": {"p": {"$objectToArray": "$p"}, "d":1, "symbol":1}},
{"$unwind": "$p"},
{"$project": {"_id": 0, "symbol":1,"Timestamp": {
   "$dateFromString": {"dateString":
                           {"$concat": [{"$dateToString":
                                             {"format": "%Y-%m-%dT%H:%M:",
                                              "date": "$d"}},
                                        {"$concat": ["$p.k", "Z"]}]}}}, 
                                        "price":"$p.v"}}
                                    ])
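
After the $unwind stage, the view exposes one document per symbol per second. Using the Figure 2 sample data, a document returned from ViewStock would look something like this:

{ "symbol" : "FB", "Timestamp" : ISODate("2018-07-14T00:00:58Z"), "price" : 58.57 }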

As MongoDB read-only views are materialized at run-time, the latest results are available with each query. Now that the view is defined, it can be accessed just like any other collection. For example, to query the first price entry for the “FB” stock using the view we can issue:


db.ViewStock.find({ "symbol": "FB" }).sort({ "Timestamp": 1 }).limit(1)

You can also use the aggregation framework with views. Here is the query for all “FB” stock ticker data for a specific day.


db.ViewStock.aggregate([
    { "$match": {
        "symbol": "FB",
        "Timestamp": { "$gte": ISODate("2018-06-26"), "$lt": ISODate("2018-06-27") }
    } },
    { "$sort": { "Timestamp": -1 } }
])

Querying time-series data with third-party BI Reporting tools

Users may want to leverage existing investments in third-party Business Intelligence reporting and analytics tools. To enable these SQL-speaking tools to query data in MongoDB, you can use an intermediary service called the MongoDB BI Connector.


Figure 8: Query MongoDB data with your favorite SQL-based reporting tools using the BI Connector

The BI Connector service presents a port that resembles a MySQL server to the client application and accepts client connections issuing SQL queries. The BI Connector service then translates these queries into the MongoDB Query Language (MQL) and submits them to the MongoDB database. Results are returned from MongoDB, flattened into a tabular structure, and sent back to the SQL-speaking client. This flow is shown in detail in Figure 8.

To illustrate the MongoDB BI Connector in action, let’s consume our time-series data with Tableau Desktop and the MongoDB BI Connector. Tableau Desktop has a connection option for MongoDB. Using that option and connecting to the port specified in the BI Connector, we see that Tableau enumerates the list of tables from our MongoDB database, as shown in Figure 9.


Figure 9: Data Source view in Tableau showing information returned from MongoDB BI Connector

These tables are really our collections in MongoDB. Moving to the Worksheet view in Tableau, we can build out a report showing price over time using the view we created earlier in this document.


Figure 10: Sample Tableau worksheet showing price over time

MongoDB Charts

The fastest way to visualize data in MongoDB is with MongoDB Charts. Currently available in beta, it provides users with a web console where they can build and run reports directly from data stored in MongoDB. With Charts there is no special service that needs to run in order to query MongoDB. There is also no need to move the data out or transform it into a different format to be queried. Data can be queried directly as rich documents stored in MongoDB. As with other read-only connections, you can connect Charts to secondary replica nodes, isolating analytics and reporting queries from the rest of the cluster serving the operational time-series apps. To see how MongoDB Charts can represent data from the StockGen tool, check out the price over time line graph shown in Figure 11.
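
Routing reads to secondaries is a standard read preference setting. A sketch of a connection string that directs a read-only client to replica set secondaries (hostnames and replica set name here are hypothetical) looks like this:

mongodb://host1.example.net,host2.example.net,host3.example.net/Stock?replicaSet=rs0&readPreference=secondary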


Figure 11: MongoDB Charts showing price over time

At the time of this writing MongoDB Charts is in beta, so details and screenshots may vary from the final release.

Analytics with MongoDB

In addition to issuing advanced analytics queries with the MongoDB Aggregation Framework, the MongoDB Connector for Apache Spark exposes all of Spark’s libraries to Scala, Java, Python, and R. This enables you to use the Spark analytics engine for big data processing of your time-series data, extending the analytics capabilities of MongoDB even further to perform real-time analytics and machine learning. The connector materializes MongoDB data as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs. The Spark connector takes advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data you need! No wasted time extracting and loading data into another database in order to query your MongoDB data using Spark!


Figure 12: Spark connector for MongoDB

The MongoDB R driver provides developers and statisticians a first-class experience with idiomatic, native language access to MongoDB, enterprise authentication, and full support for BSON data types. Using the extensive libraries available to R, you can query MongoDB time-series data and compute a locally weighted regression, as seen in Figure 13.


Figure 13: Scatter Plot showing price over time and smoothing of per second data

The R driver for MongoDB is available via CRAN, the Comprehensive R Archive Network. Once installed, you can connect to your MongoDB database and return data frames that can be used in R calculations. The above graph was produced with the following code using RStudio:


library(mongolite)

# Connect to the Stock database (assumes mongod is listening on localhost:27020)
ts <- mongo(collection = "StockDocPerSecond", db = "Stock", url = "mongodb://localhost:27020")

# Query the per-second FB ticker data, returning only the date and price fields
fbquery <- ts$find('{ "symbol": "FB" }', fields = '{ "_id": 0, "d": 1, "p": 1 }')

# Scatter plot of price over time
plot(fbquery$d, fbquery$p, xlab = "Date", ylab = "Price", main = "Price over Time")

# Overlay a locally weighted (LOWESS) regression smoothing line
lines(lowess(fbquery$d, fbquery$p), col = 2)

Summary

While not all data is time-series in nature, a growing percentage of it can be classified as time-series – fueled by technologies that allow us to exploit streams of data in real time rather than in batches. In every industry and in every company there exists the need to query, analyze, and report on time-series data. Real business value is derived from the analytics and insights gained from the data. MongoDB enables you to collect, analyze, and act on every piece of time-series data in your environment. In this three-part series we covered some thought-provoking questions with respect to your specific application requirements. In the second blog post we looked at a few different time-series schema designs and their impact on MongoDB performance. Finally, we concluded the series by showing how to query time-series data using the MongoDB Aggregation Framework and MongoDB Compass, as well as other methods like using the BI Connector and analytical languages like R.

Prototypes are one thing, but handling terabytes of data efficiently is on a different playing field. With MongoDB it is easy to horizontally scale out time-series workloads. Through the use of replica sets, read-only clients can connect to replica set secondaries to perform their queries, leaving the primary to focus on writes. Write-heavy workloads can scale horizontally via sharding. While an in-depth analysis of MongoDB architecture is out of scope for these blog posts, you can find lots of useful information in the MongoDB Architecture whitepaper.

The Internet of Things (IoT) use case generates a plethora of time-series data. Larger IoT solutions involve supporting a variety of hardware and software devices for data ingestion, real-time and historical analysis, security, high availability, and managing time-series data at scale, to name a few requirements. MongoDB is powering mission-critical IoT applications worldwide. For more information on IoT at MongoDB, check out the Internet of Things website.