Schema Design for high frequency time series

Clay_Smyth · April 3, 2022, 9:44pm

Hi! I’m new to MongoDB and would love advice on an appropriate schema design for our use case. Please excuse me if this post is inappropriate or off-topic.

I am an electrophysiologist, and our lab collects high-frequency voltage time-series data sampled from biological systems. Essentially, we run ‘experimental sessions’, where the data collected from each session is voltage measurements sampled at 250, 500, or 1000 Hz, along with other quantitative measurements. Each experimental session can last up to a few hours (collecting maybe a few gigabytes of data) and has a description of the experimental setup (e.g. the experimenter, the device used for collection, the biological system measured, etc.).

I am considering creating a time-series collection, where each document contains fields that describe the experimental setup, and also contains a data field. The data field will contain subdocuments, where each subdocument is a datapoint from that experimental session.

For example:

{
    "session#": 1,
    "metadata": {
        "experimenter": "Tony",
        "sample_rate": "500",
        "system#": 12
    },
    "data": [
        {
            "timeStamp": UnixTimeStamp1,
            "voltage": 1e-3,
            "PowerBand1": 1e-5
        },
        {
            "timeStamp": UnixTimeStamp2,
            "voltage": 1.5e-3,
            "PowerBand1": 2.3e-5
        }
    ]
}

I’ll often query on ‘session#’, on metadata like ‘system#’, or for the voltage where ‘PowerBand1’ == SomeValue. Is this a good use of a time series collection?

Alternatively, I could create a normal collection, where each document is a session, and store the data as a Parquet file in a ‘data’ field of each document. Is there a best way to proceed? Any advice is welcome, thank you!

Prasad_Saya · April 4, 2022, 3:32am

Hello @Clay_Smyth, here is some additional information you can use.

The most important thing about any data and its application is knowing how the data will be used. In this case that means the kinds of queries you run and their filters (a filter specifies the fields a query matches on, e.g., { session_no : 1 }), especially for the most important and most frequently used queries. For efficient querying, you define an index on the filter field, or on multiple fields, depending on the query filter. Indexes on multiple fields are referred to as compound indexes.

Indexes can be defined on any kind of collection, time series or otherwise.
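
A minimal sketch of a compound index for the queries described in the question, assuming the pymongo driver and the hypothetical field names session_no and metadata.system_no (standing in for ‘session#’ and ‘system#’). The key list uses pymongo's (field, direction) form, and ensure_index is an illustrative helper, not part of any library:

```python
# Compound index keys in pymongo's (field, direction) form; 1 means ascending.
# Field names are assumptions chosen for illustration.
SESSION_SYSTEM_INDEX = [("session_no", 1), ("metadata.system_no", 1)]

# A filter on both fields, or on session_no alone (the index prefix),
# is the kind of query this index is meant to support.
QUERY_FILTER = {"session_no": 1, "metadata.system_no": 12}


def ensure_index(collection):
    """Create the compound index on a pymongo Collection and return its name."""
    return collection.create_index(SESSION_SYSTEM_INDEX)


# With a running server (not executed here):
#   ensure_index(db["sessions"])
#   db["sessions"].find(QUERY_FILTER)
```

The field order matters: putting session_no first lets the same index serve queries that filter on the session alone.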

Also, refer to the MongoDB blog post Building with Patterns: The Bucket Pattern, which discusses storing time-series data and says:

This pattern is particularly effective when working with Internet of Things (IoT), Real-Time Analytics, or Time-Series data in general. By bucketing data together we make it easier to organize specific groups of data, increasing the ability to discover historical trends or provide future forecasting and optimize our use of storage. …
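
To make the bucketing idea concrete, here is a rough sketch that groups a session's samples into fixed-width time buckets in a regular collection. The function bucket_samples, the one-second default bucket width, and the field names session_no, start, count, and samples are all assumptions for illustration, not taken from the blog post:

```python
from collections import defaultdict


def bucket_samples(session_no, samples, bucket_seconds=1):
    """Group samples into one document per time window (the bucket pattern)."""
    buckets = defaultdict(list)
    for sample in samples:
        # Round the Unix timestamp down to the start of its bucket.
        start = int(sample["timeStamp"] // bucket_seconds) * bucket_seconds
        buckets[start].append(sample)
    return [
        {"session_no": session_no, "start": start,
         "count": len(group), "samples": group}
        for start, group in sorted(buckets.items())
    ]


samples = [
    {"timeStamp": 10.0, "voltage": 1e-3},
    {"timeStamp": 10.5, "voltage": 1.5e-3},
    {"timeStamp": 11.2, "voltage": 0.9e-3},
]
bucket_docs = bucket_samples(1, samples)
```

Each bucket document stays small and bounded, which avoids one multi-gigabyte document per session while still keeping neighbouring samples together.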

You are storing data based on a time field. A time series collection supports this, and you can take advantage of some built-in optimizations. Further, you don’t have to think much about designing a schema.
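
One point worth sketching: a time series collection stores one document per measurement rather than a nested data array, so the session document from the question would be flattened before insertion. The sketch below assumes pymongo, numeric Unix timestamps in seconds, and the hypothetical names to_measurements, timestamp, session_no, and system_no. The timeField, metaField, and granularity options are the ones accepted when creating a time series collection (MongoDB 5.0 or later):

```python
from datetime import datetime, timezone

# Options for creating the time series collection.
TS_OPTIONS = {"timeField": "timestamp", "metaField": "metadata",
              "granularity": "seconds"}


def to_measurements(session):
    """Turn one nested session document into one flat document per sample."""
    # The session number travels with the metadata so it can be filtered on.
    meta = dict(session["metadata"], session_no=session["session_no"])
    docs = []
    for sample in session["data"]:
        doc = {k: v for k, v in sample.items() if k != "timeStamp"}
        # The time field must be a date, not a raw number.
        doc["timestamp"] = datetime.fromtimestamp(sample["timeStamp"],
                                                  tz=timezone.utc)
        doc["metadata"] = meta
        docs.append(doc)
    return docs


session = {
    "session_no": 1,
    "metadata": {"experimenter": "Tony", "sample_rate": 500, "system_no": 12},
    "data": [
        {"timeStamp": 1649022240.000, "voltage": 1e-3, "PowerBand1": 1e-5},
        {"timeStamp": 1649022240.002, "voltage": 1.5e-3, "PowerBand1": 2.3e-5},
    ],
}
measurements = to_measurements(session)

# With pymongo and a running server (not executed here):
#   db.create_collection("samples", timeseries=TS_OPTIONS)
#   db["samples"].insert_many(measurements)
```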