i’m dipping my toe into time series collections and i’m afraid a subtle nuance is unclear to me about which data model is better.
as i understand it, the first block is the general model for a time series collection: some metadata and a bunch of values, with each time step as a single document.
as an alternative, one could store an array per timestamp and then use the aggregation framework to unwind that array, producing essentially the same output:
s> db.stats_proc_diskstats2.find()
[
  {
    timestamp: ISODate("2023-08-21T20:08:05.146Z"),
    metadata: { hostname: 'server1' },
    diskstats: [
      {
        md: { disk: 'sda' },
        reads_completed: 22893701,
        ...snipped a bunch of statistics...
      },
      ...snipped, repeats for all the disks in the server...
    ],
    _id: ObjectId("64e3c425011a5d2aaa109dce")
  }
]
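for reference, here’s a sketch of flattening the array model into per-disk documents with $unwind — the pipeline uses the collection and field names from the output above, and the plain-JS mimic of the unwind stage is only there to make the resulting shape visible (the sample values are illustrative):

```javascript
// Aggregation pipeline that turns one document-per-timestamp-with-array
// into one document per (timestamp, disk), mirroring the per-step model.
const pipeline = [
  { $unwind: "$diskstats" },
  { $project: {
      timestamp: 1,
      hostname: "$metadata.hostname",
      disk: "$diskstats.md.disk",
      reads_completed: "$diskstats.reads_completed"
  } }
];
// In mongosh this would run as:
//   db.stats_proc_diskstats2.aggregate(pipeline)

// Pure-JS mimic of the $unwind + $project stages, to show the output shape:
function unwindDiskstats(doc) {
  return doc.diskstats.map(d => ({
    timestamp: doc.timestamp,
    hostname: doc.metadata.hostname,
    disk: d.md.disk,
    reads_completed: d.reads_completed
  }));
}

const sample = {
  timestamp: new Date("2023-08-21T20:08:05.146Z"),
  metadata: { hostname: "server1" },
  diskstats: [
    { md: { disk: "sda" }, reads_completed: 22893701 },
    { md: { disk: "sdb" }, reads_completed: 1024 }
  ]
};

const flat = unwindDiskstats(sample);
// flat holds two documents, one per disk
```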
the first is surely easier to query, and not unwinding an array during a query probably puts less load on the mongo server. but the question is, which one is better?
to frame the scope of that bold question: this collection would number into the hundreds of millions of documents. i’m collecting disk performance statistics every five minutes from hundreds of servers, with a 90-day retention.
i’m sure there are processing/storage trade-offs inside mongo for each approach, but my knowledge of the mongo internals is still pretty light.
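to make that scale concrete, a rough back-of-envelope count — the 300 servers and 10 disks per server are illustrative guesses, not my actual numbers:

```javascript
// samples per series: one every five minutes for 90 days
const samplesPerDay = (24 * 60) / 5;          // 288
const samplesPerSeries = samplesPerDay * 90;  // 25,920

// illustrative fleet size (assumptions, not real figures)
const servers = 300;
const disksPerServer = 10;

// per-disk model: one document per (timestamp, server, disk)
const perDiskDocs = servers * disksPerServer * samplesPerSeries; // 77,760,000

// array model: one document per (timestamp, server)
const arrayModelDocs = servers * samplesPerSeries; // 7,776,000
```

so the per-disk model lands in the high tens of millions under these assumptions, roughly an order of magnitude more documents than the array model.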
When deciding between organising your data as one document per timestamp or using the array model, your choice should be guided by how you plan to retrieve and work with your data.
If you prefer straightforward and easy queries, creating a document for each timestamp is a good option. This approach simplifies your queries and avoids unnecessary complexity and processing. It’s like having individual files for each moment in time, making it easier to find and access specific data.
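As a concrete illustration, a per-timestamp-per-disk layout can be queried with a plain find and no $unwind. The collection name `stats_proc_diskstats1` and the `metadata.disk` field are assumptions about how such a collection might be laid out:

```javascript
// One document per (timestamp, disk): a lookup is a plain filter, no $unwind.
const query = {
  "metadata.hostname": "server1",
  "metadata.disk": "sda",
  timestamp: { $gte: new Date("2023-08-21T00:00:00Z") }
};
// In mongosh this would run as:
//   db.stats_proc_diskstats1.find(query)

// Plain-JS mimic against two sample documents, to show the filter's effect:
const docs = [
  { timestamp: new Date("2023-08-21T20:08:05Z"),
    metadata: { hostname: "server1", disk: "sda" }, reads_completed: 22893701 },
  { timestamp: new Date("2023-08-21T20:08:05Z"),
    metadata: { hostname: "server1", disk: "sdb" }, reads_completed: 1024 }
];
const hits = docs.filter(d =>
  d.metadata.hostname === query["metadata.hostname"] &&
  d.metadata.disk === query["metadata.disk"] &&
  d.timestamp >= query.timestamp.$gte
);
// hits contains only the sda document
```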
On the other hand, if you’re concerned about storage and want to keep your collection from growing excessively over time, the array model might be more suitable. With this approach, related measurements are stored together in a single document, which helps manage storage space. However, it comes with a trade-off: writing the more complex queries needed to extract specific information from the array can be challenging.
So the choice ultimately depends on your data retrieval needs and how you want to balance simplicity of querying against efficient data storage.
Yes, you are right that this involves a processing-time trade-off, but you can make the choice based on how you would like to process the queries.
That said, time series collections are designed to handle a large number of documents, so the collection size itself should not be a concern.
Our recommendation is to configure automatic expiry on the time series collection (the `expireAfterSeconds` option, which works like a TTL index), so that old data is removed for you and space is cleared automatically.
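A sketch of what that could look like at creation time — the collection name and granularity are placeholders, and the 90-day retention from your description is expressed in seconds:

```javascript
// 90-day retention expressed in seconds for expireAfterSeconds
const ninetyDaysInSeconds = 90 * 24 * 60 * 60; // 7,776,000

// In mongosh, a time series collection with automatic expiry could be
// created like this (collection name and granularity are illustrative):
//   db.createCollection("stats_proc_diskstats", {
//     timeseries: {
//       timeField: "timestamp",
//       metaField: "metadata",
//       granularity: "minutes"
//     },
//     expireAfterSeconds: ninetyDaysInSeconds
//   })
```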
Please feel free to reach out in case of further questions.