I have a collection that receives data from sensors every minute. We are not interested in storing every minute of data, only the latest reading for each hour. To do this, I create a new document when a new hour begins and keep updating it until the next hour starts, then I create a new document and start updating that one instead. Example: it is now 17:29 and a reading comes in, so I update the document with the timestamp "2024-09-03T17:00:00.000Z"; I keep doing this until 18:00, then I create a document with the timestamp "2024-09-03T18:00:00.000Z" and start overwriting the data in it, and so on.
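A minimal sketch of what I do today, assuming pymongo and made-up database/collection/field names (metrics.hourly_readings, sensorId, ts, data), just to illustrate the pattern:

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient()  # assuming a local instance
coll = client["metrics"]["hourly_readings"]  # hypothetical db/collection names


def record_reading(sensor_id: str, payload: dict) -> None:
    """Keep only the latest reading per sensor per hour."""
    now = datetime.now(timezone.utc)
    # Truncate the timestamp to the start of the current hour...
    hour = now.replace(minute=0, second=0, microsecond=0)
    # ...and upsert: the first reading of the hour creates the document,
    # every later reading in the same hour overwrites its data.
    coll.update_one(
        {"sensorId": sensor_id, "ts": hour},
        {"$set": {"data": payload, "updatedAt": now}},
        upsert=True,
    )
```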
Now I want to migrate this collection to a time series collection, but I have some questions.
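For context, this is roughly how I plan to create the new collection (again a sketch with the same made-up names; the two variants correspond to the questions below):

```python
from pymongo import MongoClient

client = MongoClient()
db = client["metrics"]  # hypothetical database name

# Variant 1: let MongoDB derive the bucket boundaries from a named granularity.
db.create_collection(
    "readings_ts",
    timeseries={
        "timeField": "ts",
        "metaField": "sensorId",
        "granularity": "hours",  # implies bucketMaxSpanSeconds = 2592000 (30 days)
    },
)

# Variant 2 (what I tried): set the bucket parameters explicitly instead of a
# granularity. As far as I know this needs MongoDB 6.3+ and cannot be combined
# with the "granularity" option.
db.create_collection(
    "readings_ts_custom",
    timeseries={
        "timeField": "ts",
        "metaField": "sensorId",
        "bucketMaxSpanSeconds": 3600,
        "bucketRoundingSeconds": 3600,
    },
)
```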
- What granularity should I choose for the new time series collection? Will "hours" be OK?
Or should I set bucketMaxSpanSeconds and bucketRoundingSeconds manually? I tried setting them to 3600 (1 hour = 60 * 60 seconds), but the result was unsatisfactory: queries became much slower and the collection grew in size.
When I choose "hours", everything becomes much better. But then bucketMaxSpanSeconds is 2592000 (= 30 days). Will that work for my scenario? Keep in mind that one sensor generating data every minute produces 30 * 24 * 60 = 43200 records in 30 days.
- Since we don't need the intermediate data, only the latest reading for each hour, is it worthwhile to optimize disk usage by deleting the intermediate documents? That is, instead of storing 60 documents per hour, we would store only 1. Would this have the same effect as deleting 59 documents in a regular collection? I ask because I cannot update documents in a time series collection; the only way is to delete the obsolete ones and insert updated ones instead (see the sketch after this list), which would generate a lot of redundant I/O. Is it worth it, or is it better to just store all the documents as they come in and hope that the database optimizes this?
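To make the second question concrete, the delete-and-reinsert approach I have in mind would look roughly like this (only a sketch; as far as I know, deleting time series documents by fields other than the metaField also requires a fairly recent server version, 7.0 if I'm not mistaken):

```python
from datetime import datetime, timezone

from pymongo import MongoClient

client = MongoClient()
ts_coll = client["metrics"]["readings_ts"]  # hypothetical time series collection


def record_reading_ts(sensor_id: str, payload: dict) -> None:
    """Emulate 'latest reading per hour' in a time series collection."""
    now = datetime.now(timezone.utc)
    hour = now.replace(minute=0, second=0, microsecond=0)
    # Time series documents cannot be updated in place, so first remove any
    # earlier reading from the same sensor in the current hour...
    ts_coll.delete_many({"sensorId": sensor_id, "ts": {"$gte": hour, "$lt": now}})
    # ...then insert the new measurement with the current timestamp.
    ts_coll.insert_one({"sensorId": sensor_id, "ts": now, "data": payload})
```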