This is really a best-practice question, or a kind of brainstorming in case this is new territory…
A system saves metrics in a timeseries collection, let’s say 20 million or more per month. The data life cycle requires getting rid of aged data, which could be achieved with the expireAfterSeconds option or online archiving. There are also requirements to run analytics on this data over bigger time windows, e.g. to find trends or compare to the last quarter / year…
If the data is deleted we are out of luck, and a high volume in a Data Lake is not good for performance.
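To make the retention side concrete, here is a minimal sketch of the options for a timeseries collection with a TTL. The field names `ts` and `meta`, the collection name `metrics`, and the 30-day retention are assumptions for illustration only, not part of the actual system.

```python
from datetime import timedelta


def timeseries_options(retention, time_field="ts", meta_field="meta",
                       granularity="minutes"):
    """Build create-collection options for a timeseries collection with a TTL.

    All field names here are hypothetical placeholders.
    """
    return {
        "timeseries": {
            "timeField": time_field,
            "metaField": meta_field,
            "granularity": granularity,
        },
        # Raw measurements older than this are removed automatically.
        "expireAfterSeconds": int(retention.total_seconds()),
    }


opts = timeseries_options(timedelta(days=30))
# With pymongo this would be passed along roughly as:
#   db.create_collection("metrics", **opts)
```

The retention chosen here also caps how late a measurement can arrive and still be recomputed from raw data.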
Solution before timeseries collections
Back in the times before timeseries collections, I would have used the bucketing pattern with the precompute pattern on top, writing the precomputed data to a further collection so that I could purge the fine-grained data in my buckets.
How to do this (better) with timeseries collections?
Timeseries collections do a great job of bucketing the data and making it accessible. What would be the smartest way to build some precomputing on top, to gather data that is more highly aggregated and therefore much smaller in terms of volume?
The goals are:
keep the granular timeseries data for a short time to run small window analytics
aggregate the data / precompute for analytics over bigger time windows
purge granular timeseries data and keep aggregated data
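One possible shape for these three goals is a scheduled rollup: aggregate a closed time window from the timeseries collection and `$merge` the result into a regular collection that outlives the raw data. The sketch below only builds the pipeline. The names `ts`, `meta.deviceId`, `value` and `metrics_hourly` are assumptions, and this is one option rather than an established best practice.

```python
from datetime import datetime, timezone


def build_rollup_pipeline(window_start, window_end, unit="hour",
                          target="metrics_hourly"):
    """Pipeline that rolls raw measurements up per device and time unit.

    The window should be aligned to `unit` boundaries so that every
    rollup document is recomputed from a complete set of raw points.
    """
    return [
        {"$match": {"ts": {"$gte": window_start, "$lt": window_end}}},
        {"$group": {
            "_id": {
                "device": "$meta.deviceId",
                "bucket": {"$dateTrunc": {"date": "$ts", "unit": unit}},
            },
            # Keep count and sum (not just avg) so rollups can be
            # combined again into coarser levels later.
            "count": {"$sum": 1},
            "sum": {"$sum": "$value"},
            "min": {"$min": "$value"},
            "max": {"$max": "$value"},
        }},
        # Replacing on _id makes re-running a window idempotent.
        {"$merge": {
            "into": target,
            "on": "_id",
            "whenMatched": "replace",
            "whenNotMatched": "insert",
        }},
    ]


start = datetime(2024, 1, 9, 8, tzinfo=timezone.utc)
end = datetime(2024, 1, 9, 9, tzinfo=timezone.utc)
pipeline = build_rollup_pipeline(start, end)
# With pymongo: db.metrics.aggregate(pipeline)
```

Because the `$merge` replaces by `_id`, the same window can be run again after late data shows up, as long as the raw data for that window has not been purged yet.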
Make things real life
Let’s assume the data is not a constant stream. Instead, we have offline devices which send batches of metrics when the devices come online… So we deal with a lot of late arrivals.
Any thoughts on this? Are there already some best practices? I am pretty sure that I am not the first person to think about this.
a high volume in a Data Lake is not good for performance.
Modelling your aggregated data to support efficient historical queries could help with some of the performance concerns, but you will have to decide on the right balance of performance versus cost for your use case. If you are using Atlas Data Lake and Online Archive, you can adjust the archiving rules and partitions in your online archive to more efficiently support common queries.
Solution before timeseries collections
Since Time Series Collections are a relatively new feature with some limitations in MongoDB 5.0, they may not entirely replace previous solutions where you have more granular control over data modelling.
However, there are further Time Series enhancements on the way and any feedback on use cases that could be better addressed is extremely helpful.
FYI the MongoDB 5.1 Rapid Release improves Time Series Collections with support for sharding, densification, and delete operations. There is also a preview of time series support in Atlas Online Archive (requires a MongoDB 5.0+ cluster).
I think late arrivals should be compatible with materialised views as long as you can recalculate and merge aggregated data. Is there a maximum delay you would tolerate for late arrivals? Depending on the pre-aggregation calcs you are performing, allowed tolerance for late arrivals may determine the TTL on your original data.
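As a rough illustration of "recalculate and merge", a late batch can be reduced to the set of time windows it touches, and only those windows are recomputed. This is a sketch under assumptions: hourly windows, a hypothetical 7-day tolerance, and raw data that is kept at least that long.

```python
from datetime import datetime, timedelta, timezone


def affected_windows(timestamps, now, max_delay=timedelta(days=7)):
    """Return the sorted hourly windows (start, end) touched by a batch.

    Measurements older than `max_delay` are skipped, because the raw
    data for their window may already have expired.
    """
    cutoff = now - max_delay
    starts = set()
    for ts in timestamps:
        if ts < cutoff:
            continue
        # Truncate to the start of the hour the measurement belongs to.
        starts.add(ts.replace(minute=0, second=0, microsecond=0))
    return [(s, s + timedelta(hours=1)) for s in sorted(starts)]


now = datetime(2024, 1, 10, 12, tzinfo=timezone.utc)
batch = [
    datetime(2024, 1, 9, 8, 15, tzinfo=timezone.utc),
    datetime(2024, 1, 9, 8, 45, tzinfo=timezone.utc),
    datetime(2024, 1, 9, 10, 5, tzinfo=timezone.utc),
    datetime(2024, 1, 1, 0, 0, tzinfo=timezone.utc),  # beyond the tolerance
]
windows = affected_windows(batch, now)
```

Each returned window would then be re-aggregated and merged into the pre-aggregated collection, so the work scales with the number of touched windows rather than with how far back the oldest late arrival goes.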
The focus of the above is how to build some precomputing on top of the new timeseries collections. This should achieve the following:
keep the fine-grained timeseries data for a certain time
keep more highly aggregated data for the long term
be able to purge / archive the fine-grained data
So the bottom line question is how to build a precomputing mechanism on top of the timeseries collections.
Materialized views are the general weapon for this. However: scheduled vs. triggered? Is there any new, interesting way to trigger them?
Taking late arrivals into consideration, materialized views might not be the best option: when one needs to go too far back into the past to catch all late arrivals, the compute time might exceed the allowed runtime of a triggered function.
Going back to the quote: are there any other smart ways to deal with this? E.g. timeseries collections already have some precomputed values per bucket, such as control.min and control.max. Are there concepts around to utilize these values, e.g. for computing a higher aggregation level?
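To sketch what I mean by reusing precomputed values: min, max, count and sum compose across levels, so a coarser aggregate (or a late batch) can be folded in without rescanning the raw points. An average does not compose on its own, so it has to be derived from sum and count. The code below is plain Python to illustrate the idea; it does not read the internal bucket format, and the `Partial` type is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Partial:
    """Pre-aggregated statistics for one device and time window."""
    count: int
    total: float
    minimum: float
    maximum: float

    @property
    def mean(self):
        # Derived on read; never stored, because means do not compose.
        return self.total / self.count


def combine(a, b):
    """Merge two partial aggregates into one covering both inputs."""
    return Partial(
        count=a.count + b.count,
        total=a.total + b.total,
        minimum=min(a.minimum, b.minimum),
        maximum=max(a.maximum, b.maximum),
    )


existing = Partial(count=2, total=10.0, minimum=3.0, maximum=7.0)
late_batch = Partial(count=3, total=30.0, minimum=1.0, maximum=20.0)
merged = combine(existing, late_batch)
```

The catch is that this only works for late arrivals that are strictly new data. If a device can resend measurements, the incremental merge double counts and a recompute of the window is needed instead.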
Timeseries collections can now utilize online archiving but, as noted in the initial post, archiving huge volumes and querying them on a Data Lake is very slow. This would require further precomputed data on the hot data side before the data gets archived, since online archiving (currently) has no logic to aggregate data. This would not be my preferred option. Before building pre-aggregation on a Data Lake I’d go for hot/warm/cold data and keep the pre-aggregation on the warm data side in a less powerful cluster…
Is anyone out there facing the same issues? Maybe we can identify some best practices here?