Need suggestion to store click stream data

I am trying to save click stream data in a time series collection. There are some parameters which I want to put in the metadata, as below:

  1. Page - Page from which the click happened
  2. App - Application from which the click happened
  3. UserId - User who clicked
  4. Type - ButtonClick, LinkClick, etc.
  5. Session Id - User's session ID

I am trying to model the data so that it supports queries like the following:

  1. The number of clicks per click type on a page
  2. Average number of sessions per day
  3. Number of users who clicked a button on a page / across all pages

Example Document

{
  "timestamp": {
    "$date": {
      "$numberLong": "1697522494362"
    }
  },
  "meta": {
    "session": "ccb8277b-2598-45ca-9edc-f4ad0726cf36",
    "page": "/home",
    "app": "4c1780ef-1cc3-4bba-95d5-4767ceb6c16c",
    "type": "button_click",
    "userId": "3e084afd-787f-4a11-a0b8-bbc01fa88e70"
  },
  "count": 2
}
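
For example, query 1 above can be expressed roughly like this in mongosh (a sketch; the collection name clicks is just a placeholder):

db.clicks.aggregate([
  // limit to one page, then count clicks per click type,
  // summing the pre-aggregated "count" field
  { $match: { "meta.page": "/home" } },
  { $group: { _id: "$meta.type", clicks: { $sum: "$count" } } }
])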

With appropriate indexes I am able to get good query performance, but I wanted a suggestion on whether the meta fields I selected are appropriate, since they may create too many buckets.
My application has around 30K active users, each of whom will visit at least one page every day.
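
For reference, the indexes I am using are roughly along these lines (a sketch; the exact field combinations are just examples of what I have tried):

db.clicks.createIndex({ "meta.page": 1, "meta.type": 1, "timestamp": 1 })
db.clicks.createIndex({ "meta.userId": 1, "timestamp": 1 })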

Since type and sessionId are unique, can I move the other fields outside meta and use them as part of the aggregation pipeline's match filters?

{
  "timestamp": {
    "$date": {
      "$numberLong": "1697522494362"
    }
  },
  "meta": {
    "session": "ccb8277b-2598-45ca-9edc-f4ad0726cf36",
    "type": "button_click"
  },
  "count": 2,
  "page": "/home",
  "app": "4c1780ef-1cc3-4bba-95d5-4767ceb6c16c",
  "userId": "3e084afd-787f-4a11-a0b8-bbc01fa88e70"
}
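
With this layout, the same kind of query would have to filter on the top-level fields instead of meta, something like this (sketch):

db.clicks.aggregate([
  // "page" is no longer part of meta, so it is matched as a regular measurement field
  { $match: { page: "/home" } },
  { $group: { _id: "$meta.type", clicks: { $sum: "$count" } } }
])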

Also, from a storage perspective, which design is optimal?

Since meta is stored only once per bucket and only the non-meta fields are repeated in each document, keeping page, app, and userId outside meta will increase the size of each individual document.

I am inclined towards storing the data inside meta, but wanted to know if having too many buckets will be a performance hit as the number of documents increases.

I am expecting around 6-8M documents per day, and the data will be retained for a maximum of around 45 days.

Hi @Abhishek_Gn and welcome to the MongoDB community forums!

The granularity field in the time series collection definition plays an important role in how buckets are formed, and ultimately affects the performance of the application.
Choosing granularity is often a trade-off between storage and query efficiency: if you store with a very coarse granularity, you may have to scan more data when reading.
Also, if you are on version 6.3 or above, you can make use of bucketMaxSpanSeconds and bucketRoundingSeconds to specify bucket boundaries and more accurately control how time series data is bucketed.
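
For illustration, a time series collection created with explicit bucket boundaries would look roughly like this (collection and field names taken from your example; the 3600-second value is only illustrative):

db.createCollection("clicks", {
  timeseries: {
    timeField: "timestamp",
    metaField: "meta",
    // bucketMaxSpanSeconds and bucketRoundingSeconds must be set to the same value
    // and are used instead of the granularity option (MongoDB 6.3+)
    bucketMaxSpanSeconds: 3600,
    bucketRoundingSeconds: 3600
  }
})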

However, this would also depend on your use cases and the queries that you are trying to perform on the collections.

If you consistently insert data with highly varied metaField values, the number of buckets will increase and you may not fully utilise the storage capacity of each bucket, which can lead to increased query latency and decreased throughput.

Based on the two designs that you have shared and the granularity defined, you should choose the schema that results in fewer buckets and gives you optimal performance.
If, however, one of the designs results in latency issues, you should:

  1. Rethink the meta fields
  2. Perform horizontal or vertical scaling as and when needed.

Warm Regards
Aasawari

Thank you for the input @Aasawari

My requirement is to have query capability on all the fields that are currently part of meta. Since indexes cannot be created on non-meta fields, query lookup time increases drastically if I move the other keys outside meta.

Also, as per your suggestion, increasing the granularity is one approach that I would like to test out to see the results.
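
If I understand correctly, that change would be something like the following (a sketch; clicks is my collection name):

db.runCommand({
  collMod: "clicks",
  // granularity can only be made coarser, e.g. from "seconds" to "minutes"
  timeseries: { granularity: "minutes" }
})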

One question regarding granularity: is granularity only used for grouping documents into storage buckets, or is there any data de-duplication applied while storing the data, depending on the timestamp?

For example, if I have multiple events coming in every minute for the same meta values, will there be any loss of data if I change the granularity from seconds to minutes?