I’m probably going to go a bit wide here, but bear with me. Let’s say I have a MongoDB 5 time series collection which is essentially:
{
  time,        // The timeField
  metadata: {  // The metaField
    key        // The "id" for this particular reading
  },
  value        // Some value, mainly numeric or boolean, or perhaps string
}
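For completeness, here is how I imagine such a collection would be set up (the options object below is my sketch; the collection name is the one used in the aggregation further down, and the granularity matches the per-second updates mentioned below):

```javascript
// Sketch of the time series collection options matching the document
// shape above. These option names follow MongoDB 5's createCollection
// "timeseries" settings.
const tsOptions = {
  timeseries: {
    timeField: "time",      // the timeField in the shape above
    metaField: "metadata",  // the metaField holding { key }
    granularity: "seconds"  // updates arrive roughly every second
  }
};

// In mongosh:
// db.createCollection("mytimeseriescollection", tsOptions);
```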
I have several questions about being somewhat efficient when the number of documents is in the billions. Let’s pretend I have a couple of keys: numeric_key1, numeric_key2 … numeric_keyN.
I’m going to assume the following indexes exist: time, metadata.key, and time + metadata.key. I can expect updates to come in every second for thousands of inputs, so granularity is in seconds.
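Spelled out as index key specs, I mean something like the following (the field order in the compound index is my own guess; I’ve put the metaField first since queries filter on key before the time range):

```javascript
// Key specs for the three assumed indexes. The field order in the
// compound index ({ "metadata.key": 1, time: 1 }) is an assumption.
const indexSpecs = [
  { time: 1 },                    // time alone
  { "metadata.key": 1 },          // key alone
  { "metadata.key": 1, time: 1 }  // key + time, for key-scoped time ranges
];

// In mongosh, each would be created with:
// indexSpecs.forEach(spec => db.mytimeseriescollection.createIndex(spec));
```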
- As long as all queries/aggregations match a specific key or set of keys, does splitting into different databases or different collections matter at all? I have a hunch: no. Assuming you drop the key matching, I’m guessing: yes.
- If doing aggregations to get e.g. 1-minute averages quite often, does it ever make sense to store the finished aggregate values in another collection for simplified lookups? Are there any mechanisms to achieve this?
- Being quite new to MongoDB and the aggregation framework, is there any obvious way to improve upon the following aggregation query?
db.mytimeseriescollection.aggregate([
  { $match: {
      "metadata.key": { $in: [ "numeric_key1", "numeric_key2", "numeric_keyN" ] },
      "time": { $gt: ISODate("2021-11-01T01:00:00.000+00:00") }
  } },
  { $project: {
      "value": 1,
      "key": "$metadata.key",
      "y": { $year: "$time" },
      "mon": { $month: "$time" },
      "dom": { $dayOfMonth: "$time" },
      "h": { $hour: "$time" },
      "min": { $minute: "$time" }
  } },
  { $group: {
      "_id": { year: "$y", month: "$mon", day: "$dom", hour: "$h", minute: "$min" },
      "value": { $avg: "$value" },
      "key": { $first: "$key" }
  } },
  { $sort: { "_id": 1 } }
])
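To make the second question above more concrete, here is a rough sketch of the kind of roll-up I imagine, writing 1-minute averages into a separate collection with $merge. The target collection name minute_averages is made up, and using $dateTrunc (new in MongoDB 5.0) instead of my year/month/day projection is just an idea:

```javascript
// Sketch of a roll-up pipeline that pre-aggregates 1-minute averages
// per key and upserts them into a separate (made-up) collection.
const rollupPipeline = [
  // Only roll up readings after some cutoff time
  { $match: { time: { $gt: new Date("2021-11-01T01:00:00Z") } } },
  // Group per key and per minute; $dateTrunc (MongoDB 5.0+) truncates
  // the timestamp down to the start of its minute
  { $group: {
      _id: {
        key: "$metadata.key",
        minute: { $dateTrunc: { date: "$time", unit: "minute" } }
      },
      value: { $avg: "$value" }
  } },
  // Upsert the results into another collection keyed on _id
  { $merge: { into: "minute_averages", whenMatched: "replace" } }
];

// In mongosh:
// db.mytimeseriescollection.aggregate(rollupPipeline);
```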
Thanks for taking the time, and any help given!