Advice needed for how to best model product sales stats. Timeseries collections?

The application is an e-commerce statistics product where we want to track a couple of metrics for every product, such as:

  • Sales total (we’ve sold this product for e.g. $1234 in total)
  • Total profit margin (sales total minus acquisition cost)
  • Average profit margin (e.g. 61.4%)
  • Number of units sold
    …etc

We want these numbers aggregated on a daily, monthly and overall level.

On a daily level, we want to retain data for the last 365 days.
On a monthly level we want to retain data for the last 48 months.
This is just to make sure that the collections don’t get too big over time.

Calculating these numbers is a solved problem, so that’s not the challenge. The challenge is how to store it in an efficient and scalable way.

Expected write frequency would be up to a few hundred times a week per product and a few hundred times per day for the whole collection.
Expected read frequency of the whole collection would be in the tens-hundreds per minute when usage spikes.

My first idea was to have a plain old collection (i.e. not a time-series collection) where each document looks like this:

{
	_id: "ProductId",
	salesTotal: 14352.00, // just $inc every time we sell
	totalSoldUnits: 435, // just $inc every time we sell
	averageMarginLast48Months: 0.614, // Can be recalculated from "months.$.averageProfitMargin.margin"
	months: [
		// cap: 48
		{
			key: { year: 2022, month: 1 },
			salesTotal: 14352.00,
			averageProfitMargin: {
				margin: 61.08,
				/* In order to recalculate this margin whenever an order is shipped or returned, we'll
				   need to keep all the numbers needed to compute a new average. Hence this "basedOn" array.
				   So for example when a new order is shipped, we'll just push a new document to the
				   "basedOn" subarray and then update the margin to (sum of basedOn.$.marginAmount / sum of basedOn.$.total).
				   The "basedOn" array should hopefully never grow too large since it'll only contain a months worth of orders. */
				basedOn: [
					{ orderLine: "{ordernumber}_{lineIdentifier}", quantity: 2, total: 234.34, marginAmount: 143.00 },
					...
				]
			}
		},
		...
	],
	days: [
		// cap: 365
		{
			date: "2022-01-01",
			soldUnits: 23,
			returnedUnits: 0
		},
		...
	]
}

And then implement the capping of the months and days subarrays using $push with the $slice modifier.
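Roughly what I have in mind for the capped push (the collection name and the values are just placeholders; field names follow the example document above):

db.productStats.updateOne(
	{ _id: "ProductId" },
	{
		$push: {
			days: {
				$each: [{ date: "2022-05-11", soldUnits: 5, returnedUnits: 0 }],
				$sort: { date: 1 },  // keep the array in date order
				$slice: -365         // retain only the most recent 365 entries
			}
		}
	}
)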

This design would lead to fairly large documents. Is there a chance we might hit the max document size or the max collection size? Would it be better if I moved the months and days subarrays into their own collections? Then I wouldn’t risk hitting the max document size, but I’m not sure how to enforce the cap of 48 (or 365) docs per product. Is there a way to achieve this (in an elegant way, without a bunch of db roundtrips)?

A third option would be to use the new time series collection feature. Then I wouldn’t have to store these consolidated numbers at all, but could calculate them on the fly every time using window functions. But would that put a big load on the cluster? Would read times scale poorly with the number of products?
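To make the idea concrete, I’m picturing something along these lines (the collection name, field names and window are just placeholders, e.g. a "sales" time series collection keyed on productId as the metaField):

db.createCollection("sales", {
	timeseries: { timeField: "ts", metaField: "productId", granularity: "hours" }
})

// Running sales total per product, computed on the fly at read time
db.sales.aggregate([
	{ $match: { ts: { $gte: ISODate("2022-01-01T00:00:00Z") } } },
	{
		$setWindowFields: {
			partitionBy: "$productId",
			sortBy: { ts: 1 },
			output: {
				runningSalesTotal: {
					$sum: "$total",
					window: { documents: ["unbounded", "current"] }
				}
			}
		}
	}
])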

I could do some experimenting by generating a ton of dummy data and trying out all three options, but before I do that, I thought it would be nice to hear some learnings from people who have already walked this road.

Hello @John_Knoop, here are some thoughts.

My first idea was to have a plain old collection…

It looks like you are storing calculated data (the totals, etc.). You could instead store just the raw data and calculate the totals, averages, etc., at query time. Depending upon your use case (performance requirements, data size, available resources, etc.), you can divide the computation load between the write and read operations.

My first suggestion is to store only one year of data per product, in one document. This allows 12 months’ and 365 days’ worth of data in the document. In this case, the document size per product is not large (my guess is a few hundred kilobytes out of the maximum possible 16 megabytes).

One option is to create a document for each product up front and initialize all the fields, including the array fields. For example, the dailyData array field would have 365/366 elements like:

{ date: ISODate("2022-01-01T00:00:00Z"), totalSales: 0, totalUnits: 0 },
{ date: ISODate("2022-01-02T00:00:00Z"), totalSales: 0, totalUnits: 0 },
// ...

Similarly, monthlyData would be initialized with 12 elements.

The document per product would look like this:

{
    _id: "ProductId",
    yearlySalesTotal: 4352.00,
    yearlyTotalSoldUnits: 689,
    year: 2022,
    dailyData: [
        { date: ISODate("2022-01-01T00:00:00Z"), totalSales: 123.45, totalUnits: 39 },
        { date: ISODate("2022-01-02T00:00:00Z"), totalSales: 900.45, totalUnits: 297 },
        //...
        { date: ISODate("2022-05-10T00:00:00Z"), totalSales: 12.45, totalUnits: 2 }
        // ... maximum 366 elements
    ],
    monthlyData: [
        { month: 1, totalSales: 123.45, totalUnits: 39 },
        { month: 2, totalSales: 456.70, totalUnits: 111 },
        // ... maximum 12 elements
    ]
}
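Since the arrays are pre-initialized, each sale can then be recorded with a single update using the filtered positional operator. A rough sketch (the collection name and values are illustrative; field names follow the example document above):

db.productStats.updateOne(
    { _id: "ProductId", year: 2022 },
    {
        $inc: {
            yearlySalesTotal: 123.45,
            yearlyTotalSoldUnits: 3,
            "dailyData.$[d].totalSales": 123.45,
            "dailyData.$[d].totalUnits": 3,
            "monthlyData.$[m].totalSales": 123.45,
            "monthlyData.$[m].totalUnits": 3
        }
    },
    {
        arrayFilters: [
            { "d.date": ISODate("2022-05-10T00:00:00Z") },
            { "m.month": 5 }
        ]
    }
)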

The second option is to store the raw data in one document per product per day, for example:

{
    _id: "ProductId",
    salesTotal: 1435.00,
    totalSoldUnits: 221,
    date: ISODate("2022-05-11T10:00:00Z")
}

This requires minimal computation at write time. When you read the data, you aggregate it as per your output needs.
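On the read side, a monthly roll-up could then be an aggregation along these lines (the collection and field names are illustrative):

db.dailySales.aggregate([
    { $match: { date: { $gte: ISODate("2022-01-01T00:00:00Z") } } },
    {
        $group: {
            _id: { productId: "$productId", year: { $year: "$date" }, month: { $month: "$date" } },
            salesTotal: { $sum: "$salesTotal" },
            totalSoldUnits: { $sum: "$totalSoldUnits" }
        }
    },
    { $sort: { "_id.productId": 1, "_id.year": 1, "_id.month": 1 } }
])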

The third option is to store data per product per month in a document.
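For instance, such a per-product-per-month document might look like this (field names are just illustrative):

{
    productId: "ProductId",
    year: 2022,
    month: 5,
    salesTotal: 4352.00,
    totalSoldUnits: 689
}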

Finally, using some of these ideas you can create a model that suits your needs best.

NOTE: As for maximum collection size, there is no set limit imposed by the server. I believe it is mostly limited by the file size limits of the host running the database server.

Hi @Prasad_Saya

It sounds like we see pretty much the same options.

What would tip the scales in favor of option 1, instead of putting every individual transaction in a time-series collection (every single one, not even aggregated by day) and calculating all the numbers on the fly using window functions? Obviously that would shift some compute load from the write side to the read side, and I know that I will have lots more reads than writes, but are time-series collections so performant that it’s not a problem?

@John_Knoop, you are in a better position to figure out what is best based upon your workload and requirements. I suggest you go over these two articles:

And, as you consider time series collections, be aware of this: Time Series Collection Limitations.