Is there any convenient way to sum up values in 90 collections?

Hi! I’m currently working with big data that grows by around 2 million documents each day. To improve performance, we split the data into different collections, one collection per day of data. The problem is that we need to calculate a sum over 90 days. Is there any convenient way to sum up values across 90 collections? Or is there another way to design our database for this purpose?

I would have used sharding in a case like that.



How are you measuring the performance? And what benefit did you see by moving from a single huge collection to 90 smaller ones? I would expect to see worse performance overall due to the added overhead of merging results from these 90 collections.

As Steeve says, you should consider sharding. Some people have had success in the past with “micro-sharding”, i.e. creating multiple mongod nodes per host to build large sharded clusters: say 16 shards on only 4 hosts, with primary nodes spread across the hosts and each shard’s replica set still distributed over 3 physical hosts. You need to be very careful to set cacheSizeGB correctly when co-locating multiple mongod processes per host.
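By default each mongod assumes it owns the whole machine (the WiredTiger cache defaults to roughly 50% of RAM minus 1 GB), so co-located processes left on the default will fight over memory. A per-process cap can be set in each node’s config file; the 8 GB figure below is purely illustrative:

```yaml
storage:
  wiredTiger:
    engineConfig:
      cacheSizeGB: 8   # cap per mongod process; size so all co-located caches fit in RAM
```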

Hi @1119,

Have you considered creating another collection to help with this? Check out Building with Patterns: A Summary, specifically the Computed Pattern. Instead of aggregating across 90 days worth of documents (potentially millions and millions of documents), it may be worth storing the values that need to be summed in another place.

For example, consider creating a collection called sum that contains the following documents:
{ _id: "20200313_00", values: [ 1, 2, 3, 4, 5 ... ] }
{ _id: "20200313_01", values: [ 6, 7, 8, 9, 10 ... ] }

When you insert a new value into your primary collection (the one you’re currently aggregating), perform another update statement using upsert. Simply append the most recent value to be summed to the values array in a single document within the sum collection.
db.sum.updateOne( { _id: "20200313_01" }, { $push: { values: 11 } }, { upsert: true } )

Your application knows the current date (e.g. 2020-03-13 translates to 20200313), the current hour (e.g. 01:00 translates to 01), and the value you’ll later sum, so it can build the _id 20200313_01 itself. Keeping a document per hour allows storing lots of data compactly for easy summing.
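As a sketch, the bucket key could be derived in application code like this (hourlyBucketId is a hypothetical helper name, not something from the thread):

```javascript
// Hypothetical helper: build the hourly bucket _id ("YYYYMMDD_HH")
// from a JavaScript Date, matching the format shown above.
function hourlyBucketId(date) {
  const pad = (n) => String(n).padStart(2, "0");
  return (
    `${date.getUTCFullYear()}${pad(date.getUTCMonth() + 1)}${pad(date.getUTCDate())}` +
    `_${pad(date.getUTCHours())}`
  );
}

hourlyBucketId(new Date(Date.UTC(2020, 2, 13, 1))); // "20200313_01"
```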

Later, when you need to retrieve the sum, aggregate on the sum collection: (1) $match a range of _id values covering the last 90 days, and (2) $group with $sum to reduce the arrays of values across all 90 days to a single total.
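Given the document shape above, the aggregation could look something like the following (a sketch, not from the thread; the plain JavaScript mirrors the two pipeline stages over sample documents so you can see what they compute):

```javascript
// In mongosh the query could be written as something like:
//   db.sum.aggregate([
//     { $match: { _id: { $gte: "20200313_00", $lte: "20200610_23" } } },
//     { $group: { _id: null, total: { $sum: { $sum: "$values" } } } }
//   ])
// The zero-padded _id strings sort lexicographically in date order,
// so a plain string range works as the 90-day $match.

// The same two stages, reproduced over in-memory sample documents:
const docs = [
  { _id: "20200313_00", values: [1, 2, 3, 4, 5] },
  { _id: "20200313_01", values: [6, 7, 8, 9, 10] },
  { _id: "19991231_23", values: [999] }, // outside the window
];
const lo = "20200313_00", hi = "20200610_23"; // illustrative 90-day bounds
const total = docs
  .filter((d) => d._id >= lo && d._id <= hi)                          // $match
  .reduce((acc, d) => acc + d.values.reduce((a, v) => a + v, 0), 0);  // $group + $sum
// total === 55
```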

You can make this even more scalable by simply storing the sum per day instead of the original values:
{ _id: "20200313", value: 1000 }
{ _id: "20200314", value: 2000 }
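With the per-day layout, each insert into the primary collection could bump a single counter instead of pushing onto an array (again a sketch; the update shown is one plausible form):

```javascript
// In mongosh, maintaining the daily total might look like:
//   db.sum.updateOne({ _id: "20200313" }, { $inc: { value: 11 } }, { upsert: true })
// A plain-JS model of that upsert-and-increment behaviour:
const daily = new Map();
function addToDay(dayId, v) {
  // A missing day is treated as 0, then incremented (upsert semantics).
  daily.set(dayId, (daily.get(dayId) ?? 0) + v);
}
addToDay("20200313", 1000);
addToDay("20200313", 11);
daily.get("20200313"); // 1011
```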

You have many options using the computed pattern.

Schema tricks like this are very scalable. Not having to aggregate across millions of documents at a time will likely even remove the need to shard.

Nice idea.

Thanks for sharing. The Computed Pattern was an interesting read.

I would be worried about consistency, but maybe doing both updates in a transaction could fix that. I have not worked with transactions.

An alternative might be to run a process at the end of the day that updates the sum collection with the previous day’s data.
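For instance, a nightly job could aggregate yesterday’s per-day collection and write one rolled-up document into sum (sketch only; the collection naming and the $merge form are assumptions, not from the thread):

```javascript
// Sketch of a nightly rollup in mongosh, assuming one raw collection per day
// (e.g. "raw_20200313") whose documents carry the number in a `value` field:
//   const day = "20200313";
//   db.getCollection("raw_" + day).aggregate([
//     { $group: { _id: day, value: { $sum: "$value" } } },
//     { $merge: { into: "sum", whenMatched: "replace" } }
//   ])

// The equivalent rollup over an in-memory list of yesterday's raw documents:
const raw = [{ value: 3 }, { value: 4 }, { value: 5 }]; // sample data
const rollup = { _id: "20200313", value: raw.reduce((a, d) => a + d.value, 0) };
// rollup.value === 12
```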

That sparks a couple of related ideas.