Designing a flexible bucket MongoDB Collection

Chariot · February 14, 2022, 1:27pm

I am designing a collection schema for an upcoming system. We require flexibility: the topic of the documents could be anything, new topics can appear at any time, and the documents could hold any sort of information. We have also done some research and found the article Massive Number of Collections | MongoDB, so we decided to design our collection this way:

_id: { string },
owner: { string },
start_date: { date }, 
end_date: { date },
topic: { string },
keys_values: { Array of MAP }

The data will look like this per document:

_id: "Mr. John Smith|2019-05-01T00:00:00.000+00:00|2019-05-02T00:00:00.000+00:00|sold product",
owner: "Mr. John Smith",
start_date: "2019-05-01T00:00:00.000+00:00", // May 1st
end_date: "2019-05-02T00:00:00.000+00:00",   // May 2nd
topic: "sold product",
keys_values: [
	...,
	{
		timestamp: "2019-05-01T18:00:00.000+00:00",
		values: {
			initial_cost: 100,
			selling_cost: 300,
		},
		custom_tags: {
			product_SKU: "12345",
			branch: "New York",
			company: "My Company"
		}
	},
	{
		timestamp: "2019-05-01T19:00:00.000+00:00",
		values: {
			initial_cost: 200,
			selling_cost: 300,
		},
		custom_tags: {
			product_SKU: "23456",
			branch: "Washington",
			company: "My Company",
			promotion: "SOME SPECIAL PROMO"
		}
	},
	...
]

We have several concerns about this design that we would like to discuss:

The article Massive Number of Collections | MongoDB has sensors storing data in hourly buckets. Our use cases suggest storing data in daily buckets instead. We calculate statistics and inspect the data from multiple angles.

A single document can be at most 16 MB. This means that if a bucket exceeds 16 MB, we could simply create another document and have these documents share an identical _id. When we query, we will loop through all the “big” documents, so this should be fine.

However, our system might receive a huge amount of data all at once. Some of that data may need to go into a single document and could be more than 16 MB. How can we manage this? How can we measure the size of the data and split it into chunks accordingly for MongoDB? What functions should I look at to manage this?

From number 1, if we add more data later and it falls into the buckets of these big documents, how can we manage this? Is there a way to ask MongoDB to push more data only into documents smaller than size X, until they are about to hit the 16 MB limit? Alternatively, is it possible to know whether the data to be added to a document will make it exceed the 16 MB limit? In fact, for big documents measured in megabytes, is it recommended to add more data, or should I create an entirely new document?
For the timestamp field in keys_values, is it possible to add an index? For example, suppose I query for a certain _id for March 1, 2021, and I want only the time range from 10 AM to 11 AM. Without indexing, MongoDB must go through every timestamp in keys_values. Is there an indexing method that instructs MongoDB to use the primary _id index first and then an index on timestamp within that document only?
Is the schema design above logical, or should I not do it this way?

Currently, we are using Node.js and Mongoose.

Pavel_Duchovny · February 15, 2022, 10:08am

Hi @Chariot ,

Bucketing data makes sense for grouped data that needs to be analysed together under a specific classifier. In MongoDB 5.0 we introduced time series collections, which optimise buckets based on time data. That seems like exactly what you are doing here.

Time series collections automatically open new buckets and link documents for you. You can control the bucket granularity, which can be hours for your use case.

Have you tried using this type of collection?
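As a rough sketch of what that could look like from Node.js: this assumes each entry of keys_values becomes its own document and that owner and topic move into a metadata sub-document. The collection name "events", the field name "meta" and the helper names are made up for illustration, and `db` is assumed to be a connected Db handle from the official mongodb driver.

```javascript
// Options for db.createCollection() on MongoDB 5.0+.
// "timestamp" matches the field in the question; "meta" is a hypothetical
// field holding the owner and topic that identify a series.
const timeSeriesOptions = {
  timeseries: {
    timeField: "timestamp",
    metaField: "meta",
    granularity: "hours",
  },
};

// Turn one bucket document from the question into one document per
// measurement, which is the shape a time series collection expects.
function toMeasurements(bucket) {
  return bucket.keys_values.map((entry) => ({
    timestamp: new Date(entry.timestamp), // the time field must be a date
    meta: { owner: bucket.owner, topic: bucket.topic },
    values: entry.values,
    custom_tags: entry.custom_tags,
  }));
}

// Create the collection and load the converted measurements.
// `db` is assumed to be a connected Db handle from the "mongodb" driver.
async function createAndLoad(db, bucket) {
  const events = await db.createCollection("events", timeSeriesOptions);
  await events.insertMany(toMeasurements(bucket));
  return events;
}
```

With this shape there is no keys_values array to grow, so the application no longer has to decide when a bucket document is full.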

You can index inner objects with multikey indexes and use them in compound queries.
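If you stay with the array-based schema, a minimal sketch of such an index and a matching query could look like the following. It assumes the timestamps are stored as dates rather than strings, and the names bucketTimestampIndex and entriesInRange are only illustrative.

```javascript
// Index spec for collection.createIndex(): bucket fields first, then the
// timestamp inside each array element (this makes the index multikey).
const bucketTimestampIndex = { owner: 1, topic: 1, "keys_values.timestamp": 1 };

// Build an aggregation pipeline that finds buckets with at least one entry
// in [start, end) and returns only those entries.
// Assumes timestamps are stored as dates rather than strings.
function entriesInRange(owner, topic, start, end) {
  return [
    {
      $match: {
        owner,
        topic,
        keys_values: { $elemMatch: { timestamp: { $gte: start, $lt: end } } },
      },
    },
    {
      $project: {
        owner: 1,
        topic: 1,
        keys_values: {
          $filter: {
            input: "$keys_values",
            as: "kv",
            cond: {
              $and: [
                { $gte: ["$$kv.timestamp", start] },
                { $lt: ["$$kv.timestamp", end] },
              ],
            },
          },
        },
      },
    },
  ];
}
```

You would pass the spec to collection.createIndex(bucketTimestampIndex) and the pipeline to collection.aggregate(...). Keep in mind that a multikey index only helps select which documents match. The $filter stage still scans the array of each matched document to trim it down to the requested hour.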

I would vote against using those long strings as the _id. The _id is mostly for internal use.
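One possible alternative, sketched here under the assumption that you keep daily bucket documents, is to let MongoDB generate the _id and look buckets up through a compound index on the fields the string was built from. Note that _id values must be unique within a collection, so several overflow documents could not share one _id anyway. The helper names below are hypothetical.

```javascript
const DAY_MS = 24 * 60 * 60 * 1000;

// Non-unique index spec for collection.createIndex(); several overflow
// documents for the same owner, topic and day can then coexist.
const bucketLookupIndex = { owner: 1, topic: 1, start_date: 1 };

// Build an empty daily bucket. No _id is set, so an ObjectId is
// generated automatically when the document is inserted.
function newDailyBucket(owner, topic, startDate) {
  return {
    owner,
    topic,
    start_date: startDate,
    end_date: new Date(startDate.getTime() + DAY_MS),
    keys_values: [],
  };
}

// Filter that finds every bucket document for one owner, topic and day.
function bucketFilter(owner, topic, startDate) {
  return { owner, topic, start_date: startDate };
}
```

A find() with bucketFilter(...) then returns all documents for that day, however many the data had to be split into.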

Thanks
Pavel