Referencing frequently updated parts of a document instead of embedding them, for performance?

Michael_Niemand · March 11, 2022, 11:04am

Hey guys,

I’m new to the forum, so please be gentle.
I operate an application that uses MongoDB. We use Atlas, which is great.
We have one very active collection which I have a modelling question about:

Each document has lots of fields, but only a few of them are updated regularly. Those few, however, are updated very often (roughly every 30 seconds).
Would it improve performance to pull the regularly updated parts out into their own collection and reference them instead of embedding them in the big documents? This way, only the new collection (possibly even using the new time series collections?) would be updated often.

Pavel_Duchovny · March 11, 2022, 11:18am

Hi @Michael_Niemand ,

Our usual guidance is clear: data that is accessed together should be stored together. So if your queries need all the other fields as well, they should be stored together regardless of their update frequency.

There is another known antipattern to keep in mind: don’t frequently query large documents, and try to avoid getting close to the 16 MB limit per document.

Now let me ask you the following:

What is the average size of those large documents? You can see it in the Atlas Data Explorer when you select that collection.
Will you need only those frequently updated fields in your access pattern? Then it makes sense to separate them …

Time series collections make sense for time series data; I would not assume they are a good fit for other types of data…

Read more here about what time series data is and why to use time series collections…

Thanks
Pavel

Michael_Niemand · March 11, 2022, 11:30am

Thanks Pavel for the lightning fast reply, it is highly appreciated!
The entire collection is 368.89MB with 58913 total documents and 181.03MB index size, so we’re far from the 16 MB limit. The values we update in those fields are in fact time series data, but right now we use InfluxDB for that and only update the MongoDB collection with the latest value regularly. Hence the idea of using the time series collections.

So long story short: you don’t think updating only a few fields of “big” documents is detrimental to performance, correct?
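For reference, the average document size asked about earlier can be estimated from these totals. This is only a back-of-the-envelope sketch: it assumes the 368.89MB figure is the collection's data size and that MB means 10**6 bytes, neither of which is stated in the thread. Under those assumptions the average comes out at roughly 6 kB per document.

```python
# Figures quoted in the post above; "MB" is assumed to mean 10**6 bytes.
TOTAL_SIZE_MB = 368.89
TOTAL_DOCUMENTS = 58913

# MongoDB's per-document limit of 16 MB, expressed in bytes.
DOCUMENT_LIMIT_BYTES = 16 * 1024 * 1024

avg_bytes = TOTAL_SIZE_MB * 1_000_000 / TOTAL_DOCUMENTS
share_of_limit = avg_bytes / DOCUMENT_LIMIT_BYTES

print(f"average document size: {avg_bytes / 1000:.1f} kB")
print(f"share of the 16 MB limit: {share_of_limit:.3%}")
```

If the quoted figure is compressed storage size rather than logical data size, the real average would be somewhat larger, but still nowhere near the limit.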

Pavel_Duchovny · March 11, 2022, 11:35am

Hi @Michael_Niemand ,

Well, if it’s time series data, updating it constantly sounds weird … How is the time series data updated? Are you pre-aggregating it and pushing it into arrays inside the documents?

I would like to better understand your use case… How is the document with 30+ fields related to a time series sample?

Ty

Michael_Niemand · March 11, 2022, 12:58pm

Each document represents a computer. InfluxDB holds the time series data for temperature, CPU load, etc., but a map in the computer's document is updated at a regular interval with the latest value, like 50 degrees, 30%, or whatever.
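A minimal sketch of what such an update could look like, assuming the map lives in a field called `latest` and the keys are named `temperatureC` and `cpuLoadPct` (all hypothetical names, none of them are from our actual schema). Using dotted paths with `$set` touches only the listed map entries and leaves the rest of the document alone.

```python
def build_latest_update(computer_id, samples, map_field="latest"):
    """Build a (filter, update) pair that overwrites only the given map keys.

    `map_field` and the sample names are hypothetical; adapt them to the schema.
    """
    if not samples:
        raise ValueError("samples must not be empty")
    # Dotted paths such as "latest.temperatureC" address single entries of the
    # embedded map, so other fields of the document are not rewritten by $set.
    set_fields = {f"{map_field}.{name}": value for name, value in samples.items()}
    return {"_id": computer_id}, {"$set": set_fields}


query, update = build_latest_update(
    "computer-42", {"temperatureC": 50, "cpuLoadPct": 30}
)
print(query)
print(update)
# With PyMongo, this pair could be passed as collection.update_one(query, update).
```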

Pavel_Duchovny · March 12, 2022, 11:39am

Hi @Michael_Niemand ,

So it’s not time series data per se, just the latest sample for each computer.

I would recommend embedding those values.

Another recommendation is to move the data from InfluxDB into MongoDB using time series collections…
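As a rough sketch only, this is what the options and a single sample for such a time series collection might look like. The collection name, the field names `ts` and `computerId`, and the choice of `"seconds"` granularity are assumptions for illustration and would need to match the real data.

```python
from datetime import datetime, timezone

# Hypothetical names: none of these come from the thread.
COLLECTION_NAME = "computer_metrics"
TIMESERIES_OPTIONS = {
    "timeField": "ts",          # field holding the sample timestamp
    "metaField": "computerId",  # field identifying the series (one per computer)
    "granularity": "seconds",   # samples are assumed to arrive about every 30 seconds
}


def build_sample(computer_id, temperature_c, cpu_load_pct, ts=None):
    """Build one measurement document for the time series collection."""
    return {
        "ts": ts or datetime.now(timezone.utc),
        "computerId": computer_id,
        "temperatureC": temperature_c,
        "cpuLoadPct": cpu_load_pct,
    }


sample = build_sample("computer-42", 50, 30)
print(TIMESERIES_OPTIONS)
print(sample)
# With PyMongo, the options could be passed as
# db.create_collection(COLLECTION_NAME, timeseries=TIMESERIES_OPTIONS)
# and each sample inserted with insert_one(sample).
```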

Thanks
Pavel
