Will NaN entries consume much space?

Hey,

I am filling a time series database. It has multiple columns / fields, but different subsets of them are populated for different documents, leaving the remaining fields empty / NaN. The question is whether these empty fields would consume a lot of space?

Thanks!

If the field will never receive a value for a given document, I feel it is better to leave the field out rather than store a null or NaN value. Space-wise, even if the value does not take much space, the field name itself takes some.
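As a quick illustration (a sketch using the bson package that ships with PyMongo; the field names are made up):

import bson  # the BSON codec that ships with PyMongo

sparse = {"a": 1, "b": 2}
padded = {"a": 1, "b": 2, "c": float("nan")}

# Every stored field costs a type byte, the field name, and the value,
# so the document without the placeholder encodes smaller.
print(len(bson.encode(sparse)))  # 19 bytes
print(len(bson.encode(padded)))  # 30 bytes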

However, if the field will eventually receive a value, I feel it is better to include it from the start. Otherwise, when the field is added later, you may incur the cost of moving the document on disk during the update if the block containing the document is not big enough for the new field and value. More or less, you reserve the space for a future update. A present-but-empty field also tells me that this field is used for this document at some point, which helps me locate the related code.

I do not know enough about the storage engines to say for sure, but I have a good feeling about this.

Hey Steve,

thanks for your reply. I am not 100% sure I got your response right.
As an example, my data looks like this:
Doc 1: {"a": 1, "b": 2, "c": NaN}
Doc 2: {"a": NaN, "b": 2, "c": 3}

so a-c are the columns / fields. Imagine I have hundreds of fields and many documents, and each document fills only some of the fields; no document ever fills all of them. So, when I put all these documents together in the same collection, every document will have NaN values for the fields it does not use. Say I have hundreds of thousands of documents: this could easily result in millions of NaN values for empty fields in my collection. The question was whether these NaNs are inexpensive for MongoDB to store, or whether they would unnecessarily consume a lot of space? In the latter case I would divide the documents into sub-collections, but ideally I want to put them all in the same one.

Thanks!

Best, JZ

Hi @jayzee,

If you have fields that are not present in all documents, I would take advantage of MongoDB’s flexible schema and only include the fields that have actual values.
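For example, if you are building documents from a pandas DataFrame, you could strip the NaN entries before inserting (a minimal sketch; the database, collection, and field names below are placeholders):

import pandas as pd
from pymongo import MongoClient

coll = MongoClient()["mydb"]["mycoll"]  # placeholder names

# Missing cells become NaN in pandas, e.g.:
df = pd.DataFrame([{"a": 1, "b": 2}, {"b": 2, "c": 3}])

# Keep only the fields that actually hold a value in each row.
docs = [
    {k: v for k, v in row.items() if pd.notna(v)}
    for row in df.to_dict("records")
]
coll.insert_many(docs)  # stores {"a": 1.0, "b": 2.0} and {"b": 2.0, "c": 3.0}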

It would be worth reviewing some of the common schema design patterns/anti-patterns to see what may apply to how you are modelling your data:

@steevej: The described behaviour of in-place updates and record padding only applies to the legacy MMAPv1 storage engine. The WiredTiger storage engine (the default since MongoDB 3.2) always writes a new copy of a document, but it is designed for higher concurrency and supports more advanced features. For a more detailed explanation, please see my response on WiredTiger and in-place updates.

Regards,
Stennie


Thanks for the correction.

It makes sense that the whole document is written again on update. That probably leads to better compression too.

Hello,

thanks for the reply. It looks like I am dealing with the Attribute pattern.

So, I will probably arrange my documents as follows:

{"t": time, "metadata": {…}, "data": [{"a": 1}, {"b": 2}]}
{"t": time, "metadata": {…}, "data": [{"b": 2}, {"c": 3}]}

Then there would be no NaN entries for missing values (of c or a in this case), right?
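Concretely, I imagine creating the collection and inserting documents roughly like this (a sketch; the collection name is a placeholder, and the time series options assume MongoDB 5.0+, otherwise a plain collection with the same document shape would do):

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["mydb"]  # placeholder name

ts = db.create_collection(
    "measurements",
    timeseries={"timeField": "t", "metaField": "metadata"},
)

ts.insert_many([
    {"t": datetime.now(timezone.utc),
     "metadata": {"number": 1, "type": "ab"},
     "data": [{"a": 1}, {"b": 2}]},
    {"t": datetime.now(timezone.utc),
     "metadata": {"number": 2, "type": "bc"},
     "data": [{"b": 2}, {"c": 3}]},
])
# Missing attributes are simply absent; no NaN placeholders are stored.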

In the end, the way I would like to query my distributions on the time series collection is the following:

query: data with t from t1 to t2, metadata in {metadata1, metadata2, …}

metadata will contain two fields, 'number' and 'type'.
number selects the corresponding time series, while type selects the type of documents (with specific keys) that are stored in data. For a given value of type, the documents in data will have identical keys, so they can then easily be collected into a pd.DataFrame.
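In PyMongo terms, I picture the query like this (a sketch; t1, t2, and the metadata values are just example placeholders):

from datetime import datetime, timezone

import pandas as pd
from pymongo import MongoClient

coll = MongoClient()["mydb"]["measurements"]  # placeholders as above

t1 = datetime(2021, 1, 1, tzinfo=timezone.utc)  # example bounds
t2 = datetime(2021, 2, 1, tzinfo=timezone.utc)

cursor = coll.find({
    "t": {"$gte": t1, "$lt": t2},
    "metadata.number": {"$in": [1, 2]},
    "metadata.type": "ab",
})

# For a fixed type every data array has the same keys, so the
# attribute pairs can be flattened into one row per document.
rows = [
    {"t": doc["t"], **{k: v for item in doc["data"] for k, v in item.items()}}
    for doc in cursor
]
df = pd.DataFrame(rows)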

Is that about right?