JSON format too bloated

We store JSON documents that are not too small in MongoDB, and we will be adding tens of millions of documents every month.

We found that storing 30 million documents eats up 2 TB of storage.

As the format is JSON, there is a lot of repetition in each document (every key repeats in every document). So we started to shorten the keys and were able to reduce the size to 300 GB.

That’s good, but I feel it’s still way too much. If we didn’t have the JSON boilerplate, we could easily reduce it by another factor of 10.

On the other hand, I feel that messing around with JSON keys, transforming them from something human-understandable like ‘baseCurrencyAmount’:‘EUR’ into something barely readable like ‘baCA’:‘EUR’, is in general the wrong direction. But you can also see that the actual value is still smaller than the key.
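For illustration, the kind of renaming we apply looks roughly like this (the mapping below is a simplified sketch; only baseCurrencyAmount → baCA comes from the example above, the rest is invented):

```python
# Hypothetical key map applied on the write path; the reverse map is
# applied on reads so application code keeps the readable names.
KEY_MAP = {"baseCurrencyAmount": "baCA", "targetCurrencyAmount": "taCA"}

def shorten_keys(doc: dict) -> dict:
    """Rename top-level keys via KEY_MAP before insert; unknown keys pass through."""
    return {KEY_MAP.get(k, k): v for k, v in doc.items()}

doc = {"baseCurrencyAmount": "EUR", "targetCurrencyAmount": "USD"}
print(shorten_keys(doc))  # {'baCA': 'EUR', 'taCA': 'USD'}
```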

Is there any hint you can give? Is this a case where a document database is not the right tool and we need to go back to good old SQL?

And a side question: it seems that MongoDB’s compression compresses each document separately. If it somehow compressed them jointly, my guess is that the key lengths would not play a role any more.
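A quick stdlib-only experiment supports that guess: compressing many documents together lets the compressor deduplicate the repeated keys, while per-document compression cannot (zlib here is just an illustration, not MongoDB’s actual storage path):

```python
import json
import zlib

# 1,000 small documents sharing the same long keys (hypothetical schema).
docs = [{"baseCurrencyAmount": "EUR", "targetCurrencyAmount": "USD", "value": i}
        for i in range(1000)]

# Compressing each document on its own: the keys repeat in every output.
per_doc = sum(len(zlib.compress(json.dumps(d).encode())) for d in docs)

# Compressing all documents jointly: each repeated key is encoded once
# per compression window, so key length barely matters any more.
joint = len(zlib.compress(json.dumps(docs).encode()))

print(per_doc, joint)  # joint comes out far smaller than per_doc
```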

Hi @karto_sack

Welcome to MongoDB community!

Shortening document keys is a known way to reduce document size.

However, there are several additional options:

  1. If you have small documents that are accessed together and have similar values, you may consider using the bucket pattern to store the changing values in an array. This works as long as your arrays keep a reasonable number of elements.
    Building with Patterns: The Bucket Pattern | MongoDB Blog
  2. You can change the default compression of the storage layer to get better compression at the cost of some additional CPU overhead:
    https://docs.mongodb.com/manual/core/wiredtiger/#compression
  3. You can always consider sharding, spreading the data across different replica sets, allowing you to scale the collection.
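As a rough sketch of option 1, bucketing could look like this in application code (the field names, the list-of-lists readings layout and the bucket size are assumptions for illustration, not a prescribed schema):

```python
BUCKET_SIZE = 100  # keep arrays at a reasonable number of elements

def bucketize(measurements):
    """Group per-event documents (same sensor, sorted by ts) into bucket
    documents, so the shared keys are stored once per bucket instead of
    once per event."""
    buckets = []
    for i in range(0, len(measurements), BUCKET_SIZE):
        chunk = measurements[i:i + BUCKET_SIZE]
        buckets.append({
            "sensor_id": chunk[0]["sensor_id"],
            "start": chunk[0]["ts"],
            "end": chunk[-1]["ts"],
            # only the changing values go into the array
            "readings": [[m["ts"], m["value"]] for m in chunk],
        })
    return buckets
```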

Best
Pavel


Hi @karto_sack
Reducing document size by altering the JSON structure is a very interesting topic. I did some experiments (links to the details are at the end of this post) and, in a nutshell, here is what is worth considering:

  • using short keys, like you said
  • dropping unused fields (for instance, if you use Spring, a very popular Java framework, it adds a _class field to every document, and it’s useless unless you use class inheritance)
  • using proper data types: many developers don’t check how their data is serialized and don’t know that numbers and dates are sometimes serialized as strings, which is not optimal in terms of storage size
  • using ObjectId instead of strings
  • omitting fields that hold a default value: for instance, if 99% of the amounts in your DB are in USD, just skip that symbol and store only currencies other than USD; the same goes for flags: if 99% of users have some flag set to true, store only those that have it false, and you can save a lot of space
  • skipping serialization of empty fields, i.e. avoiding {someField: null}
  • considering a flat structure instead of complex types, for instance instead of {user: {name: 'John', surname: 'Doe'}} you can do {usrName: 'John', usrSurname: 'Doe'}; there will be some gain if you have millions of documents
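The defaults-and-nulls bullets above can be combined into one write-side helper. A minimal sketch, assuming a hypothetical DEFAULTS table and field names (note that this pattern cannot distinguish “field was null” from “field had the default”):

```python
_MISSING = object()  # sentinel so non-default keys are always kept

# Hypothetical defaults: e.g. 99% of documents use USD, most flags are true.
DEFAULTS = {"currency": "USD", "active": True}

def slim(doc: dict) -> dict:
    """Drop null fields and fields equal to their default before insert."""
    return {k: v for k, v in doc.items()
            if v is not None and DEFAULTS.get(k, _MISSING) != v}

def restore(doc: dict) -> dict:
    """Re-apply defaults after reading a slimmed document."""
    return {**DEFAULTS, **doc}
```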

You can find results of my experiments here: Impact of data model on MongoDB database size Series' Articles - DEV Community

Regards,
Michał.


I have not read the report yet, but I will. With what you share in the post, I can certainly say: thanks for sharing this, very informative. I want to emphasize the point about proper data types,

not only for size considerations but for performance too, and for the richer API that dates stored as Date provide.

I have reservations about the flat-structure point,

but I do not have the time today to write about it. I will come back and clarify my thoughts.


My reservations about the flat structure vs. the object tree structure:

That is, documents with a flat schema like

Flat = { _id : 1 , userName : "John" , userSurname : "Doe" }

versus documents with an object tree structure like

Object = { _id : 1 , user : { Name : "John" , Surname : "Doe" } }

My argument is that the Object is more code friendly than the Flat one.

1 - It is easier to $project only the user information in or out with Object:

{ "$project" : { "user" : false } }

vs

{ "$project" : { "userName" : false , "userSurname" : false } }

If we throw schema evolution in, the $project does not have to change with the Object schema. It does with Flat.

2 - The code is cleaner with Object

A function that processes user information can be called with Object.user; it will work with generic field names like name and surname, and it will not have access to fields outside the user object. With Flat, you would need to pass each user field separately or pass the whole document. If you pass the fields separately, you can still write code that works with generic field names, but if you pass the document, the function may access fields that are not related to the user. Now, say you have another collection where the users are correspondents, so you would have a “from:” object and a “to:” object. The same function would work with Object.from and Object.to. Doing that with Flat would be more complicated.

After all, the O in jsOn stands for object. The arguments in favour of object-oriented programming would go down the drain if the flat structure were better for programming. Besides, most of the time when you do SQL, you map your multiple tables and records into objects anyway.
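A short sketch of that argument (Python stands in for any client language; the from/to field names are from the example above):

```python
def full_name(person: dict) -> str:
    """Works on any sub-document with generic name/surname fields."""
    return f"{person['name']} {person['surname']}"

# Object schema: the same function serves "user", "from" and "to".
message = {
    "_id": 1,
    "from": {"name": "John", "surname": "Doe"},
    "to": {"name": "Jane", "surname": "Roe"},
}
print(full_name(message["from"]))  # John Doe
print(full_name(message["to"]))   # Jane Roe

# Flat schema (fromName, fromSurname, toName, toSurname): the same reuse
# needs per-field arguments or prefix-aware code.
```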

Now, as for the size consideration, I was not convinced that Flat was better. Intuitively, I was surprised that repeating the user prefix in the Flat structure would result in a smaller size. After all, user, name and surname are 4 characters fewer than userName and userSurname. So why would short keys be better in one case but not in this one?

So I came up with a bigger document structure.

Object = {
  _id: ObjectId("6550dd9de0d1abae5c24671b"),
  user: {
    name: 1,
    surname: 2,
    alias: 3,
    phone: 4,
    address: {
      street: 5,
      city: 6
    }
  }
}

versus

Flat = {
  _id: ObjectId("6550d897e0d1abae5c246719"),
  userName: 1,
  userSurname: 2,
  userAlias: 3,
  userPhone: 4,
  userAddressStreet: 5,
  userAddressCity: 6
}

The average object size for Object is 114 B while Flat is 127 B, as indicated by $collStats. So my guess is that with only userName and userSurname, the sweet spot where the Object overhead becomes smaller had not yet been reached.
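The 114 B vs. 127 B figures come from BSON via $collStats, but a stdlib-only approximation on the JSON text already shows the same trend. A rough sketch (ignoring the _id, which is identical in both):

```python
import json

obj = {"user": {"name": 1, "surname": 2, "alias": 3, "phone": 4,
                "address": {"street": 5, "city": 6}}}
flat = {"userName": 1, "userSurname": 2, "userAlias": 3, "userPhone": 4,
        "userAddressStreet": 5, "userAddressCity": 6}

# Compact serialization, no spaces, so only keys and structure differ.
obj_len = len(json.dumps(obj, separators=(",", ":")))
flat_len = len(json.dumps(flat, separators=(",", ":")))
print(obj_len, flat_len)  # the nested form is shorter
```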

Finally, even

{_id:1,user:{Name:"John",Surname:"Doe",Phone:1}}

has fewer characters than

{_id:1,userName:"John",userSurname:"Doe",userPhone:1}

To finish, I want to say that all the other points are valid, and I want to add to the point about skipping null fields:

this will be a double bonus, because if someField is indexed, the index can be a sparse (usually smaller) one.

I am one of the clicks on the link you shared. Thanks again for sharing.


Wow Steeve, nice findings, thank you for pointing out my mistake.
Good to know I’m not the only person in the world counting every byte in JSON.

We are fewer and fewer.

Not a mistake. The spark that forced a deeper understanding.

“I have not failed. I’ve just found 10,000 ways that won’t work.” - Thomas Edison

Just to be clear, my main argument was not about saving space. It was about keeping the Object schema rather than the Flat one. My conclusion would have been the same even if the space requirement had been bigger. Sometimes performance should not be the first goal; otherwise we would still be doing everything in assembler.

I was just very happy that the more functional and elegant schema ended up having better storage metrics.