Why does a MongoDB collection have a larger size than a PostgreSQL table?

Hi guys. I have the same table with the same indexes on MongoDB and PostgreSQL. The table in MongoDB is about 370 MB and on PostgreSQL it is about 270 MB. Why is this happening? Is there an explanation?

@Pavel_Duchovny Sorry for the disturbance. If you have spare time, please check this! Thanks in advance.

Hi @harris,

It’s impossible to answer this question without more details.
Which versions are you using? Which configurations are you using for both? Anything specific? Which storage engine are you using?
Can you share that collection publicly so we have a chance to see how it’s designed and if some improvements can be made?
Is PostgreSQL using any kind of compression, or was it set up in a way to optimize row compression?

Cheers,
Maxime.


In addition to @MaBeuLux88's questions, I'd also ask how you checked the size. Is this on-disk size? In-memory size?
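For reference, a minimal pymongo sketch of how one might read each of those (the database and collection names are placeholders, not from this thread):

from pymongo import MongoClient

# placeholder names; point these at your own database / collection
stats = MongoClient().get_database("test").command("collStats", "mycoll")
print(stats["size"])         # logical (uncompressed) size of the BSON data
print(stats["storageSize"])  # on-disk size of the data after compression
print(stats["wiredTiger"]["cache"]["bytes currently in the cache"])  # current in-memory footprint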


I am storing my data locally on my PC. I have the default configuration on both of them. I am using MongoDB 4.4 and Postgres 12. I don't use any kind of compression. The commands I execute for MongoDB are:
db.mongodb2indextimestamp1.totalIndexSize() for index size only
db.mongodb2indextimestamp1.dataSize() for collection size without indexes

and in psql i use
SELECT pg_size_pretty( pg_total_relation_size('oneindextwocolumns') ); for table size including indexes
SELECT pg_size_pretty( pg_indexes_size('oneindextwocolumns') ); for the size of indexes only

Are these commands equivalent?

totalSize() actually includes index sizes; you want storageSize() to exclude indexes.
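A quick pymongo sketch of how these collStats fields relate (the collection name is a placeholder):

from pymongo import MongoClient

stats = MongoClient().get_database("test").command("collStats", "mycoll")  # placeholder names
data_size = stats["size"]            # what dataSize() returns: uncompressed BSON
storage_size = stats["storageSize"]  # compressed data on disk, indexes excluded
index_size = stats["totalIndexSize"] # what totalIndexSize() returns
print(storage_size + index_size == stats["totalSize"])  # totalSize is the sum of both (MongoDB 4.4+)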

dataSize gives me the same result as collStats.size (https://docs.mongodb.com/manual/reference/command/collStats/#std-label-collStats-output), so I thought it's the same thing.

From db.mongodb2indextimestamp1.stats() I get:
storageSize 63 MB, size 370 MB, totalIndexSize 36 MB, totalSize 99 MB
Yes, I edited the above answer. I meant dataSize, not totalSize.

I just read that Postgres automatically compresses values > 2K (including json and jsonb) via its TOAST mechanism. Maybe it's because of that…
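A small sketch to see TOAST compression in action, using the same `postgres` Python client that appears later in this thread (the table name and credentials are made up):

from postgres import Postgres

pg = Postgres(url="postgresql://user:password@localhost/db")  # placeholder credentials
pg.run("CREATE TABLE toast_demo(s text)")
pg.run("INSERT INTO toast_demo VALUES (repeat('abc', 5000))")  # 15000 chars, above the ~2K TOAST threshold
print(pg.one("SELECT octet_length(s) FROM toast_demo"))    # 15000: the raw length
print(pg.one("SELECT pg_column_size(s) FROM toast_demo"))  # much smaller: stored compressed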

WiredTiger also compresses data - that's why you see the total size being almost 1/4 of the data size…
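Plugging in the numbers posted above:

size = 370         # MB, uncompressed BSON data
storage_size = 63  # MB, compressed data on disk
total_size = 99    # MB, storageSize + totalIndexSize
print(round(size / storage_size, 1))  # ~5.9x compression on the data itself
print(round(size / total_size, 1))    # ~3.7x, i.e. total on-disk size is roughly 1/4 of the data size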

Asya

So what's the collection's actual size, so I can do a fair comparison with PostgreSQL?

Thanks in advance.
With respect, Harris

Depends on why you are doing the comparison. If it's to figure out how much disk to buy, it would be storage size; if it's how much RAM you need, it would be… more complicated :slight_smile:

Keep in mind that the same ratio wouldn't hold for different collections with different schemas and access patterns. Also, if you want to optimize the amount of disk space used, MongoDB has other compression options which can compress data further.
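For example, compression can be chosen per collection at creation time. A hedged pymongo sketch (names are placeholders; zstd needs MongoDB 4.2+):

from pymongo import MongoClient

db = MongoClient().get_database("test")  # placeholder database
# create a collection whose data blocks are compressed with zstd instead of the default snappy
db.create_collection(
    "coll_zstd",
    storageEngine={"wiredTiger": {"configString": "block_compressor=zstd"}},
)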

Asya

I want to do a fair comparison with psql's pg_total_relation_size command. Should I use the dataSize command and then add totalIndexSize for MongoDB?
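In code, the comparison I have in mind would look something like this (a sketch using my collection and table names from above; the connection details are placeholders):

from pymongo import MongoClient
from postgres import Postgres

db = MongoClient().get_database("test")  # placeholder database name
pg = Postgres(url="postgresql://user:password@localhost/db")  # placeholder credentials

stats = db.command("collStats", "mongodb2indextimestamp1")
mongo_logical = stats["size"] + stats["totalIndexSize"]         # dataSize + indexes
mongo_on_disk = stats["storageSize"] + stats["totalIndexSize"]  # == totalSize
pg_on_disk = pg.one("SELECT pg_total_relation_size('oneindextwocolumns')")
# note: pg_total_relation_size measures on-disk bytes (table + indexes + TOAST)
print(mongo_logical, mongo_on_disk, pg_on_disk)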

Hi again,

This got me interested so I started to write some code to put this to the test a little.

  1. I need MongoDB and Postgres running on my machine:
docker run --rm -d -p 27017:27017 -h $(hostname) --name mongo mongo:4.4.6 --replSet=test && sleep 4 && docker exec mongo mongo --eval "rs.initiate();"
docker run --rm --name postgres -p 5432:5432 -e POSTGRES_USER=max -e POSTGRES_PASSWORD=password -d postgres:13.3
  2. I use Python to generate fake documents and insert them into Postgres and MongoDB.

A document looks like this:

{
	"_id" : 0,
	"string" : "Modi adipisci velit ipsum amet non ut labore aliquam voluptatem quisquam sit magnam porro velit consectetur magnam quisquam velit quisquam sed aliquam labore consectetur voluptatem adipisci voluptatem dolorem ipsum ut non dolorem quaerat est amet tempora dolore modi aliquam adipisci non sed neque velit ipsum porro sed velit aliquam dolore eius quaerat neque tempora ut velit neque velit neque magnam labore labore eius numquam labore dolorem dolore ipsum neque est sit voluptatem dolorem quiquia voluptatem consectetur ut quaerat eius ut aliquam magnam voluptatem quiquia.",
	"int" : 857663
}

The python algo is pretty simple:

  • reset the collection / table.
  • create an index on “int” (MongoDB already has one on _id and so does Postgres on its primary key I suppose).
  • insert 500 times 1000 docs / rows into both systems.
  • check the sizes

For the string generation, I used Lorem Ipsum with a random length between 5 and 750 words and limited the final length to 5000 characters max for Postgres…

from faker import Faker
from lorem.text import TextLorem
from postgres import Postgres
from pymongo import MongoClient

fake = Faker()

lorem = TextLorem(srange=(5, 750))  # nb words min & max
LOOP = 500
NB_DOCS_PER_LOOP = 1000


def rand_docs(id_start, nb):
    return [{
        '_id': id_start + i,
        'string': lorem.sentence()[0:5000],  # 5000 maximum characters for Postgres...
        'int': fake.pyint(min_value=0, max_value=999999)
    } for i in range(nb)]


if __name__ == '__main__':
    client = MongoClient()
    pg = Postgres(url="postgresql://max:password@localhost/max")
    db = client.get_database('max')
    coll = db.get_collection('coll')
    coll.drop()
    coll.create_index("int")
    pg.run("DROP TABLE my_table")
    pg.run("CREATE TABLE my_table(pk SERIAL PRIMARY KEY, string VARCHAR(5000), int integer)")
    pg.run("CREATE INDEX my_index ON my_table(int)")

    for loop in range(LOOP):
        print(f'Loop {loop + 1}/{LOOP} => Inserted {NB_DOCS_PER_LOOP} docs in MDB & rows in PGS.')
        id_start = loop * NB_DOCS_PER_LOOP
        docs = rand_docs(id_start, NB_DOCS_PER_LOOP)
        coll.insert_many(docs)
        pg.run("INSERT INTO my_table(string,int) VALUES " + ','.join('(\'' + i.get('string') + '\',' + str(i.get('int')) + ')' for i in docs))

    mdb_stats = db.command("collstats", "coll")
    print("MongoDB")
    print("Nb docs in MDB: " + str(mdb_stats.get('count')))
    print("Storage size  : " + str(mdb_stats.get('storageSize') / 1000000) + ' MB')
    print("Indexes size  : " + str(mdb_stats.get('totalIndexSize') / 1000000) + ' MB')
    print("\nPostgres")
    print("Nb rows in PGS: " + str(pg.one("SELECT count(*) from my_table")))
    print("Storage size  : " + str(pg.one("SELECT pg_size_pretty(pg_total_relation_size('my_table'))")))
    print("Indexes size  : " + str(pg.one("SELECT pg_size_pretty(pg_indexes_size('my_table'))")))

Results:

Loop 1/500 => Inserted 1000 docs in MDB & rows in PGS.
Loop 2/500 => Inserted 1000 docs in MDB & rows in PGS.
...
Loop 500/500 => Inserted 1000 docs in MDB & rows in PGS.
MongoDB
Nb docs in MDB: 500000
Storage size  : 453.238784 MB
Indexes size  : 24.305664 MB

Postgres
Nb rows in PGS: 500000
Storage size  : 536 MB
Indexes size  : 24 MB

In this scenario, with this particular schema and data generation, the index sizes are similar but MongoDB uses 15.44% less storage than Postgres for the data.

If I wanted to optimize my MongoDB storage further, I could also use the zstd compression algo instead of the default snappy, trading a bit more CPU for additional storage savings.

Altering the schema or the way the string is generated (for example making them longer or shorter) could completely reverse the results. So just like @Asya_Kamsky said, there is basically no way to make this test “fair”. It depends on why you are doing the comparison and what you are trying to optimize. And the results you would come up with would only be valid for your particular use case, schema and data set.

Just for the sake of it, I ran the exact same test but this time I started MongoDB with zstd compression algo instead of snappy to see the difference:

docker run --rm -d -p 27017:27017 -h $(hostname) --name mongo mongo:4.4.6 --replSet=test --wiredTigerCollectionBlockCompressor=zstd && sleep 4 && docker exec mongo mongo --eval "rs.initiate();"

And this time I got this completely unfair result (because it's stock Postgres vs MongoDB using the best compression algo):

MongoDB
Nb docs in MDB: 500000
Storage size  : 274.41152 MB
Indexes size  : 27.693056 MB

Postgres
Nb rows in PGS: 500000
Storage size  : 537 MB
Indexes size  : 24 MB

And this time MongoDB’s storage is 48.90% less than Postgres.

But again… It’s not proving anything. :smiley:

Cheers,
Maxime.


Thank you for your time and explanation. Yes, you are right, it's not proving anything.
I appreciate it a lot!

I want to add my 2 cents.

Your data model is very important.

If you simply copy the normalized SQL schema into your MongoDB data model, you deny yourself a better, more efficient model. I don't know how to explain it better than with an example of an accounting system.

SQL way

  1. One table for a transaction (date, description, customer, …)
  2. Another table for all the accounts and amounts that are credited and debited by the transaction. At least 2 rows, for example the bank account debited and accounts payable credited. Here in Quebec, a sale transaction touches 4 accounts: sales, accounts receivable, federal tax, and provincial tax.

Mongo way

  1. One collection for the transaction, as above
  2. Put the debited/credited accounts and amounts in an array within the transaction document itself.

Here, you save the space of the foreign key (and its index) that the second table would need to relate the credited/debited accounts to the transaction.

Processing-wise, with SQL you need a transaction to update 2 tables and update 2 indexes (or 3 indexes if you want to quickly find transactions for a given account). In Mongo, all is done within a single document with only one index (or 2 indexes if you want to quickly find transactions for a given account).
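A minimal sketch of that embedded model with pymongo (all names and values are made up for illustration):

import datetime
from pymongo import MongoClient

coll = MongoClient().get_database("accounting").get_collection("transactions")  # placeholder names
sale = {
    "date": datetime.datetime(2021, 6, 1),
    "description": "Invoice #1001",
    "customer": "ACME Inc.",
    "entries": [  # embedded array instead of a second table
        {"account": "accounts receivable", "debit": 114.98},
        {"account": "sales", "credit": 100.00},
        {"account": "federal tax", "credit": 5.00},
        {"account": "provincial tax", "credit": 9.98},
    ],
}
coll.insert_one(sale)                 # one document, one atomic write, no foreign key
coll.create_index("entries.account")  # to quickly find transactions for a given account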

Since I upgraded from SQL to MongoDB, I might be missing some of the latest improvements made on the SQL side. My comment is based on the SQL schema of one of the most popular accounting systems for small business.
