Does MongoDB deserialize the BSON from files before indexing?

I want to understand, step by step, what phases the MongoDB database goes through when it indexes data.

Does MongoDB save the BSON-serialized data to files first, and later deserialize that data when needed and index it as a B-tree in memory for searching?

Can someone please explain this step by step?

Hi @sakdkjkj_jjsdjds,

Indexes associated with fields affected by a write operation are updated on the fly.

The transformation between on-disk files and in-memory structures is managed by the WiredTiger storage engine, and data is made persistent by periodic checkpoints that flush the cache to files.

For this reason, maintaining many indexes may slow down writes and require more memory for each write.

The tradeoff between having indexes to support your queries and having too many indexes is crucial in performance tuning.
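As a rough illustration, here is a minimal PyMongo sketch (assuming a local mongod on the default port; the `shop`/`orders` names are hypothetical, used only for this example). Each extra index is one more structure WiredTiger must update on every write touching those fields:

```python
from pymongo import MongoClient, ASCENDING

# Assumes a local mongod on the default port; "shop" and "orders"
# are hypothetical names used only for illustration.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Each index is an extra structure that must be updated on every
# insert/update touching these fields, costing write time and cache memory.
orders.create_index([("customer_id", ASCENDING)])
orders.create_index([("created_at", ASCENDING)])

# This single insert now updates the _id index plus the two indexes above.
orders.insert_one({"customer_id": 42, "created_at": "2021-01-01", "total": 9.99})
```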

Thanks
Pavel

Hi @Pavel_Duchovny,

Thank you very much for taking the time to explain the steps to me! :blush:

Is it safe to assume that WiredTiger caches all the BSON documents when MongoDB starts up, or does MongoDB cache only documents that have indexes?

And do changes made to the cached data later affect the BSON documents on disk, so that they are persisted?

Hi @sakdkjkj_jjsdjds,

Any object that needs to be altered is loaded into the cache, including documents that are not indexed.

When you just read data, it is also loaded. However, if data is not accessed, it is not loaded into memory.

An index is stored in a separate file, so it has its own pages to be loaded.

The data on disk is written in a different format than pure BSON, but in memory it is eventually deserialized into BSON documents.
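To make the serialization part concrete, here is a small sketch using the `bson` module that ships with PyMongo (assuming PyMongo 3.9+, where `bson.encode`/`bson.decode` are available). It shows a document round-tripping between its in-memory form and its serialized BSON bytes; the on-disk WiredTiger format itself is different, as noted above:

```python
import bson  # the bson package that ships with PyMongo

doc = {"name": "Ada", "scores": [1, 2, 3]}

# Serialize the in-memory document to BSON bytes...
raw = bson.encode(doc)
print(type(raw), len(raw))  # <class 'bytes'> and the encoded size

# ...and deserialize those bytes back into a document.
decoded = bson.decode(raw)
assert decoded == doc
```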

Thanks
Pavel

@Pavel_Duchovny I see :smile: that makes a lot of sense.

But does that mean the first access to data (not yet cached) will be a slower operation, because MongoDB has to go through the collection on disk to find it?

Does MongoDB cache only the accessed document, or the whole collection (all the documents inside), when data needs to be accessed?

And is the cached data freed again after some time if there happens to be no activity for that collection?

Yes.

If the working set is not yet in memory, it will have to be fetched from disk.

Ideal performance for your primary is when your entire working set fits into 80% of your WiredTiger cache. If that is not possible due to size limits, try to fit at least the indexes into those 80%, as this will keep disk access minimal and direct.

For example, a 32GB server will by default have a ~16GB WiredTiger cache, and 80% of that cache is ~13GB.

MongoDB caches WiredTiger pages; the number of documents that fit in the cache depends on the size of the documents.
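As a rough check, here is a sketch (again assuming PyMongo and a reachable mongod) that reads the configured WiredTiger cache size from serverStatus and computes the 80% working-set budget mentioned above:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

# serverStatus exposes WiredTiger cache metrics; values are in bytes.
status = client.admin.command("serverStatus")
cache = status["wiredTiger"]["cache"]

configured = cache["maximum bytes configured"]
in_use = cache["bytes currently in the cache"]

print(f"configured cache:        {configured / 2**30:.1f} GiB")
print(f"currently in cache:      {in_use / 2**30:.1f} GiB")
# Rule of thumb from above: aim to keep the working set within ~80% of the cache.
print(f"80% working-set budget:  {0.8 * configured / 2**30:.1f} GiB")
```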