Avoiding disk fragmentation

Hi everybody,

I’m working on a MongoDB database that will contain a few dozen collections, each holding time-based sensor data that grows at 1 Hz (one new reading per second). These collections will become very big (on the order of ~10 GB each), and good query performance is mandatory.

Since all collections grow simultaneously at this 1 Hz rhythm, my fear is that their files on the hard drive will become heavily “intertwined” (fragmented) and that query performance will suffer due to excessive head movement of the drive.

My question:

What “chunk size” (granularity) does MongoDB use when it needs to enlarge a collection’s file on disk?
Is it possible to configure that setting? I’d like to be able to set it to a very high value to reduce file fragmentation.

Kind regards,
Arthur Hoornweg

Hi @Arthur_Hoornweg and welcome to the community!!

As far as I know, the file-growth granularity (“chunk size”) is not configurable in MongoDB: the WiredTiger storage engine manages its own on-disk block allocation. However, disk fragmentation may or may not be the primary cause of the performance issues you are anticipating.

For example, using an SSD might lead to better performance compared to a spinning disk. The right hardware configuration (RAM, CPU) for the workload would also have a sizeable impact on performance.

In my opinion, before going deep into disk fragmentation optimisation, it’s best to ensure that your deployment follows the settings recommended in the production notes and the operations checklist.

However, while your understanding and concern regarding fragmentation are valid, it is also important to note that, aside from growing collections and indexes, other parts of the system may also grow and contribute to fragmentation over time (e.g. MongoDB logs, system logs, and other files outside of MongoDB’s control). Also, if you’re using an SSD, its wear-levelling algorithm may create additional fragmentation in order to extend the drive’s life.
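
If you do want to measure fragmentation at the collection level, WiredTiger exposes reuse statistics through `collStats`, and the `compact` command can release reusable space. Here is a minimal sketch using PyMongo; the database and collection names (`sensors`, `sensor_data`) are placeholders for illustration:

```python
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["sensors"]

# collStats reports both the logical data size and the on-disk file size;
# a large gap suggests internal fragmentation (reusable free space).
stats = db.command("collStats", "sensor_data")
print("data size (bytes):   ", stats["size"])
print("storage size (bytes):", stats["storageSize"])

# On a WiredTiger deployment, the stats also report how much of the
# collection file is free and available for reuse.
reusable = stats["wiredTiger"]["block-manager"]["file bytes available for reuse"]
print("reusable (bytes):    ", reusable)

# compact rewrites the collection in place and releases unused space.
# It can block some operations, so run it in a maintenance window.
db.command({"compact": "sensor_data"})
```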

In general, query performance is affected by the following considerations:

  1. The schema design for the application.
  2. Whether the fields you query on are indexed (see the index sketch after this list).
  3. Whether you run aggregation queries involving a large number of documents and calculations.
  4. How compressible the documents are, since this affects how much data fits in the filesystem cache.
  5. Hardware specifications such as the amount of RAM, the CPU, and the disk type.
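
On point 2, making sure your common query shape is covered by an index usually matters far more than on-disk layout. As a hedged example (the field names `sensor_id` and `ts` are assumptions about your schema), a compound index supporting per-sensor time-range queries might look like this:

```python
from datetime import datetime, timedelta, timezone
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://localhost:27017")
coll = client["sensors"]["sensor_data"]

# Compound index: equality on sensor_id first, then range on timestamp.
# Per-sensor time-range queries can then scan a contiguous index range.
coll.create_index([("sensor_id", ASCENDING), ("ts", ASCENDING)])

# A typical query this index supports: the last hour of one sensor.
cursor = coll.find({
    "sensor_id": "pump-01",
    "ts": {"$gte": datetime.now(timezone.utc) - timedelta(hours=1)},
})
```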

Also, while I don’t have a complete understanding of your use case, you may be interested in exploring time-series collections, available from MongoDB 5.0 onwards.
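
For instance, here is a minimal sketch of creating a time-series collection with PyMongo. The collection and field names (`sensor_data_ts`, `ts`, `meta`) are placeholders; adjust them to your schema. Time-series collections bucket measurements internally, which typically reduces storage size and makes the concern about many files growing in lockstep less relevant:

```python
from datetime import datetime, timezone
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
db = client["sensors"]

# Requires MongoDB 5.0+. Measurements sharing the same metaField value
# are grouped into buckets internally, reducing storage and index overhead.
db.create_collection(
    "sensor_data_ts",
    timeseries={
        "timeField": "ts",         # required: the timestamp of each reading
        "metaField": "meta",       # optional: identifies the data source
        "granularity": "seconds",  # matches a 1 Hz ingest rate
    },
)

db["sensor_data_ts"].insert_one({
    "ts": datetime.now(timezone.utc),
    "meta": {"sensor_id": "pump-01"},
    "value": 42.0,
})
```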

If you need further help, could you share a few details:

  1. The MongoDB version you are on.
  2. The steps or method of analysis that led you to conclude that fragmentation will be a concern in the future.

Thanks
Aasawari