High IO After Upgrade to 3.6

I’ve recently upgraded mongo from 3.4 to 3.6 and noticed a significant increase in IOPS. With 3.4 the disk utilization was around 50%, and now with 3.6 it’s up to 80-90%.

It appears that as of 3.6 there is a change that causes the journal to be flushed to disk more frequently:
https://jira.mongodb.org/browse/SERVER-37233

The issue listed above states that this is known and is in fact working as designed. However, the issue also claims that the increase in IOPS is isolated to primary members, whereas what I’m seeing is that secondaries are affected as well. The issue also states that if the disk becomes loaded, MongoDB gives precedence to write ops over journal flushes, effectively preserving the performance expected before the upgrade.

However, I cannot really rely on that, as I’m seeing the disk IO climb to 90%. Is there a way to handle the increase in IO, or reduce it somehow? Has anyone else come across this issue?

Hi @vonschnappi_N_A, welcome to the community!

You are correct that the higher disk activity is expected in MongoDB 3.6 compared to 3.4. This is due to improvements that make replication more reliable. Specifically, in 3.6 and newer, MongoDB only replicates journaled data. This makes replication more reliable, since if the primary crashes while replicated writes have not yet been journaled, it could end up with “holes” in its oplog.
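If you want to confirm that the extra IO is journal activity, one place to look is the WiredTiger log (journal) counters in serverStatus. A rough sketch; exact counter names may vary slightly between versions, and connection details are omitted:

```bash
# Dump WiredTiger journal (write-ahead log) statistics. Watching
# "log sync operations" and "log bytes written" grow between samples
# gives a sense of how much journal flushing is going on.
mongo --quiet --eval 'printjson(db.serverStatus().wiredTiger.log)'
```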

Having said that, MongoDB 3.6 is no longer supported as of April 2021. I would encourage you to upgrade to the latest version (currently 5.0.9) for bug fixes and performance improvements.

Best regards
Kevin

Hi Kevin and thank you for your response.

Regarding 3.6 not being supported, you’re absolutely right. I’m actually in the process of upgrading to the latest MongoDB release, so I need to pass through each major version along the way.
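For anyone making the same version-by-version climb: each hop also requires bumping featureCompatibilityVersion before moving to the next release. A minimal sketch of the 3.4 → 3.6 step (run against the primary; the version string changes at every hop):

```bash
# Check the current feature compatibility version (FCV).
mongo --quiet --eval 'printjson(db.adminCommand({getParameter: 1, featureCompatibilityVersion: 1}))'

# Once the 3.6 binaries are running and the replica set is healthy,
# raise the FCV so the next upgrade step is allowed.
mongo --quiet --eval 'printjson(db.adminCommand({setFeatureCompatibilityVersion: "3.6"}))'
```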

After some investigation I realized that I was looking at the wrong metric: disk util% rather than IOPS. MongoDB 3.6 does show higher IO, but since I am using a provisioned IOPS disk, that shouldn’t be a problem. The disk being 90% utilized is no cause for concern as long as it can serve requests.

I’m relying on this explanation:

%util: utilization
If this value is below 100%, then the device is completely idle for some of the time. It means there's definitely spare capacity.

However if this is at 100%, it doesn't necessarily mean that the device is saturated. As explained before, some devices require multiple outstanding I/O requests to deliver their full throughput. So a device which can handle (say) 8 concurrent requests could still show 100% utilisation when only 1 or 2 concurrent requests are being made to it all the time. In that case, it still has plenty more to give. You'll need to look at the throughput (kB/s), the number of operations per second, and the queue depth and service times, to gauge how heavily it is being used.

Taken from this post.
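For example, to watch the numbers the quote mentions instead of %util alone, something like the following works. This is a sketch assuming a Linux host with sysstat installed; /dev/xvdf is an example device name, and exact column names vary by sysstat version:

```bash
# Extended device stats every 5 seconds:
#   r/s + w/s       -> total IOPS, to compare against the provisioned limit
#   rkB/s, wkB/s    -> throughput
#   avgqu-sz, await -> queue depth and wait/service times
iostat -x 5 /dev/xvdf
```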

Whoever comes across this post, please note that the increase in IO is expected; as @kevinadi explained, it makes replication and crash recovery more reliable and efficient. Make sure the disks you give your mongo have provisioned IOPS, and enough of them to serve your requests.
