DISK UTIL jumps to 100% after adding just a marginal amount of data

I am running a query that selects a large data set. The query, among other things, includes a date range. To test it, I start with a smaller date range and slowly increase it to include an ever larger number of records. As I run my queries, I am watching the real-time monitoring page for my Atlas cluster, which looks like this:

The breakdown goes like this:
Feb 1 to Feb 8 - query takes 2 seconds (returns 25k records).
Feb 1 to Feb 15 - query takes 4 seconds (returns 35k records).
Feb 1 to Feb 18 - query doesn't even finish and times out (should return about 40k records).
On that 3rd query, DISK UTIL jumps to 99% and stays there for the entire duration of the query (5 min.), while the first 2 queries barely move it. I know for a fact that the data is evenly distributed throughout the month, so there is no way those 3 extra days contain a disproportionately large amount of data.
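
For reference, the shape of the query is roughly the sketch below (a minimal pymongo example; the connection string, database, collection, and field names are placeholders rather than my actual schema):

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
coll = client["mydb"]["events"]  # placeholder database/collection names

# Date-range filter that I widen step by step (Feb 1-8, then 1-15, then 1-18);
# the year is purely illustrative.
date_filter = {"createdAt": {"$gte": datetime(2024, 2, 1), "$lt": datetime(2024, 2, 18)}}

# max_time_ms asks the server to abort the query instead of letting it run for minutes
docs = list(coll.find(date_filter).max_time_ms(60_000))
print(len(docs), "records returned")
```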

Why such a sudden and drastic change to DISK UTIL (this is Disk I/O, I assume)? What sends it over the edge? And is there a way to monitor it at the application level to scale it back somehow if I see it jump?

Hi @Zarif_Alimov,

Why such a sudden and drastic change to DISK UTIL (this is Disk I/O, I assume)? What sends it over the edge?

Based on the title and the examples you have provided, I assume your main concern here is the sudden and drastic change in the Disk Util metric after adding a “marginal” amount of data; please correct me if I am wrong. The exact cause is hard to determine with the information currently at hand, as there could be many reasons for it, but I have provided some details below that may help you narrow down what the issue may be.

With that said, the Disk Util % metric is defined in Atlas as:

The percentage of time during which requests are being issued to and serviced by the partition. This includes requests from any process, not just MongoDB processes.

You can see the definition of each metric on the Metrics page of your cluster(s) by selecting the info icon, as detailed in this post.

As noted in the Use Case column on the same Available Charts page for the Util % metric:

Monitor whether utilization is high. Determine whether to increase the provisioned IOPS or upgrade the cluster.

You can check the Disk IOPS metric alongside the Disk Util % metric to see whether there is any correlation and whether you are hitting the IOPS limit configured for your cluster. Note that some storage configurations can utilise burst credits, which allow a temporary increase in IOPS for a cluster. I would also recommend going over the Fix IOPS Issues documentation to see if it helps, as it contains further troubleshooting details.

However, it is important to note that the Disk IOPS metric alone may not be sufficient to conclude that IOPS exhaustion is the issue. Instead, I would recommend reviewing multiple metrics together to narrow things down and reach a more accurate conclusion. There are scenarios in which IOPS are not exhausted but Disk Util still spikes, because more expensive operations cause higher I/O wait (visible within the System CPU metrics) or increased Disk Latency.

Additionally, the I/O request size should be considered here. As another example, a single operation that inserts a large document may use only minimal Disk IOPS but could take some time to complete, which would then lead to increased Util %.
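
One way to check whether the query itself becomes much more expensive as the date range widens (for example, examining far more documents than it returns, or falling back to a collection scan) is to look at its explain output. A minimal sketch, assuming pymongo and the same placeholder names as above:

```python
from datetime import datetime
from pymongo import MongoClient

client = MongoClient("mongodb+srv://<user>:<password>@<cluster>/")  # placeholder URI
coll = client["mydb"]["events"]  # placeholder database/collection names

date_filter = {"createdAt": {"$gte": datetime(2024, 2, 1), "$lt": datetime(2024, 2, 18)}}

# executionStats verbosity reports how much work the server actually did for this query
plan = coll.database.command(
    "explain",
    {"find": coll.name, "filter": date_filter},
    verbosity="executionStats",
)
stats = plan["executionStats"]
print("docs examined:", stats["totalDocsExamined"])
print("keys examined:", stats["totalKeysExamined"])
print("docs returned:", stats["nReturned"])
```

If the number of documents examined grows far faster than the number returned once the range crosses a certain point, that would line up with the sudden jump in disk activity you are seeing.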

And is there a way to monitor it at the application level to scale it back somehow if I see it jump?

I am not too sure about application-level monitoring for this, but on the Atlas side you can configure alerts to be sent when certain criteria are met. Please see the Configure Alert Settings page for more information.
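
If you do want something closer to application-level monitoring, one option is to poll the Atlas Administration API for the disk measurements and react in your own code. Below is a rough sketch, assuming Python with requests and a programmatic API key; the endpoint path and measurement names are my best understanding of the API and should be verified against the current documentation, and all identifiers are placeholders:

```python
import requests
from requests.auth import HTTPDigestAuth

# Placeholder credentials and identifiers; the endpoint path and measurement names
# below are assumptions based on the Atlas Administration API docs.
PUBLIC_KEY = "<public-key>"
PRIVATE_KEY = "<private-key>"
GROUP_ID = "<project-id>"
PROCESS = "<host>:<port>"       # e.g. a replica set member host and port
PARTITION = "<partition-name>"  # e.g. the data partition shown in the Metrics page

url = (
    f"https://cloud.mongodb.com/api/atlas/v1.0/groups/{GROUP_ID}"
    f"/processes/{PROCESS}/disks/{PARTITION}/measurements"
)
params = {
    "granularity": "PT1M",  # one-minute resolution
    "period": "PT1H",       # last hour
    "m": ["DISK_PARTITION_UTILIZATION", "DISK_PARTITION_IOPS_TOTAL"],
}

resp = requests.get(url, params=params, auth=HTTPDigestAuth(PUBLIC_KEY, PRIVATE_KEY))
resp.raise_for_status()

# Print the most recent non-empty data point for each requested measurement
for measurement in resp.json().get("measurements", []):
    points = [p for p in measurement["dataPoints"] if p["value"] is not None]
    if points:
        print(measurement["name"], "->", points[-1]["value"])
```

Your application could run something like this on a schedule and throttle or defer its heavier queries when utilization stays high.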

Hope this helps.

Regards,
Jason
