Disk Utilization pegged at near 100%

I’m currently running the following setup and am baffled as to what is wrong.

3 x r5.2xlarge EC2 instances (8 CPU / 64GB Memory)

Each server has the following 3 drives set up.

  • Data - 150GB - EBS GP3 w/ 3000 IOPS (125MB/s)
  • Journal - 10GB - EBS GP3 w/ 3000 IOPS (125MB/s)
  • Logs - 10GB - EBS GP3 w/ 3000 IOPS (125MB/s)

I’m running into an issue when I start approaching around 1k inserts per second. I see the following happen during my load testing.

  • Write tickets plummet to 7 of 128
  • Disk util % is pegged at 90-100%, but only on the Journal drive
  • Operation execution time on writes jumps to around 150-200ms
  • IOWAIT sits at 5-6%
  • Normalized CPU is 60-70% max

Now my first instinct is that the drive is maxed out, but what baffles me is that my IOPS are only at 1200/s while I have up to 3000/s available. I’ve also tried using Provisioned IOPS and it didn’t change anything.
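(For anyone comparing numbers: below is a minimal sketch, assuming pymongo and a locally reachable mongod, of how the write-ticket and write-latency figures above can be read from serverStatus. The `wiredTiger.concurrentTransactions` field applies to pre-7.0 servers and may be named differently on newer versions.)

```python
# Illustrative sketch only: read write-ticket availability and average write
# latency from serverStatus. The connection string is a placeholder, and
# concurrentTransactions applies to pre-7.0 WiredTiger layouts.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
status = client.admin.command("serverStatus")

# Available vs. total WiredTiger write tickets (e.g. the "7 of 128" above).
tickets = status["wiredTiger"]["concurrentTransactions"]["write"]
print("write tickets:", tickets["available"], "of", tickets["totalTickets"])

# opLatencies reports cumulative latency (microseconds) and op counts since startup.
writes = status["opLatencies"]["writes"]
if writes["ops"]:
    print("avg write latency (ms):", writes["latency"] / writes["ops"] / 1000)
```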

What am I missing here?

I have the same issue. Did you find anything regarding this?

Hi @Denis_Bartashevich welcome to the community!

Are you seeing exactly the same thing, e.g. the same provisioning in AWS, write ticket numbers dropping, disk util %, iowait, etc.?

Generally, if the disk appears to be maxed out, you might want to check whether you’re using a burstable EBS type, which I think is the default setting in EC2. I have seen burstable EBS volumes exhibit this exact behaviour: the disks seem to be underutilized, but all signs point to overworked disks.

The typical solution is to provision a larger disk (since larger disks usually come with a larger burst credit balance), or move to a Provisioned IOPS disk with a setting that matches your needs.

See Amazon EBS volume types, specifically the section titled I/O credits and burst performance, for more details about their behaviour.
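To check whether burst credits are actually the issue, the BurstBalance CloudWatch metric is worth watching; it is emitted for burstable volume types such as gp2, st1 and sc1 (gp3 has a fixed baseline instead). A rough boto3 sketch, assuming credentials are configured and with the region and volume ID as placeholders:

```python
# Rough sketch: fetch the BurstBalance metric for an EBS volume over the last
# few hours. The region and volume ID below are placeholders.
from datetime import datetime, timedelta, timezone
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")
now = datetime.now(timezone.utc)

resp = cw.get_metric_statistics(
    Namespace="AWS/EBS",
    MetricName="BurstBalance",
    Dimensions=[{"Name": "VolumeId", "Value": "vol-0123456789abcdef0"}],
    StartTime=now - timedelta(hours=3),
    EndTime=now,
    Period=300,                 # 5-minute datapoints
    Statistics=["Average"],
)
for point in sorted(resp["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f"{point['Average']:.1f}%")
```

A balance that drops toward 0% around the time the journal disk pegs at 100% would support the burst-credit theory.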

Best regards
Kevin

@Denis_Bartashevich we reduced indexes, scaled the servers up drastically, and adjusted the write concern for some higher-volume writes that were not as important. We also split the drives for journal, logs, and data. I was never able to come to a clear conclusion as to what exactly the cause was.
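(For readers wondering what “adjusted the write concern” can look like in practice, here is a hypothetical pymongo sketch; the collection names and chosen settings are only examples, not the actual configuration used.)

```python
# Hypothetical sketch: relax the acknowledgement level for high-volume,
# lower-importance inserts, trading some durability for lower write latency.
from pymongo import MongoClient, WriteConcern

client = MongoClient("mongodb://localhost:27017")
db = client["appdb"]

# Default handle: inherits the database's write concern for important data.
orders = db.get_collection("orders")

# Relaxed handle: w=1 without waiting for the journal, for less critical writes.
metrics = db.get_collection("metrics", write_concern=WriteConcern(w=1, j=False))

metrics.insert_one({"event": "page_view", "ts": 1234567890})
```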

@kevinadi, as I wrote… we had tested provisioned IOPS with no change to the problem.

Hi @kevinadi, we are seeing the disk util % and iowait. We didn’t check the write ticket numbers.

We don’t see a major performance issue, just the disk util at 100% and iowait (5%).

To be precise, every 60s (at flush time), disk util rises from 30% to 100% on the secondaries and from 90% to 100% on the primary, for about 60s.
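(That 60s cycle lines up with WiredTiger’s default checkpoint interval. A hypothetical way to correlate it, assuming pymongo; the statistic names come from the WiredTiger section of serverStatus and can vary slightly between server versions.)

```python
# Hypothetical sketch: sample checkpoint and dirty-cache stats every 10s so the
# flush cycle can be lined up against the disk-util spikes. Uses .get() because
# exact stat names vary between MongoDB versions.
import time
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")

for _ in range(6):
    wt = client.admin.command("serverStatus")["wiredTiger"]
    txn, cache = wt["transaction"], wt["cache"]
    print(
        "checkpoints:", txn.get("transaction checkpoints"),
        "| last checkpoint (ms):", txn.get("transaction checkpoint most recent time (msecs)"),
        "| dirty bytes in cache:", cache.get("tracked dirty bytes in the cache"),
    )
    time.sleep(10)
```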

We are currently using m5.2xlarge instances with an 80GB IO1 volume at 3k IOPS (50% used) and around 3k oplog ops/s. Secondaries are used for replication only.

So far we are using one GP3 volume for the OS and one IO1 volume for MongoDB; we will try creating 3 volumes for MongoDB like @jarrett mentioned.

@jarrett what do you mean by “scaled the servers drastically”? In our case I don’t see how scaling up (adding more CPU/RAM) can help in any way, as current CPU usage is 25%.

I have added an extra volume exclusively for the journal, and we see 100% disk usage at only 500 IOPS. So that’s probably where our problem is.
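(For reference, a small sketch of how per-device write IOPS and an approximate util % can be sampled with psutil; the device name is a placeholder for whatever `lsblk` shows for the journal volume, and `busy_time` is Linux-specific.)

```python
# Illustrative sketch: sample write IOPS and approximate utilisation for one
# device each second. "nvme2n1" is a placeholder for the journal volume.
import time
import psutil

DEVICE = "nvme2n1"

prev = psutil.disk_io_counters(perdisk=True)[DEVICE]
for _ in range(10):
    time.sleep(1)
    cur = psutil.disk_io_counters(perdisk=True)[DEVICE]
    write_iops = cur.write_count - prev.write_count
    busy_ms = cur.busy_time - prev.busy_time   # ms the device was busy (Linux only)
    print(f"write IOPS: {write_iops:5d}   approx util: {min(busy_ms / 10, 100):5.1f}%")
    prev = cur
```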


Hi @Denis_Bartashevich

Glad you found the root cause. Yes, that graph looks like the disk is struggling to keep up with the work assigned to it. If you can provision a faster disk, you might see some improvement.

Best regards,
Kevin
