High CPU usage on Mongo server, but Mongo seems to be idle

Setup:

We are running MongoDB 4.2.13 as a replica set with one primary and two secondaries. The servers have 4 CPUs and 16 GB of RAM (m5.xlarge instances with gp2 disks) and are dedicated to Mongo. The primary is used mostly for writes, while our reads are mainly served from the secondaries.
We are running Mongo with the default config, except for transactionLifetimeLimitSeconds which is set to 900.

Problem:

During the load tests we regularly run into situations where the primary gets stuck. The load average climbs to ~9, and yet, watching mongotop and mongostat, Mongo doesn't seem to be performing any (significant) db operations at that time.
We couldn't find any hint even with the profiler turned on (profiling level 1, logging operations slower than 40 ms).
db.currentOp() also didn't reveal any obvious abnormalities, and neither did the mongod log.
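
For reference, this is roughly how the profiler was enabled on each database (mongo shell):

// Profiling level 1: log operations slower than the given threshold (40 ms here)
db.setProfilingLevel(1, 40)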

A slice of our mongostat and mongotop output at the time of very high CPU load:

Load average during that time:

 13:24:00 up 13 days, 23:04,  3 users,  load average: 10.73, 9.43, 7.35
 13:24:01 up 13 days, 23:04,  3 users,  load average: 10.73, 9.43, 7.35
 13:24:02 up 13 days, 23:04,  3 users,  load average: 10.73, 9.43, 7.35
 13:24:03 up 13 days, 23:04,  3 users,  load average: 10.73, 9.43, 7.35
 13:24:04 up 13 days, 23:04,  3 users,  load average: 10.73, 9.43, 7.35
 13:24:05 up 13 days, 23:04,  3 users,  load average: 10.59, 9.42, 7.36
 13:24:06 up 13 days, 23:04,  3 users,  load average: 10.59, 9.42, 7.36
 13:24:07 up 13 days, 23:04,  3 users,  load average: 10.59, 9.42, 7.36
 13:24:08 up 13 days, 23:04,  3 users,  load average: 10.59, 9.42, 7.36
 13:24:09 up 13 days, 23:04,  3 users,  load average: 10.59, 9.42, 7.36
 13:24:10 up 13 days, 23:04,  3 users,  load average: 10.54, 9.43, 7.37
 13:24:11 up 13 days, 23:04,  3 users,  load average: 10.54, 9.43, 7.37
 13:24:12 up 13 days, 23:04,  3 users,  load average: 10.54, 9.43, 7.37
 13:24:13 up 13 days, 23:04,  3 users,  load average: 10.54, 9.43, 7.37
 13:24:14 up 13 days, 23:04,  3 users,  load average: 10.54, 9.43, 7.37
 13:24:15 up 13 days, 23:04,  3 users,  load average: 9.86, 9.31, 7.34
 13:24:16 up 13 days, 23:04,  3 users,  load average: 9.86, 9.31, 7.34
 13:24:17 up 13 days, 23:04,  3 users,  load average: 9.86, 9.31, 7.34
 13:24:18 up 13 days, 23:04,  3 users,  load average: 9.86, 9.31, 7.34
 13:24:19 up 13 days, 23:04,  3 users,  load average: 9.86, 9.31, 7.34
 13:24:20 up 13 days, 23:04,  3 users,  load average: 9.87, 9.32, 7.36
 13:24:21 up 13 days, 23:04,  3 users,  load average: 9.87, 9.32, 7.36
 13:24:22 up 13 days, 23:04,  3 users,  load average: 9.87, 9.32, 7.36
 13:24:23 up 13 days, 23:04,  3 users,  load average: 9.87, 9.32, 7.36
 13:24:24 up 13 days, 23:04,  3 users,  load average: 9.87, 9.32, 7.36
 13:24:25 up 13 days, 23:04,  3 users,  load average: 9.96, 9.35, 7.38
 13:24:26 up 13 days, 23:04,  3 users,  load average: 9.96, 9.35, 7.38
 13:24:27 up 13 days, 23:04,  3 users,  load average: 9.96, 9.35, 7.38
 13:24:28 up 13 days, 23:04,  3 users,  load average: 9.96, 9.35, 7.38
 13:24:29 up 13 days, 23:04,  3 users,  load average: 9.96, 9.35, 7.38
 13:24:30 up 13 days, 23:04,  3 users,  load average: 10.68, 9.51, 7.44
 13:24:31 up 13 days, 23:04,  3 users,  load average: 10.68, 9.51, 7.44
 13:24:32 up 13 days, 23:04,  3 users,  load average: 10.68, 9.51, 7.44
 13:24:33 up 13 days, 23:04,  3 users,  load average: 10.68, 9.51, 7.44
 13:24:34 up 13 days, 23:04,  3 users,  load average: 10.68, 9.51, 7.44
 13:24:35 up 13 days, 23:04,  3 users,  load average: 10.47, 9.48, 7.44
 13:24:36 up 13 days, 23:04,  3 users,  load average: 10.47, 9.48, 7.44
 13:24:37 up 13 days, 23:04,  3 users,  load average: 10.47, 9.48, 7.44
 13:24:38 up 13 days, 23:04,  3 users,  load average: 10.47, 9.48, 7.44
 13:24:39 up 13 days, 23:04,  3 users,  load average: 10.47, 9.48, 7.44

If needed, I can provide anything you find relevant from db.serverStatus() for that time period.

Any clue that would help us determine the cause of this issue would be very much appreciated. 🙂

Hi @Sasa_Trifunovic, welcome to the community!

At a glance, I see from your mongostat output that your dirty% numbers are more than 20%. At this level, the server will stop processing incoming commands and will try extremely hard to flush those dirty pages to disk.

The key values of dirty% are:

  • At >5%, it will try very hard to flush dirty pages, and you should see some slowdowns in responsiveness.
  • At >20%, it will stop processing incoming work to flush dirty pages, and you should see the server stalling.

So although you don’t see any operation in progress in mongostat, I suspect you’ll be seeing the disk fully utilized at this point, since the server is busy processing the backlog of work.
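
If you want to confirm this independently of mongostat, a minimal mongo shell sketch like the one below computes the same dirty percentage from db.serverStatus(); it uses the standard WiredTiger cache counters, which is also where mongostat's dirty% column comes from:

// Dirty cache as a percentage of the configured WiredTiger cache size
var cache = db.serverStatus().wiredTiger.cache;
var dirtyPct = 100 * cache["tracked dirty bytes in the cache"] / cache["maximum bytes configured"];
print("dirty %: " + dirtyPct.toFixed(1));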

As to what caused this, I believe there are two main things:

  1. It’s a sign that you’re pushing more work than the hardware can handle. Provisioning larger hardware should mitigate this.
  2. Setting transactionLifetimeLimitSeconds back to its default value should also mitigate this (see the snippet after this list). The default is 60 seconds, and raising it to 900 (15 times the default) will incur increased memory usage during heavy transaction use.
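
For reference, the parameter can be changed at runtime against the primary as shown below (60 is the server default); it can also be set at startup via --setParameter:

// Reset transactionLifetimeLimitSeconds to the default value
db.adminCommand({ setParameter: 1, transactionLifetimeLimitSeconds: 60 })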

Best regards
Kevin

Hi Kevin,

First and foremost, thank you for the help and the warm welcome. 🙂

We are currently running load tests and, as has happened before, Mongo started stalling, but our gp2 disks are not utilized all that much. They seem to be hit in a few bursts. We also watched I/O with iotop directly, which confirmed the metrics we have in Prometheus: the I/O came in bursts, and according to the AWS metrics the disks were underutilized.

We tried bumping our instance to an m5.2xlarge (double the previous size), but after the first successful test it started stalling again when we ran the second one.

We also tried limiting our transactions to 60 seconds, but that did not help either.

Yesterday we tried lowering the write concern, to rule out replication as the issue, and bulk inserting our records unordered instead of ordered, but neither showed any improvement.
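
Roughly, the unordered variant with the lowered write concern looked like this (the collection name and the docs batch are placeholders for our actual data):

// Unordered bulk insert with the write concern lowered to w:1
db.records.insertMany(docs, { ordered: false, writeConcern: { w: 1 } })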

Thanks again for pointing out the problem with the percentage of dirty pages; if you have any more insights, we would appreciate them very much. 🙂

Best,
Sasha

Hi @Sasa_Trifunovic,

I’ve seen problems like this with burstable instances and volumes in AWS. This is how things usually proceed:

  1. There is a sudden spike of incoming work, which is unusual for the cluster.
  2. Performance keeps up for some time, but the incoming work keeps coming in.
  3. The disk/CPU runs out of burst credits due to the sustained level of incoming work.
  4. AWS restricts the disk/CPU performance because the instance has run out of burst credits.
  5. The node stays at its “baseline” performance. At this point everything is stalling, the node looks underutilized, and very little of the work backlog can be processed, since the node’s performance is being restricted by AWS.
  6. The incoming work stops, the node slowly catches up and accumulates burst credits again, and eventually things go back to normal.
  7. A new batch of incoming work arrives, and we’re back at (1).

Since you’re using gp2 disks, I think you’re running into this burst credit issue. This is explained in the Amazon EBS volume types documentation, under gp2.

Note that in most cases this burst credit system is actually beneficial: it allows you to absorb a sudden spike in work without having to provision larger hardware for a once-in-a-while event. However, the idea is that the work is a spike, not a sustained load.

If you expect the normal state of the cluster to be consistently as busy as your load tests, then provisioned IOPS disks are probably a better choice than gp2 volumes.

Best regards,
Kevin

With this version of mongostat on Linux, how can I show the “dirty %” field?

mongostat --version
mongostat version: 100.5.0
git version: 460c7e26f65c4ce86a0b99c46a559dccaba3a07d
Go version: go1.16.3
os: linux
arch: amd64
compiler: gc

insert query update delete getmore command flushes mapped vsize res faults qrw arw net_in net_out conn time
*0 *0 1 1094 0 2|0 0 0B 1.54G 111M 0 0|0 0|0 2.55m 18.3k 16 Nov 2 19:05:57.139