EC2 CPU saturation 100%

We have a three-node replica set (1 primary, 2 secondaries) running on AWS EC2 instances (m5.xlarge).

Storage:

  • /
    • 8 GB (gp2, 100 IOPS), encrypted, 1.5 GB available
  • /data
    • 200 GB (gp2, 600 IOPS), encrypted, 106 GB available
  • /journal
    • 25 GB (io1, 200 IOPS), encrypted, 24 GB available
  • /log
    • 25 GB (io1, 250 IOPS), encrypted, 24 GB available
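
For reference, the IOPS figures on / and /data match the gp2 baseline formula (3 IOPS per GiB, with a floor of 100), which is why those volumes are read here as gp2. The sketch below reproduces that arithmetic and also estimates how long a full gp2 burst-credit bucket would last at the 3,000 IOPS burst rate. Whether burst credits are actually being exhausted on these volumes is an assumption that has not been verified; the function names are illustrative only.

```python
# Sketch of gp2 baseline/burst arithmetic. Assumes the two general-purpose
# volumes are gp2; whether burst credits play a role here is unverified.
GP2_IOPS_PER_GIB = 3           # baseline IOPS earned per GiB
GP2_MIN_IOPS = 100             # baseline floor for small volumes
GP2_MAX_IOPS = 16_000          # baseline ceiling
GP2_BURST_IOPS = 3_000         # burst rate for volumes with a lower baseline
GP2_CREDIT_BUCKET = 5_400_000  # I/O credits in a full bucket


def gp2_baseline_iops(size_gib: int) -> int:
    """Baseline IOPS for a gp2 volume of the given size."""
    return min(GP2_MAX_IOPS, max(GP2_MIN_IOPS, GP2_IOPS_PER_GIB * size_gib))


def gp2_burst_seconds(size_gib: int) -> float:
    """Seconds a full credit bucket lasts at a sustained 3,000 IOPS."""
    baseline = gp2_baseline_iops(size_gib)
    if baseline >= GP2_BURST_IOPS:
        return float("inf")  # large volumes do not depend on burst credits
    # Credits drain at (burst - baseline) per second while bursting.
    return GP2_CREDIT_BUCKET / (GP2_BURST_IOPS - baseline)


if __name__ == "__main__":
    for mount, size in (("/", 8), ("/data", 200)):
        print(mount, gp2_baseline_iops(size), round(gp2_burst_seconds(size)))
```

Under these assumptions, a 200 GB volume that is pushed at the full burst rate falls back to its 600 IOPS baseline after about 37 minutes, so the volume's burst balance is one thing worth checking around the time of a spike.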

Last week we started getting alerts from our monitoring system. On further investigation, we found that the EC2 instances are hitting 100% CPU saturation.

Most of the spike is CPU I/O wait. When this happens, our application starts returning 504s, because MongoDB is the main database that holds its data.
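
To make "mostly I/O wait" concrete, here is a minimal sketch of how the I/O-wait share can be computed from two samples of the aggregate `cpu` line in Linux's `/proc/stat`. This is not the monitoring system's actual method; the function names are hypothetical, and it assumes a Linux host where the fifth counter on that line is iowait.

```python
import time
from typing import List


def parse_cpu_line(line: str) -> List[int]:
    """Parse the aggregate 'cpu' line of /proc/stat into integer counters."""
    fields = line.split()
    if not fields or fields[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line from /proc/stat")
    return [int(value) for value in fields[1:]]


def iowait_percent(before: str, after: str) -> float:
    """Share of CPU time spent in iowait between two 'cpu' line samples."""
    first = parse_cpu_line(before)
    second = parse_cpu_line(after)
    deltas = [b - a for a, b in zip(first, second)]
    # Counters: user nice system idle iowait irq softirq steal [guest ...].
    # Guest time is already included in user/nice, so sum only the first 8.
    total = sum(deltas[:8])
    if total <= 0:
        return 0.0
    return 100.0 * deltas[4] / total


def read_cpu_line(path: str = "/proc/stat") -> str:
    """Read the first line of /proc/stat (Linux only)."""
    with open(path) as handle:
        return handle.readline()


if __name__ == "__main__":
    sample_a = read_cpu_line()
    time.sleep(5)
    sample_b = read_cpu_line()
    print(f"iowait: {iowait_percent(sample_a, sample_b):.1f}%")
```

A high value here means the CPUs are mostly idle and waiting on disk, which points at storage throughput rather than at compute.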

We are not seeing any MongoDB errors in the logs leading up to the issue, and it is very sporadic. It has happened on all three of our nodes.

During the issue we see heartbeat errors:

Error in heartbeat to XXX:27017, response status: NetworkInterfaceExceededTimeLimit: Remote command timed out while waiting to get a connection from the pool, took 12481ms, timeout was set to 10000ms
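
To line these errors up against the CPU spikes, the host and timings can be pulled out of each message. The sketch below is built only from the single message shown above, so the pattern is an assumption about the format and may need adjusting for other log lines; `parse_heartbeat_error` is a hypothetical helper.

```python
import re
from typing import Dict, Optional, Union

# Pattern derived from the one heartbeat message quoted above (assumption:
# other occurrences follow the same wording).
HEARTBEAT_ERROR = re.compile(
    r"Error in heartbeat to (?P<host>\S+), "
    r"response status: (?P<status>\w+): "
    r".*?took (?P<took_ms>\d+)ms, "
    r"timeout was set to (?P<timeout_ms>\d+)ms"
)


def parse_heartbeat_error(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract host, status, and timings from a heartbeat error line."""
    match = HEARTBEAT_ERROR.search(line)
    if match is None:
        return None
    took = int(match.group("took_ms"))
    timeout = int(match.group("timeout_ms"))
    return {
        "host": match.group("host"),
        "status": match.group("status"),
        "took_ms": took,
        "timeout_ms": timeout,
        "overrun_ms": took - timeout,  # how far past the timeout it ran
    }
```

For the message above, this reports that the heartbeat overran its 10000 ms timeout by 2481 ms while waiting for a pooled connection.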

Our networking team has looked into this and is not seeing any dropped packets or any problems with the replicas connecting to each other.

We have a script that watches the process and restarts it if it is not running. We have added a new condition that also restarts the process if CPU spikes to 100%.

Has anyone else experienced this issue? Are there any troubleshooting steps we can take to find the cause and fix it?

Are the storage classes and sizes above appropriate for MongoDB?