MongoDB 4.2.3 server unresponsive after high IOWait%

Team,

We have an issue with one of our lower environments, running the community edition of MongoDB 4.2.3. The server crashes when %iowait increases, and there are CPU spikes as well. The log file shows every insert taking roughly 30,000 to 40,000 ms, and a count query using an index scan (IXSCAN) taking roughly 20,000 ms; at other times the same write and read operations complete in 0 ms. During the %iowait spikes we also see a "serverStatus was very slow" warning. Memory and CPU utilization stay low, but %iowait keeps climbing until the server becomes unresponsive. We have discussed optimizing the read and write operations with the dev team, but I am not sure whether this is a disk issue, slow queries, or a case for adding RAM.

Following are some iostat snapshots.

%iowait: 25      mongod: %cpu: 0.3 | %mem: 2.2
%iowait: 12      mongod: %cpu: 0.3 | %mem: 2.2
%iowait: 37.48   mongod: %cpu: 0.3 | %mem: 2.2
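Snapshots like these can be reproduced with the standard sysstat tools; a minimal sketch, assuming the process is named mongod:

# System-wide CPU breakdown, including %iowait, sampled every 2 seconds
iostat -c 2

# Per-process CPU and memory for mongod (pidstat ships with sysstat)
pidstat -u -r -p "$(pgrep -x mongod)" 2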

Insert query:

2020-06-25T12:13:39.322+0000 I  COMMAND  [conn64] command ns.collecitionname command: insert { insert: "collectionname", ordered: true, $db: "xxx" } ninserted:1 keysInserted:19 numYields:0 reslen:45 locks:{ ParallelBatchWriterMode: { acquireCount: { r: 1 } }, ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { w: 1 } }, Mutex: { acquireCount: { r: 2 } } } flowControl:{ acquireCount: 1 } storage:{ data: { bytesRead: 15367, timeReadingMicros: 81871983 } } protocol:op_msg 81872ms
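The field worth watching here is storage: { data: { timeReadingMicros } } against the total duration at the end of the line. A rough sketch for pulling both out of the log (the log path is an assumption; substitute your systemLog.path):

awk '/timeReadingMicros/ {
    for (i = 1; i <= NF; i++)
        if ($i == "timeReadingMicros:") micros = $(i + 1) + 0
    # $1 is the timestamp, $NF the trailing total duration such as 81872ms
    printf "%s  read=%dms  total=%s\n", $1, micros / 1000, $NF
}' /var/log/mongodb/mongod.log

For the insert above this prints read=81871ms total=81872ms, i.e. nearly the whole operation was spent reading from disk.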

Read query:

2020-06-24T14:02:52.424+0000 I  COMMAND  [conn25] command ns.collectionanme command: count { count: "collectionanme", query: { header.eventId: "da03d290-32ca-45ce-a3fb-0262b0ad96f2", _class: { $in: [ "com.charter.serviceactivation.milestone.model.Event" ] } }, limit: 1, $db: "MileStones" } planSummary: IXSCAN { header.eventId: 1 } keysExamined:0 docsExamined:0 numYields:0 queryHash:4DDDD3A7 planCacheKey:51072020 reslen:45 locks:{ ReplicationStateTransition: { acquireCount: { w: 1 } }, Global: { acquireCount: { r: 1 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } }, Mutex: { acquireCount: { r: 1 } } } storage:{ data: { bytesRead: 14345, timeReadingMicros: 30327760 } } protocol:op_msg 30327ms

2020-06-25T16:00:18.740+0000 I  COMMAND  [ftdc] serverStatus was very slow: { after basic: 0, after asserts: 0, after connections: 0, after electionMetrics: 0, after extra_info: 0, after flowControl: 0, after globalLock: 0, after locks: 0, after logicalSessionRecordCache: 0, after network: 0, after opLatencies: 0, after opReadConcernCounters: 0, after opcounters: 0, after opcountersRepl: 0, after oplogTruncation: 0, after repl: 0, after security: 0, after storageEngine: 0, after tcmalloc: 0, after trafficRecording: 0, after transactions: 0, after transportSecurity: 0, after twoPhaseCommitCoordinator: 0, after wiredTiger: 0, at end: 74739 } 

Any help on this issue would be highly appreciated.

It certainly looks like you have a bottleneck on your storage; this could be due to a failing disk or under-provisioned IOPS. Your own log lines show it: the insert reports timeReadingMicros: 81871983 (about 81.9 s) out of an 81872 ms total, and the count query 30327760 µs out of 30327 ms, so both operations spent essentially their entire runtime waiting on disk reads. The count even examined 0 keys and 0 documents; the time went into pulling index pages from disk into the cache. The serverStatus warning fits the same picture, with the whole 74739 ms accrued "at end".
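One way to confirm this live is to watch the extended device statistics while the stalls happen; a minimal sketch using sysstat's iostat (your device names will differ):

# -x: extended per-device stats, -t: print timestamps, sampled every 2 seconds
# Watch await (average I/O latency in ms) and %util: await in the hundreds
# of ms, or %util pinned near 100 while CPU stays low, points at the disk
iostat -xt 2

If this is a cloud volume, also compare the observed r/s + w/s against the IOPS the volume is provisioned for.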

How is this deployed? Bare metal, virtual machine, cloud provider, etc.?
What storage are you using? Local disk, network storage, SSD, spinning disk, any RAID involved, etc.?
Have you checked for failed or failing disks? A couple of standard checks are sketched below.
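The device name /dev/sda below is a placeholder, and smartctl comes from the smartmontools package:

# Kernel log usually shows errors from a failing disk first
dmesg -T | grep -iE 'i/o error|blk_update_request'

# Overall SMART health verdict for the drive
sudo smartctl -H /dev/sda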