Setup:
We are running MongoDB 4.2.13 as a replica set: a primary and two replicas. Each server has 4 CPUs and 16 GB of RAM (m5.xlarge instance with gp2 disks) and is dedicated to Mongo. The primary is used mostly for writes, while our reads are mainly served from the replicas.
We are running Mongo with the default config, except that transactionLifetimeLimitSeconds is set to 900.
Problem:
During load tests, we regularly run into situations where the primary gets stuck. The load average rises to ~9, yet judging by mongotop and mongostat, Mongo isn't performing any significant db operations at that time.
We couldn't find any hint even with the profiler turned on (profiling level 1, logging operations slower than 40 ms).
The currentOp output didn't reveal any obvious abnormalities either, and neither did the mongo log.
Slice of our mongostat and mongotop output at the time of very high CPU load:
Load average during that time:
13:24:00 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35
13:24:01 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35
13:24:02 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35
13:24:03 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35
13:24:04 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35
13:24:05 up 13 days, 23:04, 3 users, load average: 10.59, 9.42, 7.36
13:24:06 up 13 days, 23:04, 3 users, load average: 10.59, 9.42, 7.36
13:24:07 up 13 days, 23:04, 3 users, load average: 10.59, 9.42, 7.36
13:24:08 up 13 days, 23:04, 3 users, load average: 10.59, 9.42, 7.36
13:24:09 up 13 days, 23:04, 3 users, load average: 10.59, 9.42, 7.36
13:24:10 up 13 days, 23:04, 3 users, load average: 10.54, 9.43, 7.37
13:24:11 up 13 days, 23:04, 3 users, load average: 10.54, 9.43, 7.37
13:24:12 up 13 days, 23:04, 3 users, load average: 10.54, 9.43, 7.37
13:24:13 up 13 days, 23:04, 3 users, load average: 10.54, 9.43, 7.37
13:24:14 up 13 days, 23:04, 3 users, load average: 10.54, 9.43, 7.37
13:24:15 up 13 days, 23:04, 3 users, load average: 9.86, 9.31, 7.34
13:24:16 up 13 days, 23:04, 3 users, load average: 9.86, 9.31, 7.34
13:24:17 up 13 days, 23:04, 3 users, load average: 9.86, 9.31, 7.34
13:24:18 up 13 days, 23:04, 3 users, load average: 9.86, 9.31, 7.34
13:24:19 up 13 days, 23:04, 3 users, load average: 9.86, 9.31, 7.34
13:24:20 up 13 days, 23:04, 3 users, load average: 9.87, 9.32, 7.36
13:24:21 up 13 days, 23:04, 3 users, load average: 9.87, 9.32, 7.36
13:24:22 up 13 days, 23:04, 3 users, load average: 9.87, 9.32, 7.36
13:24:23 up 13 days, 23:04, 3 users, load average: 9.87, 9.32, 7.36
13:24:24 up 13 days, 23:04, 3 users, load average: 9.87, 9.32, 7.36
13:24:25 up 13 days, 23:04, 3 users, load average: 9.96, 9.35, 7.38
13:24:26 up 13 days, 23:04, 3 users, load average: 9.96, 9.35, 7.38
13:24:27 up 13 days, 23:04, 3 users, load average: 9.96, 9.35, 7.38
13:24:28 up 13 days, 23:04, 3 users, load average: 9.96, 9.35, 7.38
13:24:29 up 13 days, 23:04, 3 users, load average: 9.96, 9.35, 7.38
13:24:30 up 13 days, 23:04, 3 users, load average: 10.68, 9.51, 7.44
13:24:31 up 13 days, 23:04, 3 users, load average: 10.68, 9.51, 7.44
13:24:32 up 13 days, 23:04, 3 users, load average: 10.68, 9.51, 7.44
13:24:33 up 13 days, 23:04, 3 users, load average: 10.68, 9.51, 7.44
13:24:34 up 13 days, 23:04, 3 users, load average: 10.68, 9.51, 7.44
13:24:35 up 13 days, 23:04, 3 users, load average: 10.47, 9.48, 7.44
13:24:36 up 13 days, 23:04, 3 users, load average: 10.47, 9.48, 7.44
13:24:37 up 13 days, 23:04, 3 users, load average: 10.47, 9.48, 7.44
13:24:38 up 13 days, 23:04, 3 users, load average: 10.47, 9.48, 7.44
13:24:39 up 13 days, 23:04, 3 users, load average: 10.47, 9.48, 7.44
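To put these numbers in relation to the hardware, here is a minimal sketch that parses one of the uptime lines above and divides the load averages by the CPU count. The helper names (`parse_load`, `per_cpu`) are made up for illustration, and the CPU count of 4 is taken from the setup described above. Note that on Linux the load average also counts tasks in uninterruptible (typically I/O) wait, so this ratio alone does not tell CPU pressure apart from disk pressure.

```python
import re

# Assumption: 4 CPUs per host, as stated in the setup (m5.xlarge).
CPU_COUNT = 4

# Matches the "load average: a, b, c" tail of an uptime line.
LOAD_RE = re.compile(r"load average:\s*([\d.]+),\s*([\d.]+),\s*([\d.]+)")


def parse_load(line):
    """Return the (1 min, 5 min, 15 min) load averages from one uptime line, or None."""
    match = LOAD_RE.search(line)
    if match is None:
        return None
    return tuple(float(value) for value in match.groups())


def per_cpu(load, cpus=CPU_COUNT):
    """Divide each load average by the number of CPUs."""
    return tuple(round(value / cpus, 2) for value in load)


sample = "13:24:00 up 13 days, 23:04, 3 users, load average: 10.73, 9.43, 7.35"
load = parse_load(sample)
print("load averages:", load)
print("per CPU:", per_cpu(load))
```

A per-CPU value well above 1 means more tasks were runnable or blocked than the 4 CPUs could service during that window.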
If needed, I can provide anything you find relevant from db.serverStatus() for that time period.
Any clue in determining the cause of this issue would be very much appreciated.