Performance issues on MongoDB 4.2, 4.4 and 5.0

Hello,

We have 10 shards, each a 3-member replica set (primary + secondary + hidden secondary), hosted on GCP instances. The first shard is on SSD, the other 9 on standard (non-SSD) disks.

We had been running mongod 4.0 for quite some time without any issues, but then decided to upgrade to MongoDB 5.0. Ever since, we have had big performance issues.

Randomly, but quite often, a shard will slow down and the slow queries logged on that instance will climb into the thousands for 5, 10, or 30 minutes, sometimes a full hour.

We managed to establish a relation between these slowdowns and the balancer. The slow periods usually correspond to a chunk being moved from or to the affected shard. These migrations usually take a few seconds, but sometimes last for 40 minutes.
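For anyone trying to reproduce our checks: we matched the slow-query windows against the chunk migrations recorded in the config database's changelog. A minimal sketch, run from mongosh against a mongos (the 20-entry limit is arbitrary):

```javascript
// List the most recent chunk migration events recorded by the config
// servers, to compare their timestamps with the slow-query windows.
db.getSiblingDB("config").changelog
  .find({ what: /moveChunk/ })
  .sort({ time: -1 })
  .limit(20)
  .forEach(e => printjson({ time: e.time, what: e.what, ns: e.ns, server: e.server }));
```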

When we stopped the balancer, the issue no longer occurred.
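For reference, we toggle the balancer with the standard sh helpers from mongosh; constraining it to an off-peak window is something we have considered but not tried (the window times below are placeholders):

```javascript
sh.getBalancerState();   // is the balancer enabled at all?
sh.isBalancerRunning();  // is a balancing round in progress right now?
sh.stopBalancer();       // what we do today to stop the slowdowns
// sh.startBalancer();   // re-enable once the incident is over

// Possible middle ground: only allow balancing during an off-peak window
// (times are interpreted in the config servers' local time).
db.getSiblingDB("config").settings.updateOne(
  { _id: "balancer" },
  { $set: { activeWindow: { start: "02:00", stop: "06:00" } } },
  { upsert: true }
);
```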

But as that is not a long-term solution, we decided to roll back to MongoDB 4.4, only to end up with the same behavior.

We then rolled back to MongoDB 4.2 and noticed an improvement: we still had some slow queries, but not as often and not for as long. However, last week we had nearly a full day of slow queries on all shards, and once again, stopping the balancer stopped the issue.

We also noticed an increase in log messages about WiredTiger checkpoints, with checkpoint times climbing to 200-250 s.
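We pulled those timings out of serverStatus; a sketch of what we sample on the struggling mongod (these statistic names are what we see under WiredTiger's transaction section on 4.2 and may differ slightly between versions):

```javascript
const wt = db.serverStatus().wiredTiger.transaction;
printjson({
  checkpoints: wt["transaction checkpoints"],
  mostRecentMs: wt["transaction checkpoint most recent time (msecs)"],
  maxMs: wt["transaction checkpoint max time (msecs)"],
  running: wt["transaction checkpoint currently running"]
});
```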

But we’re now kind of stuck and don’t know what to do or what to monitor.
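During the next incident we plan to capture the long-running operations on the affected shard; a minimal sketch using the standard currentOp helper (the 5-second threshold is arbitrary):

```javascript
// On the slow shard's primary: dump operations running for more than 5 s.
db.currentOp({ active: true, secs_running: { $gte: 5 } })
  .inprog.forEach(op => printjson({
    opid: op.opid,
    secs: op.secs_running,
    ns: op.ns,
    desc: op.desc,
    waitingForLock: op.waitingForLock
  }));
```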

In parallel with all these upgrades/downgrades, we also tried upgrading and downgrading the Java driver: we had upgraded it from 3.9 to 4.5, then noticed that the 4.3+ versions of the driver had known performance issues, so we rolled back to driver version 4.2, but with no luck.

Our only option at the moment is to roll back to 4.0 to see if we can get back to our initial calm state, but because of the 4.4 upgrade, a downgrade to 4.0 does not seem possible unless we rebuild all our nodes from scratch…
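As we understand it, the blocker is the feature compatibility version: once FCV has been raised past 4.0, the data files are no longer 4.0-compatible. This is how we check it on each shard primary (standard getParameter command):

```javascript
db.adminCommand({ getParameter: 1, featureCompatibilityVersion: 1 });
// Binary downgrades require lowering FCV first, e.g.:
// db.adminCommand({ setFeatureCompatibilityVersion: "4.0" });
```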

Is there anything we can provide to help find and fix the issue?

Random notes:

  • there is no increase in the number of queries during the incidents
  • some of our software uses a read preference of secondaryPreferred

One other thing we noticed (though it might be a consequence rather than a cause): as far as we can see, every time a node struggles, there is an increase in the number of active processes and TCP connections (see the sketch below).
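A sketch of what we sample to track that, using the standard serverStatus sections (run periodically on the affected node):

```javascript
const s = db.serverStatus();
printjson({
  connections: s.connections,                 // current / available / totalCreated / active
  activeClients: s.globalLock.activeClients,  // client threads doing reads/writes
  currentQueue: s.globalLock.currentQueue     // operations queued waiting for a lock
});
```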