Problems with flowControl

Hello everyone!

Recently, we’ve been experiencing issues with one of our MongoDB instances that do not occur in others where we have similar tasks.

In all cases, we have a PSA (Primary - Secondary - Arbiter) architecture, running MongoDB 6.0.

We have common administrative tasks, such as backups, where we use fsyncLock, and other operations where the read node is temporarily shut down (for about 5-10 minutes).

During this period, flow control kicks in and starts delaying write requests—queries that usually take 50ms begin taking at least 700ms, and in some cases, exceed 210,000ms.

We understand how flow control works, but in our case, it doesn’t seem to make sense since everything is fine in the network, and the node is simply down.

In our tests, the impact of flow control is quite evident. Disabling flow control for routine tasks seems like a viable option for us, but we could face other situations, such as a cloud failure or another incident, in which a read node becomes completely unavailable. Losing that read node is not a problem for us in itself, but the resulting degradation of writes on the primary is a concern.

We would like to hear more opinions—does it make sense to disable flow control, or is there a safer option we are unaware of? Adding more read replicas does not solve the flow control issue.

Flow control is a protection mechanism, so disabling a safeguard seems risky to us.

We appreciate your support.

IMHO, instead of completely disabling flow control, I’d first try tuning it (flowControlTargetLagSeconds or flowControlThresholdLagPercentage) and see if that balances things out. Turning it off completely is riskier, but it might work for your use case if you can tolerate the consequences: if your secondary falls behind significantly for unexpected reasons (e.g., network issues, cloud provider hiccups), writes on the primary will continue at full speed, and if the secondary can’t catch up, it could drop out of sync.
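
For reference, a minimal sketch of how these settings could be changed at runtime from mongosh. The helper names are hypothetical, and the value 20 is only illustrative (the documented default for flowControlTargetLagSeconds is 10). It assumes `db` is the shell's database handle, connected to the primary with enough privileges to run setParameter.

```javascript
// Hypothetical helpers; `db` is the mongosh database handle on the primary.

// Raise the majority-commit lag (in seconds) that flow control tries to stay under.
function setFlowControlTargetLag(db, seconds) {
  return db.adminCommand({ setParameter: 1, flowControlTargetLagSeconds: seconds });
}

// Turn flow control on or off without restarting mongod.
function setFlowControlEnabled(db, enabled) {
  return db.adminCommand({ setParameter: 1, enableFlowControl: enabled });
}

// Example use in mongosh (20 is an illustrative value, not a recommendation):
//   setFlowControlTargetLag(db, 20)
//   setFlowControlEnabled(db, false)   // e.g. just before planned maintenance
//   setFlowControlEnabled(db, true)    // and back on afterwards
```

Note that if the only secondary is fully down rather than just slow, the lag will probably keep growing for as long as it stays down, so a higher target may only postpone the throttling rather than avoid it.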

Hello Michael, thanks for your response!

Your answer seems very good to me. At the moment, we don’t have any replication or network lag, so that part is fine.

However, I ran a test today, and from what I understand, Flow Control acts based on the write majority count, which is reported as rs.status().writeMajorityCount.

In a PSA architecture, the writeMajorityCount would be 2, and since one of the nodes is down, Flow Control inevitably kicks in.
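
A rough sketch of how this could be checked on a live node, assuming mongosh and the fields reported by replSetGetStatus and serverStatus. The helper name `flowControlSnapshot` is hypothetical; it only reads status and changes nothing.

```javascript
// Hypothetical helper; `db` is the mongosh database handle.
function flowControlSnapshot(db) {
  const repl = db.adminCommand({ replSetGetStatus: 1 });
  const fc = db.adminCommand({ serverStatus: 1 }).flowControl;
  return {
    // Data-bearing voting members needed to satisfy w: "majority".
    writeMajorityCount: repl.writeMajorityCount,
    // Members currently healthy and in PRIMARY or SECONDARY state (arbiters excluded).
    healthyDataBearing: repl.members.filter(
      (m) => m.health === 1 && (m.stateStr === "PRIMARY" || m.stateStr === "SECONDARY")
    ).length,
    flowControlEnabled: fc.enabled,
    // True while the primary considers the majority commit point lagged.
    isLagged: fc.isLagged,
    targetRateLimit: fc.targetRateLimit,
  };
}

// Example use in mongosh:
//   flowControlSnapshot(db)
```

If healthyDataBearing is below writeMajorityCount while isLagged is true, that would be consistent with the behaviour described here.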

So, based on my test, if our architecture were PSS (Primary - Secondary - Secondary), even with one node offline, Flow Control would not act, correct?

And in the case of a PSA architecture, we would need two more read replicas:

  • Primary
  • Secondary
  • Arbiter
  • Secondary
  • Secondary

So the writeMajorityCount would be 3, preventing Flow Control from activating when one node is down, correct?
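
The arithmetic behind these counts can be sketched as follows. This assumes the rule given in the MongoDB documentation: the majority for write concern "majority" is the smaller of (a) the majority of all voting members, arbiters included, and (b) the number of data-bearing voting members. The helper names and the one-letter topology notation (P, S, A) are just for illustration, and every member is assumed to have one vote.

```javascript
// Build a member list from a string such as "PSA" (P/S = data-bearing, A = arbiter).
function topology(letters) {
  return [...letters].map((c) => ({ votes: 1, arbiterOnly: c === "A" }));
}

// Smaller of: majority of voting members, and number of data-bearing voting members.
function writeMajorityCount(members) {
  const voting = members.filter((m) => m.votes > 0);
  const votingMajority = Math.floor(voting.length / 2) + 1;
  const dataBearingVoting = voting.filter((m) => !m.arbiterOnly).length;
  return Math.min(votingMajority, dataBearingVoting);
}

// Can majority writes still be acknowledged with one data-bearing member down?
function majorityHoldsWithOneDataNodeDown(members) {
  const dataBearingVoting = members.filter((m) => m.votes > 0 && !m.arbiterOnly).length;
  return dataBearingVoting - 1 >= writeMajorityCount(members);
}

for (const t of ["PSA", "PSS", "PSASS"]) {
  const members = topology(t);
  console.log(t, writeMajorityCount(members), majorityHoldsWithOneDataNodeDown(members));
}
// PSA 2 false
// PSS 2 true
// PSASS 3 true
```

Under that assumption, PSA cannot keep the majority commit point moving with its only secondary down, while PSS and the five-member layout above can.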