Change streams with PSA configuration

I have a Primary-Secondary-Arbiter (PSA) replica set with an application consuming a Change Stream. Change stream events are not delivered to the application when the Secondary node is down. The application can read from and write to the Primary normally, but change stream events are not delivered until the Secondary node is restored. Is there any way to configure the PSA cluster to deliver change stream events when only the Primary is active?

This is not normal behaviour if your Change Stream is connected to the replica set.

What I suspect is that you are connecting directly to the secondary for the Change Stream rather than the replica set.
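For comparison (hostnames and the replica set name `rs0` here are placeholders), a replica-set connection string lets the driver follow the current primary after a failover, while a direct connection pins the change stream to a single member:

```
# Replica-set connection: the driver discovers the current primary,
# so a change stream opened here survives a failover
mongodb://host1:27017,host2:27017/?replicaSet=rs0

# Direct connection: tied to one member; a change stream opened
# here stops delivering events if that member becomes unavailable
mongodb://host2:27017/?directConnection=true
```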

This is expected with PSA: with the secondary down, writes can no longer be majority-committed, and change streams only return majority-committed events.

You can mitigate this as outlined in ‘Mitigate Performance Issues with PSA Replica Set’:


Thanks Chris.

To be clear: is the following manual reconfiguration of the offline secondary node required to keep the change stream going?

cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg);
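A minimal mongosh sketch of that sequence, with the offline member’s array index looked up by hostname (“host2:27017” is a placeholder for the down secondary):

```javascript
// mongosh: demote the offline secondary so the remaining
// primary + arbiter can majority-commit again.
cfg = rs.conf();
const idx = cfg.members.findIndex((m) => m.host === "host2:27017");
cfg.members[idx].votes = 0;
cfg.members[idx].priority = 0;
rs.reconfig(cfg);
```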

This is similar to how you would implement a manual failover for a two-node Primary-Secondary (no arbiter) replicaSet, correct?

Also, is there any server- or client-side configuration, such as enableMajorityReadConcern (from older MongoDB releases), that would allow the change stream listener to recover from this condition automatically?

Looks good to me and agrees with the document.

I just learned something that I should have known.

Thanks


Thank you Chris.

Just to be clear on one more point: there is no server- or client-side configuration, such as enableMajorityReadConcern (from older MongoDB releases), that would allow the change stream listener to recover from this condition automatically?

Best Regards,
Matt

There are times when the documented ‘Mitigate Performance Issues with PSA Replica Set’ sequence referenced above hangs indefinitely. The sequence for removing an offline secondary is listed below for reference:

cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg);

rs.reconfig() hangs indefinitely in the following sequence of events:

  1. node1 is primary
  2. node1 goes down / loses power
  3. node2 becomes primary after new PSA election (rs.hello().isWritablePrimary == true)
  4. execute the rs.reconfig() sequence above to remove the offline member node1

I have to Ctrl-C and rerun rs.reconfig(cfg, {force: true}). I would rather not force the reconfig.

Is there something missing from the sequence for the scenario presented above?

Yep.

I think that documented example is framed around the lagging-secondary scenario, where the reconfig would eventually succeed.

Is there a procedure that avoids the use of force: true?

While the other data-bearing member is down? No.
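In that situation, the only remaining option is the forced variant already shown above. A hedged mongosh sketch, keeping the same <array_index> placeholder:

```javascript
// mongosh: force the reconfig when the new configuration cannot be
// acknowledged by a majority of voting members. Use with care:
// a forced reconfig can cause a rollback if the down member returns
// with writes the surviving members never replicated.
cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg, { force: true });
```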

The documentation states the following to remove an unavailable or lagging data-bearing node:

To reduce the cache pressure and increased write traffic, set votes: 0 and priority: 0 for the node that is unavailable or lagging.

However, I’m observing the following:

  • node1: primary, node2: secondary; node2 goes down: the documented procedure removes the unavailable/lagging node as described.
  • node1: primary, node2: secondary; node1 goes down; node2 becomes primary: the documented procedure does not work without ‘force’, which the documentation does not mention.

So the documented recovery procedure works in only half of the cases where you lose a data-bearing node.
