Change streams with PSA configuration

I have a Primary-Secondary-Arbiter (PSA) replica set with an application consuming a Change Stream. Change stream events are not delivered to the application when the Secondary node is down. The application can read from and write to the Primary normally, but change stream events are not delivered until the Secondary node is restored. Is there any way to configure the PSA cluster to deliver change stream events when only the Primary is active?

This is not normal behaviour if your Change Stream is connected to the replica set.

What I suspect is that you are connecting directly to the secondary for the Change Stream rather than the replica set.
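For comparison (hostnames and the replica set name `rs0` here are placeholders), a replica-set connection string lets the driver follow the current primary after a failover, while a direct connection pins the change stream to a single member:

```
# Replica-set connection: the driver discovers the current primary,
# so a change stream opened here survives a failover
mongodb://host1:27017,host2:27017/?replicaSet=rs0

# Direct connection: tied to one member; a change stream opened
# here stops delivering events if that member becomes unavailable
mongodb://host2:27017/?directConnection=true
```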

This is expected with PSA: with the secondary down, writes can no longer be majority-committed, and change streams only return majority-committed events.

You can mitigate this as outlined in ‘Mitigate Performance Issues with PSA Replica Set’:


Thanks Chris.

To be clear: is the following manual reconfiguration of the offline secondary node required to keep the change stream going?

cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg);
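A minimal mongosh sketch of that sequence, with the offline member’s array index looked up by hostname (“host2:27017” is a placeholder for the down secondary):

```javascript
// mongosh: demote the offline secondary so the remaining
// primary + arbiter can majority-commit again.
cfg = rs.conf();
const idx = cfg.members.findIndex((m) => m.host === "host2:27017");
cfg.members[idx].votes = 0;
cfg.members[idx].priority = 0;
rs.reconfig(cfg);
```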

This is similar to how you would implement a manual failover for a two-node Primary-Secondary (no arbiter) replicaSet, correct?

Also, is there any server- or client-side configuration, such as enableMajorityReadConcern (from older MongoDB releases), that would allow the change stream listener to recover from this condition automatically?

Looks good to me and agrees with the document.

I just learned something that I should have known.

Thanks


Thank you Chris.

Just to be clear on one more point: there is no server- or client-side configuration, such as enableMajorityReadConcern (from older MongoDB releases), that would allow the change stream listener to recover from this condition automatically?

Best Regards,
Matt

There are times when the documented ‘Mitigate Performance Issues with PSA Replica Set’ sequence referenced above hangs indefinitely. The sequence for removing an offline secondary is listed below for reference:

cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg);

rs.reconfig() hangs indefinitely in the following sequence of events:

  1. node1 is primary
  2. node1 goes down / loses power
  3. node2 becomes primary after new PSA election (rs.hello().isWritablePrimary == true)
  4. execute the rs.reconfig() sequence above to remove the offline member node1

I have to Ctrl-C and rerun rs.reconfig(cfg, {force: true}). I would rather not force the reconfig.

Is there something missing from the sequence for the scenario presented above?

Yep.

I think that documented example is framed around the lagging-secondary scenario, where the reconfig would eventually succeed.

Is there a procedure that avoids the use of force: true?

While the other data-bearing member is down? No.
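In that situation, the only remaining option is the forced variant already shown above. A hedged mongosh sketch, keeping the same <array_index> placeholder:

```javascript
// mongosh: force the reconfig when the new configuration cannot be
// acknowledged by a majority of voting members. Use with care:
// a forced reconfig can cause a rollback if the down member returns
// with writes the surviving members never replicated.
cfg = rs.conf();
cfg["members"][<array_index>]["votes"] = 0;
cfg["members"][<array_index>]["priority"] = 0;
rs.reconfig(cfg, { force: true });
```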

The documentation states the following to remove an unavailable or lagging data-bearing node:

To reduce the cache pressure and increased write traffic, set votes: 0 and priority: 0 for the node that is unavailable or lagging.

However, I’m observing the following:

  • node1: primary, node2: secondary; node2 goes down: the documented procedure removes the unavailable/lagging node as described.
  • node1: primary, node2: secondary; node1 goes down; node2 becomes primary: the documented procedure does not work without ‘force’, which the documentation does not mention.

So the documented recovery procedure works in only half of the cases where you lose a data-bearing node.
