Issue in Replication

Dilip_D · March 23, 2023, 2:29pm

Hi Guys

We have MongoDB 3.6.4 running in a PSA (Primary-Secondary-Arbiter) architecture on CentOS. All three nodes run on separate physical servers. Today the secondary node suddenly crashed and became unavailable in the replica set. When I checked rs.status(), the secondary node was shown with the status "No route to host". When this happened, even though the primary node was up and running, my entire application became slow in processing transactions.

My application reads messages from RabbitMQ, and internally it has multiple modules that communicate via ActiveMQ. When the secondary node went down, my app's throughput dropped from 500 messages per second to 50, 60 or 100 messages per second, and at times it was completely idle. This resulted in queue pileups in both RabbitMQ and ActiveMQ.

Restarting all the application nodes, RabbitMQ and ActiveMQ did not help bring my application back to its normal throughput. After trying everything else, I removed the secondary node from the replica set on a hunch, and suddenly my app started processing 450, 480, 500 messages per second again.

My question is: how did the unavailability of the secondary node impact my application's performance even though the primary node was up, running and fully healthy? Today's behaviour was totally against my basic understanding of MongoDB replication.

Is there anything I should be looking at, or that I forgot to look at, so that this kind of issue doesn't happen in the future?

Aasawari · March 31, 2023, 4:11am

Hi @Dilip_D and welcome to the MongoDB community forum!!

Arbiters are useful because they allow a replica set to keep a primary when the secondary goes down. Although this deployment is supported, there are some caveats with regard to operations and maintenance.
The recommended approach here would be a PSS architecture, with no arbiters unless they are unavoidable for the deployment.

Also, the version you are using is quite old (released almost 6 years ago). I would recommend upgrading to the latest version to pick up the bug fixes and new features added since then.

In addition, the error No Route to Host suggests there could be a networking issue in the deployment.
Could you help me understand how the replica set was configured and how the secondaries were added?
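
If it helps with sharing that information, a small snippet along these lines can be pasted into the mongo shell. It is only a sketch: `summarizeMembers` is a hypothetical helper name, and it relies on the `name`, `stateStr`, `health` and `lastHeartbeatMessage` fields that `rs.status()` reports for each member.

```javascript
// Hypothetical helper: condense an rs.status() document into one line per member.
function summarizeMembers(status) {
    return status.members.map(function (m) {
        var line = m.name + " " + m.stateStr + " health=" + m.health;
        // lastHeartbeatMessage is typically empty while heartbeats succeed,
        // so it is appended only when it carries some text.
        if (m.lastHeartbeatMessage) {
            line += " (" + m.lastHeartbeatMessage + ")";
        }
        return line;
    });
}

// In the mongo shell:
//   summarizeMembers(rs.status()).forEach(function (l) { print(l); });
//   printjson(rs.conf().members);   // shows how each member was configured
```

The output of both commands, taken while the secondary is unreachable, would show the host names the members use to reach each other and which member reports the error.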

As mentioned in the MongoDB release notes documentation:

Starting in MongoDB 3.6, MongoDB enables support for “majority” read concern by default.

What we suspect in your case is that when the secondary node goes down, the majority commit point (the latest version of the data known to be replicated to a majority of the voting members, which in a PSA set means both data-bearing nodes) cannot move forward because the secondary is unavailable. Consequently, the primary needs to keep old versions of the data for as long as the secondary stays offline. This leads to a cache-full scenario, where WiredTiger spills its cache content to disk in the form of the WiredTigerLAS.wt file in the dbpath.
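
One way to check whether the commit point is stalled is to compare two of the optimes that `rs.status()` reports on the primary. The following is a rough sketch rather than an official diagnostic: `majorityCommitLagSeconds` is a hypothetical helper name, and it assumes replication protocol version 1 (where each optime is a document with a `ts` field) and that `ts` exposes its seconds component as `.t`, as Timestamp values do in the legacy mongo shell.

```javascript
// Hypothetical helper: seconds between the newest write applied on this node
// and the majority commit point, both taken from rs.status().optimes.
// Assumption: each ts exposes its seconds component as `.t`.
function majorityCommitLagSeconds(status) {
    var applied = status.optimes.appliedOpTime.ts.t;
    var committed = status.optimes.lastCommittedOpTime.ts.t;
    return applied - committed;
}

// In the mongo shell, on the primary:
//   print(majorityCommitLagSeconds(rs.status()));
```

A value that keeps growing while the secondary is offline would be consistent with the explanation above; a value that stays near zero would point elsewhere.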

Can you confirm whether you see a large WiredTigerLAS.wt file? Also, can you try disabling the majority read concern (for example, by starting mongod with --enableMajorityReadConcern false) and check whether the issue still occurs?
The server ticket here mentions similar behaviour in past releases.
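
Alongside the file size, the WiredTiger cache statistics in `db.serverStatus()` can indicate whether the cache is under pressure. The snippet below is a minimal sketch under that assumption: `cacheUsage` is a hypothetical helper name, and it reads the `bytes currently in the cache`, `maximum bytes configured` and `tracked dirty bytes in the cache` counters from the `wiredTiger.cache` section.

```javascript
// Hypothetical helper: summarise WiredTiger cache usage from a db.serverStatus() document.
function cacheUsage(serverStatus) {
    var cache = serverStatus.wiredTiger.cache;
    var used = cache["bytes currently in the cache"];
    var max = cache["maximum bytes configured"];
    var dirty = cache["tracked dirty bytes in the cache"];
    return {
        // Both values are percentages of the configured cache size, rounded to whole numbers.
        usedPercent: Math.round(100 * used / max),
        dirtyPercent: Math.round(100 * dirty / max)
    };
}

// In the mongo shell, on the primary:
//   printjson(cacheUsage(db.serverStatus()));
```

Comparing these numbers while the secondary is down and again after it rejoins (or after it has been removed) should show whether cache usage tracks the slowdown.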

Let us know if you have any other concerns.

Best regards
Aasawari