MongoDB Replica Lag Doesn't Catch Up

Hi,
We are running MongoDB Community 4.0.12. We are trying to add a new secondary node to an existing 3-node cluster with 1 primary and 2 secondaries. Our DB size is 4 TB.

Steps we took to add the new secondary node:

  1. Took a data drive snapshot of one of the existing secondaries.
  2. Attached it to the newly created secondary.
  3. Added the node to the replica set from the primary using rs.add().

Initially, the lag after adding the new node was 0.76 hours (and the new node's state was Secondary), but the lag keeps growing until it reaches 15 hours, and the node finally changes from Secondary to Recovering. The oplog size on all nodes is 150 GB, which is enough to hold 3.7 hours of data. We are running these servers on EC2 machines with io2 storage and 50K PIOPS, in different Availability Zones.

We have tried this 4 times with no luck. We do not understand why the secondary node cannot catch up from a lag of less than an hour. Is there a better way to create and attach a new secondary to an existing cluster? Is there a replication setting or any other parameter that could help fix this?
Thanks in advance.

Welcome to The MongoDB Community Forums @Prasun_Pandey ! :wave:

I think the most likely cause is that the secondary cannot keep up with the write workload of the primary node. The default write concern of w:1 requires that only the primary replica set member acknowledge the write before returning write concern acknowledgment, which means that it's up to the secondary to keep up with the primary. If it cannot, it will get left behind and eventually fall off the oplog.
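To put rough numbers on "falling off the oplog", here is a minimal Python sketch using the figures from the question (150 GB oplog, 3.7-hour window, 0.76 hours of initial lag). The function names and the `apply_ratio` parameter are hypothetical, and the model is a deliberate simplification: it assumes a constant write rate on the primary and a constant apply rate on the secondary.

```python
def oplog_churn_gb_per_hour(oplog_size_gb, window_hours):
    """Average rate at which the primary writes oplog data."""
    return oplog_size_gb / window_hours


def hours_until_stale(initial_lag_hours, window_hours, apply_ratio):
    """Estimate how long until the secondary's lag exceeds the oplog window.

    apply_ratio is a hypothetical knob: the secondary's apply rate divided
    by the primary's write rate. At 1.0 or above the lag does not grow.
    """
    if apply_ratio >= 1.0:
        return None  # the secondary keeps pace, so it never falls off
    if initial_lag_hours >= window_hours:
        return 0.0  # already outside the window
    # Each wall-clock hour, the lag grows by (1 - apply_ratio) hours.
    return (window_hours - initial_lag_hours) / (1.0 - apply_ratio)


if __name__ == "__main__":
    churn = oplog_churn_gb_per_hour(150, 3.7)
    print(f"oplog churn: about {churn:.1f} GB/hour")
    for ratio in (0.5, 0.8, 0.95):
        hours = hours_until_stale(0.76, 3.7, ratio)
        print(f"apply ratio {ratio}: outside the window after ~{hours:.1f} h")
```

The window itself stretches and shrinks with the write rate, so treat the result as a rough guide rather than a prediction.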

You can check whether your secondary has the same hardware specification as the primary. It should, because if the primary goes offline due to maintenance or other issues, this secondary could become your next primary node. If it has similar hardware specifications and still cannot catch up, check the network: the link between the nodes may be too slow for the workload. Other possible causes of replication lag include:

  • Disk Throughput
  • Concurrency
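Whichever cause applies, it helps to track whether the lag is shrinking or growing over time. The sketch below computes per-member lag from a document shaped like the output of the `replSetGetStatus` command, using its `members[].name`, `members[].stateStr` and `members[].optimeDate` fields. The host names and timestamps in the sample are made up for illustration; with PyMongo the real document could be fetched with `client.admin.command("replSetGetStatus")`.

```python
from datetime import datetime, timedelta


def replication_lag_seconds(status):
    """Return {member name: seconds behind the primary} for non-primary members."""
    members = status["members"]
    primary = next(m for m in members if m["stateStr"] == "PRIMARY")
    lags = {}
    for member in members:
        if member["stateStr"] == "PRIMARY":
            continue
        delta = primary["optimeDate"] - member["optimeDate"]
        lags[member["name"]] = delta.total_seconds()
    return lags


# Hypothetical sample shaped like replSetGetStatus output (not real data).
now = datetime(2023, 1, 1, 12, 0, 0)
sample_status = {
    "members": [
        {"name": "node-a:27017", "stateStr": "PRIMARY", "optimeDate": now},
        {"name": "node-b:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2)},
        {"name": "node-new:27017", "stateStr": "SECONDARY",
         "optimeDate": now - timedelta(seconds=2736)},  # 0.76 hours
    ]
}

if __name__ == "__main__":
    for name, lag in replication_lag_seconds(sample_status).items():
        print(f"{name}: {lag / 3600:.2f} hours behind")
```

Sampling this every few minutes shows how fast the new node is losing ground, which can then be compared against disk and network metrics on that node.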

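One possible solution if a secondary cannot keep up is to use w:majority instead of w:1 in the application. With w:majority, each write waits until a majority of members have acknowledged it, so the application's write rate is throttled to what the replica set can sustain. This can be set in the connection string URI so that it applies as the default for every operation on that connection. Note that this is also the default write concern starting from MongoDB 5.0.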
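As an illustration of the connection-string approach, here is a small Python sketch that adds `w=majority` (and optionally `wtimeoutMS`) to an existing URI. The helper name and the example hosts are hypothetical; `w` and `wtimeoutMS` are standard connection string options.

```python
from urllib.parse import parse_qsl, urlencode


def with_majority_write_concern(uri, wtimeout_ms=None):
    """Return uri with w=majority set, replacing any existing w/wtimeoutMS."""
    base, _, query = uri.partition("?")
    # Keep every existing option except the ones we are about to set.
    options = [
        (key, value)
        for key, value in parse_qsl(query, keep_blank_values=True)
        if key.lower() not in ("w", "wtimeoutms")
    ]
    options.append(("w", "majority"))
    if wtimeout_ms is not None:
        options.append(("wtimeoutMS", str(wtimeout_ms)))
    # Make sure there is a "/" between the host list and the options.
    _, _, rest = base.partition("://")
    if "/" not in rest:
        base += "/"
    return base + "?" + urlencode(options)


if __name__ == "__main__":
    # Hypothetical hosts and replica set name, for illustration only.
    print(with_majority_write_concern(
        "mongodb://node-a:27017,node-b:27017/app?replicaSet=rs0",
        wtimeout_ms=5000,
    ))
```

Setting a `wtimeoutMS` is worth considering alongside `w=majority`, since otherwise writes can block for as long as a majority is unavailable.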

Also, MongoDB v4.0 has been out of support since April 2022, so it is recommended to upgrade to a supported version, which is at least the 4.2 series. However, I would recommend upgrading to the latest 6.0 series if possible.

For more complete troubleshooting information on replica sets, please see: Troubleshoot Replica Sets.

Regards,
Tarun
