Slow Replication Lag & Recovery of Replica Set

I am currently testing MongoDB for potential use on a live system, but I have run into an issue with MongoDB replication.

I have a PSA (Primary-Secondary-Arbiter) setup on MongoDB v4.2.6: 3 servers, each with a 10-core CPU and 20GB RAM. I am currently testing how MongoDB behaves if, for instance, I stop replication manually.

Configuration:
The oplog was configured on each server with a size of 160GB, which is more than I need for the test I am conducting. The secondary's sync target was also set to the primary server with rs.syncFrom("mongo01:27017").

PRIMARY> rs.printReplicationInfo()
configured oplog size:   160000MB
log length start to end: 12799secs (3.56hrs)
oplog first event time:  Wed Apr 22 2020 11:21:03 GMT+0000 (UTC)
oplog last event time:   Wed Apr 22 2020 14:54:22 GMT+0000 (UTC)
now:                     Wed Apr 22 2020 14:54:25 GMT+0000 (UTC)

During the test I insert roughly 20k JSON transactions per second. After one minute I stop the replica server by shutting down mongod and let the inserts continue for 5 minutes. I then turn the replica back on, and these are the 2 issues I have encountered:

  1. When I turn off the replica, throughput falls to 2K transactions per second, compared to the 20K per second I was achieving while my PSA setup was fully up and running. I would like to know why this is happening, because I cannot see any errors apart from the fact that the secondary server is down. Can you provide any insight on this?

  2. When I restart the replica, the TPS still remains at 2K per second and the replication lag keeps increasing without ever recovering. I believe this is not an oplog issue because I didn't exceed the configured size.
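As a rough sanity check on the oplog reasoning in point 2, here is a small sketch that compares the outage length with the oplog window reported by rs.printReplicationInfo(). The helper name canResumeFromOplog is made up for illustration, and it assumes the simple rule that a secondary can resume from the oplog as long as its downtime is shorter than the window.

```javascript
// Values taken from the rs.printReplicationInfo() output in this post.
const oplogWindowSecs = 12799; // "log length start to end"
const outageSecs = 5 * 60;     // secondary stopped for 5 minutes

// Hypothetical helper: true if the outage fits inside the oplog window,
// i.e. the entries the secondary missed should still be in the primary's oplog.
function canResumeFromOplog(windowSecs, downSecs) {
  return downSecs < windowSecs;
}

console.log(canResumeFromOplog(oplogWindowSecs, outageSecs)); // true
console.log((oplogWindowSecs / 3600).toFixed(2) + " hrs");    // 3.56 hrs
```

With these numbers the 5 minute outage is far smaller than the 3.56 hour window, which is consistent with the lag not being caused by the oplog rolling over.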

I can see some major flaws here. What happens if I have to recover my replica after even 5 minutes of downtime? Do I have to recover everything using the "Replica Set Resync by Copying" method? That isn't as feasible as having the replica replicate only the required changes, considering this is just 5 minutes of data (600k entries, given that the insert rate dropped to 2K per second).
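To put numbers on that backlog, here is a rough sketch of the arithmetic. The function names and the secondary apply rates (5000 and 1500 per second) are made-up assumptions for illustration, not measurements from this cluster. The only figures taken from the test are the 5 minute outage and the 2K per second insert rate.

```javascript
// Entries the secondary missed while it was down.
function backlogEntries(outageSecs, insertsPerSec) {
  return outageSecs * insertsPerSec;
}

// Seconds needed to drain the backlog while new writes keep arriving.
// If the secondary applies no faster than the primary inserts,
// the lag never shrinks, so return Infinity.
function catchUpSeconds(backlog, applyPerSec, incomingPerSec) {
  const netRate = applyPerSec - incomingPerSec;
  if (netRate <= 0) {
    return Infinity;
  }
  return backlog / netRate;
}

const backlog = backlogEntries(5 * 60, 2000);
console.log(backlog);                             // 600000
console.log(catchUpSeconds(backlog, 5000, 2000)); // 200 (assumed apply rate)
console.log(catchUpSeconds(backlog, 1500, 2000)); // Infinity (assumed apply rate)
```

The second case matches the symptom described here: if the restarted secondary applies operations more slowly than they arrive, the lag only grows.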

Is there any way to fix these issues, please? Maybe there is a configuration setting I am missing.

What happens to the TPS rate if you let the writes go beyond 1 minute without shutting down a replica? I’m wondering if your file system cache/storage subsystem is slowing down because of other reasons.

The TPS rate remains stable. I actually tested the whole setup for a full day at 20k TPS without any issues whatsoever. The issue started with the first test, when I stopped the secondary for just 5 minutes and saw that huge reduction in TPS. Furthermore, when I started the secondary again after those 5 minutes, the replication lag never recovered, even after I stopped the application that was passing data to MongoDB.

What are you testing with? My tendency is to look at that rather than MongoDB.

As for the rejoining node not catching up, you'll need to look at the logs.
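Alongside the logs, it can help to track the lag itself over time. Below is a minimal sketch that computes the lag per member from a document shaped like the output of rs.status(), using the members[].name, members[].stateStr and members[].optimeDate fields. The function name, the host mongo02:27017 and the sample timestamps are made up for illustration. In the mongo shell you would pass rs.status() instead of the sample object.

```javascript
// Sketch: seconds each data-bearing member is behind the primary,
// based on the optimeDate fields of an rs.status()-style document.
function replicationLagSeconds(status) {
  const primary = status.members.find((m) => m.stateStr === "PRIMARY");
  if (!primary) {
    throw new Error("no PRIMARY found in status document");
  }
  const lags = {};
  for (const m of status.members) {
    // Skip the primary itself and the arbiter, which holds no data.
    if (m.stateStr === "SECONDARY" || m.stateStr === "RECOVERING") {
      lags[m.name] =
        (primary.optimeDate.getTime() - m.optimeDate.getTime()) / 1000;
    }
  }
  return lags;
}

// Made-up sample mimicking a PSA set; only mongo01:27017 appears in this thread.
const sampleStatus = {
  members: [
    { name: "mongo01:27017", stateStr: "PRIMARY",
      optimeDate: new Date("2020-04-22T14:54:22Z") },
    { name: "mongo02:27017", stateStr: "SECONDARY",
      optimeDate: new Date("2020-04-22T14:49:22Z") },
    { name: "mongo03:27017", stateStr: "ARBITER" },
  ],
};

console.log(replicationLagSeconds(sampleStatus)); // { 'mongo02:27017': 300 }
```

Running this every few seconds shows whether the lag is shrinking, flat, or growing, which narrows down what to look for in the secondary's log.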