Slow Replication Causing Replica Set Secondary to Repeatedly Fall Out of Sync

Over the last month, we’ve been encountering issues with replication between the primary and secondary servers in our HA configuration. We are currently running MongoDB v4.4.8 across our environment.

HA Config:
Site 1: 1 Primary server, 22TB of data
Site 2: 1 Secondary server, voting, electable
Site 3: 1 Arbiter

The primary and secondary each have a 28-core/56-thread CPU, 256GB of RAM, and a 10Gbit network interface.

The oplog is set to 450GB:

rs0:SECONDARY> rs.printReplicationInfo()
configured oplog size:   450000MB
log length start to end: 153741secs (42.71hrs)
oplog first event time:  Sat Nov 13 2021 12:05:09 GMT-0600 (CST)
oplog last event time:   Mon Nov 15 2021 06:47:30 GMT-0600 (CST)
now:                     Mon Nov 15 2021 13:52:15 GMT-0600 (CST)

Our secondary has fallen completely out of sync twice in the last month, which is very concerning given that the oplog gives us 40+ hours of runway. It became even more of an issue when creating indexes, as we had to remove our secondary’s voting rights and priority to get our primary back to the performance our application expects.

From my research, this typically comes down to network speed/bandwidth, latency, hardware specs, or configuration issues. The network and hardware specs are not a concern between the sites, so how can I narrow down the latency and potential configuration issues? Long term, I think we need to scale out with sharding to handle our data intake, but first we need to restore our desired HA configuration so we can work towards that end state. We currently have a lot of pain points with a setup that was not designed for this use case, so any feedback/suggestions would be greatly appreciated!

rs.conf():
rs0:SECONDARY> rs.conf()
{
	"_id" : "rs0",
	"version" : 171450,
	"term" : 18,
	"protocolVersion" : NumberLong(1),
	"writeConcernMajorityJournalDefault" : true,
	"members" : [
		{
			"_id" : 2,
			"host" : "mongodb04:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 1,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 3,
			"host" : "arbiter0:27017",
			"arbiterOnly" : true,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 1
		},
		{
			"_id" : 4,
			"host" : "mongodb03:27017",
			"arbiterOnly" : false,
			"buildIndexes" : true,
			"hidden" : false,
			"priority" : 0,
			"tags" : {
			},
			"slaveDelay" : NumberLong(0),
			"votes" : 0
		}
	],
	"settings" : {
		"chainingAllowed" : true,
		"heartbeatIntervalMillis" : 2000,
		"heartbeatTimeoutSecs" : 10,
		"electionTimeoutMillis" : 10000,
		"catchUpTimeoutMillis" : -1,
		"catchUpTakeoverDelayMillis" : 30000,
		"getLastErrorModes" : {
		},
		"getLastErrorDefaults" : {
			"w" : 1,
			"wtimeout" : 0
		},
		"replicaSetId" : ObjectId("5ef3f5c0ea85cbd0890c2fa4")
	}
} 
rs.status():
    rs0:SECONDARY> rs.status()
    {
    	"set" : "rs0",
    	"date" : ISODate("2021-11-15T20:02:40.060Z"),
    	"myState" : 2,
    	"term" : NumberLong(18),
    	"syncSourceHost" : "mongodb04:27017",
    	"syncSourceId" : 2,
    	"heartbeatIntervalMillis" : NumberLong(2000),
    	"majorityVoteCount" : 2,
    	"writeMajorityCount" : 1,
    	"votingMembersCount" : 2,
    	"writableVotingMembersCount" : 1,
    	"optimes" : {
    		"lastCommittedOpTime" : {
    			"ts" : Timestamp(1637006807, 3168),
    			"t" : NumberLong(18)
    		},
    		"lastCommittedWallTime" : ISODate("2021-11-15T20:06:47.588Z"),
    		"readConcernMajorityOpTime" : {
    			"ts" : Timestamp(1636981009, 2780),
    			"t" : NumberLong(18)
    		},
    		"readConcernMajorityWallTime" : ISODate("2021-11-15T12:56:50.315Z"),
    		"appliedOpTime" : {
    			"ts" : Timestamp(1636981009, 2780),
    			"t" : NumberLong(18)
    		},
    		"durableOpTime" : {
    			"ts" : Timestamp(1636981009, 2780),
    			"t" : NumberLong(18)
    		},
    		"lastAppliedWallTime" : ISODate("2021-11-15T12:56:50.315Z"),
    		"lastDurableWallTime" : ISODate("2021-11-15T12:56:50.315Z")
    	},
    	"members" : [
    		{
    			"_id" : 2,
    			"name" : "mongodb04:27017",
    			"health" : 1,
    			"state" : 1,
    			"stateStr" : "PRIMARY",
    			"uptime" : 34617,
    			"optime" : {
    				"ts" : Timestamp(1637006807, 3771),
    				"t" : NumberLong(18)
    			},
    			"optimeDurable" : {
    				"ts" : Timestamp(1637006807, 3168),
    				"t" : NumberLong(18)
    			},
    			"optimeDate" : ISODate("2021-11-15T20:06:47Z"),
    			"optimeDurableDate" : ISODate("2021-11-15T20:06:47Z"),
    			"lastHeartbeat" : ISODate("2021-11-15T20:02:39.879Z"),
    			"lastHeartbeatRecv" : ISODate("2021-11-15T20:02:39.513Z"),
    			"pingMs" : NumberLong(30),
    			"lastHeartbeatMessage" : "",
    			"syncSourceHost" : "",
    			"syncSourceId" : -1,
    			"infoMessage" : "",
    			"electionTime" : Timestamp(1630031492, 1),
    			"electionDate" : ISODate("2021-08-27T02:31:32Z"),
    			"configVersion" : 171450,
    			"configTerm" : 18
    		},
    		{
    			"_id" : 3,
    			"name" : "arbiter0:27017",
    			"health" : 1,
    			"state" : 7,
    			"stateStr" : "ARBITER",
    			"uptime" : 34617,
    			"lastHeartbeat" : ISODate("2021-11-15T20:02:38.930Z"),
    			"lastHeartbeatRecv" : ISODate("2021-11-15T20:02:38.173Z"),
    			"pingMs" : NumberLong(39),
    			"lastHeartbeatMessage" : "",
    			"syncSourceHost" : "",
    			"syncSourceId" : -1,
    			"infoMessage" : "",
    			"configVersion" : 171450,
    			"configTerm" : 18
    		},
    		{
    			"_id" : 4,
    			"name" : "mongodb03:27017",
    			"health" : 1,
    			"state" : 2,
    			"stateStr" : "SECONDARY",
    			"uptime" : 34692,
    			"optime" : {
    				"ts" : Timestamp(1636981009, 2780),
    				"t" : NumberLong(18)
    			},
    			"optimeDate" : ISODate("2021-11-15T12:56:49Z"),
    			"syncSourceHost" : "mongodb04:27017",
    			"syncSourceId" : 2,
    			"infoMessage" : "",
    			"configVersion" : 171450,
    			"configTerm" : 18,
    			"self" : true,
    			"lastHeartbeatMessage" : ""
    		}
    	],
    	"ok" : 1,
    	"$clusterTime" : {
    		"clusterTime" : Timestamp(1637006807, 3771),
    		"signature" : {
    			"hash" : BinData(0,"Y2W6fSCoS7BqKUvYnc/Ww7+mjPw="),
    			"keyId" : NumberLong("6976771418520289346")
    		}
    	},
    	"operationTime" : Timestamp(1636981009, 2780)
    }

Hi @Neil_Allen, welcome to the community!

First of all, I would encourage you to upgrade as soon as possible to MongoDB 4.4.10 due to critical issues identified in 4.4.8.

It seems to me from your description that you have a lot of data being inserted into MongoDB almost all the time; so much, in fact, that a 450GB oplog is not enough. Is this correct? How much data are you inserting?

A quick fix that I can think of off the top of my head is to use majority write concern for all your writes. This provides natural throttling for the inserts so that they won’t overwhelm your replica set. In short, with majority write concern a write is acknowledged only after it has been replicated to the secondary, so there’s less chance of the secondary falling off the end of the oplog.
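
For example (a minimal sketch, not from the original post; the collection name and fields below are made up for illustration), a majority write from the mongo shell looks like this:

// Hypothetical collection and fields. The insert returns only after the
// write has been acknowledged by a majority of voting, data-bearing members,
// which in a PSA replica set with both data-bearing members voting means
// both the primary and the secondary.
db.events.insertOne(
    { source: "sensor-01", ts: new Date(), value: 42 },
    { writeConcern: { w: "majority" } }
)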

One caveat with majority write concern is that it can interact badly with arbiters. Should the secondary go offline for some reason, writes could hang indefinitely waiting for acknowledgement from the offline secondary (since arbiters cannot acknowledge writes), so some mitigation plan is probably needed for that case.
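
If you do go down this path, one mitigation to consider (again only a sketch, not something from your current setup) is pairing w: "majority" with a wtimeout so a write returns an error instead of blocking forever while the secondary is unreachable:

// Hypothetical example: if majority acknowledgement does not arrive within
// 5 seconds, the operation returns a write concern error instead of hanging.
// Note that the timeout does not undo the write; it may still replicate
// later, so the application needs to handle that case.
db.events.insertOne(
    { source: "sensor-01", ts: new Date(), value: 42 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
)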

Best regards
Kevin

Hey @kevinadi I really appreciate the reply!!!

Thanks for pointing out the 4.4.8 issues; we definitely need to get onto 4.4.10 as soon as we can.

We do have a ton of data inserted into MongoDB daily. In the last year, our database has grown from 12TB to 22TB. Filtering out the data we truly need is another discussion, as we need to do some cleansing of our data. We typically see about 30GB of data ingest daily, and our 450GB oplog gives us roughly 42 hours of runway. The concern is that we keep falling so far behind that we end up outside of that oplog window.

I’ve seen numerous bits of documentation around write concern, and I think that is the right direction. I do have a question, and bear with me if I ask something conceptually incorrect: if we were to change that configuration, would we see our database stop accepting writes until our secondary caught up in the oplog? I just want to make sure that I’m not sacrificing the health of the primary in order to keep the secondary caught up.

Would changing the default setting for flowControlTargetLagSeconds to a lower value be another way to close the gap between the databases? I’ve seen it mentioned a few times in the documentation, but I’ve failed to find any further discussion of the setting.

Hi @Neil_Allen

Note that the output of rs.printReplicationInfo() simply takes the timestamp of the first entry in the oplog and compares it against the last entry, so 42 hours is what is currently stored in the oplog. This will vary with your workload, so if you have a lot of data coming in, you might find the window shrinks below that.
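
To see how that window behaves over time, you could periodically check the secondary’s lag and the oplog usage on each node, for example:

// Run from the mongo shell on the member you want to inspect.
rs.printSecondaryReplicationInfo()   // per-member lag behind the primary
                                     // (rs.printSlaveReplicationInfo() on older shells)
db.getReplicationInfo()              // oplog size, first/last event times, and window on this node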

No; the primary keeps accepting writes. It’s just that each particular write won’t return an acknowledgement until it has propagated to the secondary, and that won’t block other writes coming in.

flowControlTargetLagSeconds is another setting that attempts to throttle incoming writes. However, in terms of modifying the flowControlTargetLagSeconds server parameter, I would suggest keeping the default value for now, trying majority write concern first to see if it fixes your issue, and changing this parameter from its default only as a last resort.
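
For reference, the parameter can be inspected and changed at runtime on the primary, but as above I would leave it at its default of 10 seconds unless majority write concern alone does not help:

// Check the current value, and (only if needed later) set it explicitly.
db.adminCommand({ getParameter: 1, flowControlTargetLagSeconds: 1 })
db.adminCommand({ setParameter: 1, flowControlTargetLagSeconds: 10 })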

Best regards
Kevin