Replication errors

Hi,

I have MongoDB 4.0 with a two-member replica set.

    rs.conf()
    {
    	"_id" : "rs0",
    	"version" : 11,
    	"protocolVersion" : NumberLong(1),
    	"writeConcernMajorityJournalDefault" : true,
    	"members" : [
    		{
    			"_id" : 1,
    			"host" : "192.168.123.86:27017",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 3,
    			"tags" : {
    				
    			},
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		},
    		{
    			"_id" : 2,
    			"host" : "192.168.123.87:27017",
    			"arbiterOnly" : false,
    			"buildIndexes" : true,
    			"hidden" : false,
    			"priority" : 1,
    			"tags" : {
    				
    			},
    			"slaveDelay" : NumberLong(0),
    			"votes" : 1
    		}
    	],
    	"settings" : {
    		"chainingAllowed" : true,
    		"heartbeatIntervalMillis" : 2000,
    		"heartbeatTimeoutSecs" : 10,
    		"electionTimeoutMillis" : 10000,
    		"catchUpTimeoutMillis" : 60000,
    		"catchUpTakeoverDelayMillis" : 30000,
    		"getLastErrorModes" : {
    			
    		},
    		"getLastErrorDefaults" : {
    			"w" : 1,
    			"wtimeout" : 0
    		},
    		"replicaSetId" : ObjectId("58764207c0fb84b262e464aa")
    	}
    }

After the initial synchronization, the secondary stays days behind the primary.

    rs.printSlaveReplicationInfo()
    source: 192.168.123.87:27017
    	syncedTo: Mon Jun 29 2020 22:59:56 GMT+0200 (CEST)
    	221205 secs (61.45 hrs) behind the primary 
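
For what it's worth, this is roughly how I have been double-checking the lag from rs.status() in the mongo shell (just a sketch; it assumes there is exactly one PRIMARY and one SECONDARY, as in the config above):

    // Sketch: compute the secondary's lag, in seconds, from rs.status() optimes.
    var s = rs.status();
    var primary   = s.members.filter(function (m) { return m.stateStr === "PRIMARY"; })[0];
    var secondary = s.members.filter(function (m) { return m.stateStr === "SECONDARY"; })[0];
    if (primary && secondary) {
        // optimeDate is the wall-clock time of the last operation applied on that member
        var lagSecs = (primary.optimeDate - secondary.optimeDate) / 1000;
        print("secondary " + secondary.name + " is " + lagSecs + " secs behind");
    }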

There are timeout errors every five minutes in the logs:

-- primary log --
    2020-07-02T10:38:50.733+0200 I COMMAND  [LogicalSessionCacheRefresh] command config.$cmd command: update { update: "system.sessions", ordered: false, allowImplicitCollectionCreation: false, writeConcern: { w: "majority", wtimeout: 15000 }, $db: "config" } numYields:0 reslen:383 locks:{ Global: { acquireCount: { r: 1253, w: 1165 } }, Database: { acquireCount: { w: 1165 } }, Collection: { acquireCount: { w: 1165 } } } storage:{} protocol:op_msg 30651ms
    2020-07-02T10:38:50.743+0200 I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }
-- secondary log --
    2020-07-02T10:39:32.575+0200 I NETWORK  [LogicalSessionCacheReap] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheReap] Successfully connected to 192.168.123.86:27017 (1 connections now open to 192.168.123.86:27017 with a 0 second timeout)
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Successfully connected to 192.168.123.86:27017 (2 connections now open to 192.168.123.86:27017 with a 0 second timeout)
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:32.577+0200 I NETWORK  [LogicalSessionCacheRefresh] Starting new replica set monitor for rs0/192.168.123.86:27017,192.168.123.87:27017
    2020-07-02T10:39:48.441+0200 I CONTROL  [LogicalSessionCacheRefresh] Failed to refresh session cache: WriteConcernFailed: waiting for replication timed out; Error details: { wtimeout: true }

What can I do to synchronize the replica set?

wbr Tomaz

Hi Tomaz,

I’m not sure why your replication is in this state. It has been some time since you posted this question; are you still having this issue?

If yes, could you post:

  • your MongoDB version
  • output of rs.status()
  • output of rs.printReplicationInfo()
  • how long the initial sync took

Please also describe the hardware provisioned for the two nodes.
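
Something along these lines in the mongo shell should gather most of the above (just a sketch; run it on the primary, and rs.printSlaveReplicationInfo() will also show how far the secondary is behind):

    // Sketch: diagnostics requested above, run in the mongo shell on the primary.
    db.version()                      // server version
    rs.status()                       // member states, optimes, heartbeat info
    rs.printReplicationInfo()         // oplog size and time window on this node
    rs.printSlaveReplicationInfo()    // how far each secondary is behind
    db.serverStatus().mem             // resident/virtual memory used by this mongod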

Note that having an even number of replica set nodes is not a recommended configuration; at least three nodes are recommended for high availability. Please see Replica Set Deployment Architectures for more information.
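
If you do move to an odd number of voting members, adding a third node is a small change; for example (sketch only, the hostname below is a placeholder, not taken from your config):

    // Sketch: add a third member; replace the placeholder host with a real one.
    rs.add("192.168.123.88:27017")       // preferred: a third data-bearing member
    // or, if a third data node is not feasible:
    // rs.addArb("192.168.123.88:27017") // arbiter (votes, but holds no data)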

Best regards,
Kevin

Hi Kevin,
The MongoDB version is 4.0.19 running on Ubuntu 18.04, and it no longer shows these errors. I think the system was just overloaded; the load average was almost at 5. rs.status() showed the heartbeats were working, but rs.printReplicationInfo() stated that the secondary was more than 60 hours behind.
Could the reason be that the total index size is larger than system memory (64 GB) and MongoDB is continuously reloading indexes?
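
For reference, this is roughly how I summed the index sizes (a quick sketch in the mongo shell):

    // Sketch: total index size across all databases, in GB.
    var totalIndexBytes = 0;
    db.getMongo().getDBNames().forEach(function (name) {
        totalIndexBytes += db.getSiblingDB(name).stats().indexSize;
    });
    print("total index size: " + (totalIndexBytes / 1024 / 1024 / 1024).toFixed(1) + " GB");
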
Thanks for your suggestions.
wbr Tomaz

Hi Kevin,

I also have a PSA architecture set up, and I am seeing the same error message in the logs. It takes almost 10 hours for my secondary to sync, and once it catches up, it starts lagging behind the primary again after a few hours. I have also resized the oplog to almost 2 TB, but the issue is still not resolved. Can you please let us know how to fix this issue?
How can we handle this, given that we are running on bare-metal servers and the load average is not more than 5?
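
For reference, this is roughly how the oplog was checked and resized (sketch only; replSetResizeOplog takes the new size in megabytes, and the value below is just the ~2 TB example mentioned above):

    // Sketch: check the oplog window and size, then resize it (run on each member).
    rs.printReplicationInfo()                           // configured size and time window
    db.getSiblingDB("local").oplog.rs.stats().maxSize   // current max size, in bytes
    db.adminCommand({ replSetResizeOplog: 1, size: 2 * 1024 * 1024 })   // ~2 TB, in MB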

Hi Kevin,

I am still facing the same lag issue. I tried the initial sync twice, but it was of no use. I see the same message continuously in the log:

    Thu Feb 4 12:20:09.239 I NETWORK [LogicalSessionCacheRefresh] Starting new replica set monitor for MONGO_PR/xxx.com:12011,xxx.com:12012,yyy.com:12013.

Hi Mamatha,

Are you still facing this issue? As requested by Venkataraman in the other thread:
Can you please check the following outputs to see if the secondary is running into a hang issue (a short sketch follows the list):

  • db.serverStatus().connections
  • db.currentOp().inprog.length
  • lsof | wc -l
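
A small sketch of those checks (the first two run in the mongo shell on the lagging secondary; lsof runs from the OS shell on the same host):

    // Sketch: hang-related checks on the lagging secondary.
    db.serverStatus().connections      // current / available / totalCreated connections
    db.currentOp().inprog.length       // number of operations currently in progress
    // From the OS shell on the same host (not the mongo shell):
    //   lsof | wc -l                  // total number of open file handles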

Also please share the OS information.

Thanks,
Kiran