MongoDB failure to resync a stale member of a replica set

I have a MongoDB 4.2 replica set with 3 nodes: primary, secondary, and arbiter. The primary occupies close to 250 GB of disk space, and the oplog size is 15 GB.

The secondary was down for a few hours. I tried recovering it by restarting, but it stayed in RECOVERING forever.

I tried an initial sync by deleting the files in the data path; it ran for 15 hours, the data path grew to 140 GB, and then it failed.

I then tried to copy the data files from the primary to seed the recovering secondary, following Resync a Member of a Replica Set — MongoDB Manual. This did not work either (the member went stale again).

In the latest docs (5.0) they mention using a new member ID. Does that apply to 4.2 as well? Changing the member ID throws an error, because the IP and port are the same for the node I am trying to recover.

Since this method was also unsuccessful, I am planning to recover the node using a different data path and port, so that the primary might consider it a new node. Once the secondary is up, I will change the port back to the one I want and restart. Will this work?

Please share any other suggestions for recovering a replica set node with a large data size like 250 GB.
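One thing worth checking in a situation like this (not mentioned in the thread, but a common cause): with a 15 GB oplog and an initial sync that takes ~15 hours, the sync source's oplog can roll over before the sync finishes, leaving the new member permanently stale. A minimal sketch of how to check and enlarge the oplog window, assuming a 4.2 legacy `mongo` shell on the primary; the 50 GB size below is purely illustrative:

```shell
# Check the current oplog window (the time span between the first and
# last oplog entries). It must comfortably exceed the initial-sync duration.
mongo --eval "rs.printReplicationInfo()"

# If the window is shorter than the sync takes, grow the oplog.
# The size is given in megabytes; 51200 MB = 50 GB is an example value,
# not a recommendation for this workload.
mongo --eval "db.adminCommand({ replSetResizeOplog: 1, size: 51200 })"
```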

Hi @bhargava_vn,
welcome to the community!

How many indexes do you have?
What kind of error do you have on the secondary?

Best regards

Hi @Fabio_Ramohitaj ,
Thanks for the reply
I use many indexes, do you need a count?
The total size of the indexes is 123177967616 bytes (roughly 115 GB).
Attaching the last snippet of the log with the error message:
error_log.txt (2.1 KB)

Hi @bhargava_vn,
I see this issue in the log:
initialSyncAttempts: [ { durationMillis: 50655350, status: "HostUnreachable: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback :: caused by :: Error connecting to …", syncSource: ":270…" }
Could it be a network problem?
Check it out and let me know.
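A quick way to rule out connectivity problems between the recovering node and its sync source is a sketch like the following, run from the secondary. The host and port here are placeholders, not values from the thread (the actual address is truncated in the log):

```shell
# Basic network reachability to the sync source (placeholder IP):
ping -c 3 10.0.0.1

# Verify the mongod port answers and the server responds to a ping command
# (4.2 ships the legacy mongo shell):
mongo --host 10.0.0.1 --port 27017 --eval "db.adminCommand('ping')"
```

If the second command fails while the first succeeds, suspect a firewall rule or bindIp setting rather than the network itself.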

Best Regards

I did not have much time to troubleshoot, so I went ahead with the plan we thought would work (1 hour of downtime):

  1. Shut down the primary.
  2. Copy the data files from the primary node and place them in a new db path (different from the recovering node's db path).
  3. Change the log path.
  4. Start the mongod service on a different port (not the one used by the recovering node). The changes in db path, log path, and port were made hoping MongoDB would treat this as a new node, as an alternative to the new-member-ID approach mentioned in the 5.0 docs.
  5. Start the primary.
  6. Add it to the replica set using rs.add("IP:new port") on the primary.

This worked; I could see the secondary node coming up successfully.
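The steps above can be sketched roughly as follows. All paths, the hostname, and the port are placeholders I've made up for illustration, not values from the thread:

```shell
# 1. On the primary: stop mongod so the data files are consistent for copying.
sudo systemctl stop mongod

# 2-3. Copy the primary's data files into a fresh db path on the recovering
#      host (a path different from the recovering node's old db path).
rsync -av /var/lib/mongodb/ recovering-host:/var/lib/mongodb-new/

# 4. On the recovering host: start mongod with the new db path, a new log
#    path, and a port other than the one the stale member was using.
mongod --replSet rs0 \
       --dbpath /var/lib/mongodb-new \
       --logpath /var/log/mongodb/mongod-new.log \
       --port 27018 --fork

# 5. Restart the primary.
sudo systemctl start mongod

# 6. From a mongo shell on the primary, add the seeded node to the set:
mongo --eval 'rs.add("recovering-host:27018")'
```

Because the seeded node already holds a recent copy of the data, it only needs to replay the oplog entries written during the downtime window rather than perform a full initial sync.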

