What is the main reason for a secondary server going into the RECOVERING state?

Hi Team,

Sorry if it’s a duplicate question.

We have a replication setup with one primary and one secondary. The secondary server keeps going into the RECOVERING state. Could you please help me find the exact cause so that I can prevent the issue in the future?
I stopped the mongod service on the secondary, deleted its data, and restarted the service. The secondary now shows the STARTUP2 state until the sync completes. I'm not sure what I'm missing: the last time I followed the same process, the secondary went from STARTUP2 into the RECOVERING state as soon as the sync finished. The primary and secondary servers have the same configuration. Please let me know if you need any more details.
Below is the error I found in mongod.log:

    "ctx":"ReplCoordExtern-0","msg":"Recreating cursor for oplog fetcher due to error","attr":{"lastOpTimeFetched":{"ts":{"$timestamp":{"t":1657624021,"i":126}},"t":1},"attemptsRemaining":1,"error":"CappedPositionLost: Error while getting the next batch in the oplog fetcher :: caused by :: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(7119440959259017343)"}}

Hi,

There are two dimensions to consider in the replication process: the system resources available (CPU, RAM, disk…) and the oplog size.
I see your servers have the same config, but if the secondary also receives a lot of read operations, for example, that can hurt replication performance, since both replication and reads can be I/O intensive. You can check the member states and how far the secondary is lagging with the snippet below.
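
If it helps, here is a quick way to check this from the primary with mongosh (the host names in the output will be your own):

    // Run on the primary with mongosh
    rs.status().members.forEach(m => print(m.name, m.stateStr));  // look for RECOVERING / STARTUP2
    rs.printSecondaryReplicationInfo();  // shows how far each secondary is behind the primary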

If I understood correctly, you've tried to resync the node, but as soon as STARTUP2 finishes, it goes into the RECOVERING state again.
This happens when the time spent on the resync is longer than the oplog window of the node it syncs from (usually the primary). When the initial sync finishes, the secondary has to replay the oplog from the point where the sync started, but the oldest entries it needs have already been overwritten by newer ones. That is exactly what the CappedPositionLost error in your log is saying: the position in the capped oplog collection was deleted before the fetcher could read it. You can compare the oplog window with your resync time as shown below.
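
As a rough check, assuming you can connect to the primary (the sync source) with mongosh:

    // Run on the primary with mongosh
    rs.printReplicationInfo();            // "log length start to end" is the current oplog window
    var info = db.getReplicationInfo();   // same information as an object
    print("oplog window (hours):", info.timeDiffHours, "| configured size (MB):", info.logSizeMB);

If a full initial sync takes longer than that window, the secondary will end up in RECOVERING every time.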

If you see no server bottleneck that can be tuned to speed up the replication process, I suggest you set a larger oplog size on both the primary and secondary nodes, so that the oplog window is longer than the time a full resync takes. A sketch of how that could look is below.
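
For example, the oplog can be resized online on each member (MongoDB 3.6+ with WiredTiger). The 16000 MB value here is only an illustration, not a recommendation; pick a size that gives you a comfortable window for your workload:

    // Run on EACH member (primary and secondary), connected directly with mongosh
    db.adminCommand({ replSetResizeOplog: 1, size: 16000 });  // new size in MB (example value)
    rs.printReplicationInfo();  // confirm the new configured oplog size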


Sorry if it's a dumb question; I'm very new to MongoDB.

How should I decide what oplog size to set?