Restarting oplog query due to error: CappedPositionLost

Hi,

I have a replica set with a primary, a secondary and an arbiter (PSA). The secondary node often falls behind, goes into ROLLBACK, and then into RECOVERING.
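In case it is useful, this is roughly how I can check how far the secondary is behind (just the standard shell helpers, run against any member; nothing custom in my setup):

```
// Compare each member's last applied optime with the primary's to see
// how far behind the secondary is. rs.printSecondaryReplicationInfo()
// (rs.printSlaveReplicationInfo() on older versions) prints the same
// information in a friendlier form.
var s = rs.status();
s.members.forEach(function (m) {
    print(m.name, m.stateStr, m.optimeDate);
});
```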

I can see the logs below on the failed node and am trying to understand what they actually mean, so that I can take corrective measures to prevent this from happening again.

2023-01-08T10:09:54.311+0000 I REPL [replication-45623] Restarting oplog query due to error: CappedPositionLost: error in fetcher batch callback :: caused by :: CollectionScan died due to position in capped collection being deleted. Last seen record id: RecordId(7186192181332281845). Last fetched optime: { ts: Timestamp(1673165751, 2548), t: 87 }. Restarts remaining: 1

2023-01-08T10:09:54.457+0000 I REPL [replication-45623] Scheduled new oplog query Fetcher source: IP:27017 database: local query: { find: "oplog.rs", filter: { ts: { $gte: Timestamp(1673165751, 2548) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 87, readConcern: { afterClusterTime: Timestamp(0, 1) } } query metadata: { $replData: 1, $oplogQueryData: 1, $readPreference: { mode: "secondaryPreferred" } } active: 1 findNetworkTimeout: 7000ms getMoreNetworkTimeout: 10000ms shutting down?: 0 first: 1 firstCommandScheduler: RemoteCommandRetryScheduler request: RemoteCommand 2504256772 -- target:IP:27017 db:local cmd:{ find: "oplog.rs", filter: { ts: { $gte: Timestamp(1673165751, 2548) } }, tailable: true, oplogReplay: true, awaitData: true, maxTimeMS: 2000, batchSize: 13981010, term: 87, readConcern: { afterClusterTime: Timestamp(0, 1) } } active: 1 callbackHandle.valid: 1 callbackHandle.cancelled: 0 attempt: 1 retryPolicy: {type: "NoRetryPolicy"}

2023-01-08T10:09:55.024+0000 I REPL [rsBackgroundSync] Starting rollback due to OplogStartMissing: Our last optime fetched: { ts: Timestamp(1673165751, 2548), t: 87 }. source's GTE: { ts: Timestamp(1673165810, 2079), t: 87 }

2023-01-08T10:09:55.239+0000 I REPL [rsBackgroundSync] Replication commit point: { ts: Timestamp(1673165573, 4027), t: 87 }

2023-01-08T10:09:55.239+0000 I REPL [rsBackgroundSync] Waiting for all operations from { ts: Timestamp(1673165574, 262), t: 87 } until { ts: Timestamp(1673165751, 2548), t: 87 } to be applied before starting rollback.
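If I understand the CappedPositionLost message correctly, the secondary's fetch position was deleted from the capped oplog collection, i.e. the secondary fell outside the oplog window of its sync source. A quick way to check that window (standard helper, shown here only as a sketch) would be:

```
// On the primary: prints the configured oplog size and the time span
// ("log length start to end") the oplog currently covers. If the
// secondary's lag exceeds this span, it falls off the oplog and has to
// roll back / resync.
rs.printReplicationInfo()

// Raw configured oplog size in bytes, if needed:
db.getSiblingDB("local").oplog.rs.stats().maxSize
```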

I had also been getting a lot of WriteConcernFailed errors before these messages appeared. Does that mean they are causing the replication lag?
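For illustration only (the collection name, write concern, and timeout below are made up, not my actual settings), this is the kind of write that returns WriteConcernFailed when the secondary cannot acknowledge in time:

```
// Hypothetical example: a write that requires majority acknowledgement
// within 5 seconds. In a PSA set the arbiter never acknowledges writes,
// so if the secondary is lagging or unavailable the write times out with
// a WriteConcernFailed error.
db.test.insertOne(
    { x: 1 },
    { writeConcern: { w: "majority", wtimeout: 5000 } }
);
```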

Also, please point me to any available resources that explain how to investigate the cause of a node failure from the logs and how to interpret the different types of log messages.

Thanks