Why is the new replication member not continuing the replication?

Hello Dears,
I added a new member to my replica set. After the member had copied more than 280 GB over about 10 hours, it started the initial sync again from scratch.
I can't find the root cause of the restart, and I don't know whether the problem comes from replication parameters such as "initialSyncTransientErrorRetryPeriodSeconds" or "oplogInitialFindMaxSeconds".
Does anyone have a recommendation about my case?
Notes: my database exceeds 3.5 TB, MongoDB version 4.4.
Thanks a lot

Hard to determine without logs from the added member. If you post those, they should contain the relevant information about the restart of the sync.

Fault tolerance

If a secondary performing initial sync encounters a non-transient (i.e. persistent) network error during the sync process, the secondary restarts the initial sync process from the beginning.

Starting in MongoDB 4.4, a secondary performing initial sync can attempt to resume the sync process if interrupted by a transient (i.e. temporary) network error, collection drop, or collection rename. The sync source must also run MongoDB 4.4 to support resumable initial sync. If the sync source runs MongoDB 4.2 or earlier, the secondary must restart the initial sync process as if it encountered a non-transient network error.

By default, the secondary tries to resume initial sync for 24 hours. MongoDB 4.4 adds the initialSyncTransientErrorRetryPeriodSeconds server parameter for controlling the amount of time the secondary attempts to resume initial sync. If the secondary cannot successfully resume the initial sync process during the configured time period, it selects a new healthy source from the replica set and restarts the initial synchronization process from the beginning.

The secondary attempts to restart the initial sync up to 10 times before returning a fatal error.
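To see what retry window your members are currently using, you can query the parameter with `getParameter` in the mongo shell (a standard admin command; run it against the syncing member):

```javascript
// Inspect the current resume window for initial sync (default 86400 = 24 h).
db.adminCommand({ getParameter: 1, initialSyncTransientErrorRetryPeriodSeconds: 1 })
```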


The fs.chunks collection may be causing the problem; the size of that collection exceeds 1.5 TB.

{"t":{"$date":"2021-01-13T22:39:39.179+04:00"},"s":"I", "c":"INITSYNC", "id":21183, "ctx":"ReplCoordExtern-4","msg":"Finished cloning data. Beginning oplog replay","attr":{"databaseClonerFinishStatus":"InitialSyncFailure: CallbackCanceled: Error cloning collection 'MyDB.fs.chunks' :: caused by :: Initial sync attempt canceled"}}
{"t":{"$date":"2021-01-13T22:39:39.179+04:00"},"s":"I", "c":"INITSYNC", "id":21191, "ctx":"ReplCoordExtern-4","msg":"Initial sync attempt finishing up"}
{"t":{"$date":"2021-01-13T22:39:39.179+04:00"},"s":"I", "c":"INITSYNC", "id":21192, "ctx":"ReplCoordExtern-4","msg":"Initial Sync Attempt Statistics","attr":{"statistics":{"failedInitialSyncAttempts":2,"maxFailedInitialSyncAttempts":10,"initialSyncStart":{"$date":"2021-01-11T21:33:46.425Z"},"initialSyncAttempts":[{"durationMillis":42270609,"status":"MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: Error while getting the next batch in the oplog fetcher :: caused by :: operation exceeded time limit","syncSource":"10.74.4.24:27017","rollBackId":1,"operationsRetried":1,"totalTimeUnreachableMillis":2245},{"durationMillis":93613792,"status":"MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: Error while getting the next batch in the oplog fetcher :: caused by :: operation exceeded time limit","syncSource":"10.74.4.24:27017","rollBackId":1,"operationsRetried":10,"totalTimeUnreachableMillis":19056}],"appliedOps":0,"initialSyncOplogStart":{"$timestamp":{"t":1610536774,"i":5}},"initialSyncOplogFetchingStart":{"$timestamp":{"t":1610536773,"i":7}},"totalTimeUnreachableMillis":15213,"databases":{"databasesCloned":1,"admin":{"collections":3,"clonedCollections":3,"start":{"$date":"2021-01-13T11:19:34.542Z"},"end":{"$date":"2021-01-13T11:19:37.545Z"},"elapsedMillis":3003,"admin.system.version":{"documentsToCopy":2,"documentsCopied":2,"indexes":1,"fetchedBatches":1,"start":{"$date":"2021-01-13T11:19:34.705Z"},"end":{"$date":"2021-01-13T11:19:35.612Z"},"elapsedMillis":907,"receivedBatches":1},"admin.system.users":{"documentsToCopy":2,"documentsCopied":2,"indexes":2,"fetchedBatches":1,"start":{"$date":"2021-01-13T11:19:35.612Z"},"end":{"$date":"2021-01-13T11:19:37.221Z"},"elapsedMillis":1609,"receivedBatches":1},"admin.system.keys":{"documentsToCopy":3,"documentsCopied":3,"indexes":1,"fetchedBatches":1,"start":{"$date":"2021-01-13T11:19:37.221Z"},"end":{"$date":"2021-01-13T11:19:37.545Z"},"elapsedMillis":324,"receivedBatches":1}},"MyDB.fs.chunks":{"documentsToCopy":4135045,"documentsCopied":8019,"indexes":3,"fetchedBatches":302,"start":{"$date":"2021-01-13T12:17:01.011Z"},"receivedBatches":302}}},"config":{"collections":0,"clonedCollections":0},"test":{"collections":0,"clonedCollections":0}}}}}

Looks like it may be related to fetching the oplog. It could be ageing out before this new secondary can replicate it.

It could be worth checking the output of db.printReplicationInfo()
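As a sketch of what to look for: `db.printReplicationInfo()` must be run on the sync source (the primary or an existing healthy secondary), not on the member stuck in STARTUP2. If the "log length start to end" it reports is shorter than the time an initial sync takes, the oplog is rolling over before the new member can catch up. The same window can be computed by hand from the `local.oplog.rs` collection:

```javascript
// Run on the sync source, not the new STARTUP2 member.
db.printReplicationInfo()

// Or compute the oplog window manually from the oldest and newest entries:
var oplog = db.getSiblingDB("local").oplog.rs;
var first = oplog.find().sort({ $natural: 1 }).limit(1).next().ts;  // oldest entry
var last  = oplog.find().sort({ $natural: -1 }).limit(1).next().ts; // newest entry
print("oplog window (hours): " + (last.t - first.t) / 3600);
```

If the window is too small for a 3.5 TB sync, increasing the oplog size on the source (e.g. with `replSetResizeOplog`) is the usual remedy.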

...
"initialSyncAttempts": [
                {
                    "durationMillis": 42270609,
                    "status": "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: Error while getting the next batch in the oplog fetcher :: caused by :: operation exceeded time limit",
                    "syncSource": "10.74.4.24:27017",
                    "rollBackId": 1,
                    "operationsRetried": 1,
                    "totalTimeUnreachableMillis": 2245
                },
                {
                    "durationMillis": 93613792,
                    "status": "MaxTimeMSExpired: error fetching oplog during initial sync :: caused by :: Error while getting the next batch in the oplog fetcher :: caused by :: operation exceeded time limit",
                    "syncSource": "10.74.4.24:27017",
                    "rollBackId": 1,
                    "operationsRetried": 10,
                    "totalTimeUnreachableMillis": 19056
                }
            ],
...

I tried to run the command "db.printReplicationInfo()", but because the server is in the STARTUP2 state, it raised an error:

uncaught exception: Error: error: {
        "topologyVersion" : {
                "processId" : ObjectId("5ffcc41079e2c352127fb36d"),
                "counter" : NumberLong(2)
        },
        "operationTime" : Timestamp(0, 0),
        "ok" : 0,
        "errmsg" : "Oplog collection reads are not allowed while in the rollback or startup state.",
        "code" : 13436,
        "codeName" : "NotMasterOrSecondary",
        "$clusterTime" : {
                "clusterTime" : Timestamp(1610606466, 4),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        }
}

That should be run on a member of your existing cluster, not the new one.


The problem was fixed when I increased the following parameters:
db.adminCommand( { setParameter: 1, initialSyncTransientErrorRetryPeriodSeconds: 864000 } )
db.adminCommand( { setParameter: 1, oplogInitialFindMaxSeconds: 600 } )
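One caveat not mentioned in the thread: parameters set at runtime with `db.adminCommand({ setParameter: ... })` do not survive a `mongod` restart. To make them permanent, they can also be set in the mongod configuration file (YAML), for example:

```yaml
# mongod.conf — persist the same values across restarts
setParameter:
  initialSyncTransientErrorRetryPeriodSeconds: 864000  # retry resume for 10 days
  oplogInitialFindMaxSeconds: 600                      # allow the first oplog find up to 10 min
```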
