Replica set resync always restarting from scratch

Hello all,

I am a little stuck with a production database that I want to “secure” and “update”.

The current, rather crazy situation is the following (don’t ask me why…):

  • Database version: 3.6 (yes, I know…)
  • Database size: 650 GB (950 GB on disk)
  • Server structure:
    • 1 primary (correctly sized)
    • 1 secondary (a little smaller)
    • 1 tiny arbiter
  • Data:
    • 1 database with 40k+ collections and rising (a design issue on the app side, which creates one collection per IoT device)
    • not enough space on the primary to make a full copy of the data files
    • the server cannot be stopped for long maintenance (30 minutes at most)

My first issue is that I cannot sync the secondary with the primary:

  • It always fails and restarts from scratch (the member stays in STARTUP2).
  • The speed is ~20 Mbit/s (about 4.5 days for a full sync)
  • From what I can see, index building takes an extremely long time
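
As a rough sanity check on the 4.5-day figure, here is a minimal sketch that estimates copy time from a data size and a link rate. It assumes the rate is 20 megabits per second, that sizes are decimal gigabytes, and that the full on-disk size has to be transferred; `estimate_days` is a hypothetical helper, not something mongod provides.

```python
def estimate_days(size_gb: float, rate_mbit_s: float) -> float:
    """Return the days needed to move size_gb (decimal GB) at rate_mbit_s."""
    size_bits = size_gb * 1e9 * 8              # GB -> bits
    seconds = size_bits / (rate_mbit_s * 1e6)  # bits / (bits per second)
    return seconds / 86_400                    # seconds -> days


if __name__ == "__main__":
    for size in (650, 950):
        print(f"{size} GB at 20 Mbit/s: {estimate_days(size, 20):.1f} days")
    # 650 GB at 20 Mbit/s: 3.0 days
    # 950 GB at 20 Mbit/s: 4.4 days
```

Under these assumptions the 950 GB on-disk size comes out at about 4.4 days, which is close to the 4.5 days observed.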

My next steps will be:

  • Add a second replica to the replica set
  • Set up a new staging/test environment
  • Upgrade to 4.0 → 4.2 → 5.0 → (6.0) (the app-side connectors need to be updated)
  • Rework the app to organize the collections better (~20 collections)
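
The upgrade chain can be written down as an ordered list. This is a minimal sketch, assuming MongoDB's usual rule that major release series are upgraded one at a time without skipping; under that assumption 4.4 would sit between 4.2 and 5.0. `RELEASE_SERIES` and `upgrade_path` are hypothetical names used only for illustration.

```python
# Ordered major release series; each hop is assumed to be taken one at a time.
RELEASE_SERIES = ["3.6", "4.0", "4.2", "4.4", "5.0", "6.0"]


def upgrade_path(current: str, target: str) -> list[str]:
    """Return the series to install, in order, to go from current to target."""
    start = RELEASE_SERIES.index(current)
    end = RELEASE_SERIES.index(target)
    if end < start:
        raise ValueError("target is older than current")
    return RELEASE_SERIES[start + 1 : end + 1]


print(" -> ".join(upgrade_path("3.6", "6.0")))
# 4.0 -> 4.2 -> 4.4 -> 5.0 -> 6.0
```

Each hop would typically also involve raising the featureCompatibilityVersion before moving on, which is worth checking against the release notes for each version.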

Do you have any idea how I can proceed or help the sync along?

Error when it occurs:

E REPL [replication-27] Initial sync attempt failed – attempts left: 9 cause: NetworkInterfaceExceededTimeLimit: error fetching oplog during initial sync :: caused by :: error in fetcher batch callback: Operation timed out

Thanks a lot for your help.

Did you check network usage during the sync? The error looks like it is caused by a network timeout. Is the sync source overloaded during the sync?

Hello, thanks for your answer.

At the time of the issue:

  • between 2 and 30 Mbit/s using mongod synchronization
  • 120 Mbit/s using rsync/scp

In the end, I was able to stabilize the cluster, but not with mongod's own synchronization:

  • I had a full backup of the machine that was still within the oplog window
  • I restored it
    • mongod was able to catch up (~20 GB) from that point.
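
The restore worked because the backup was newer than the oldest entry still held in the oplog, so the restored member only had to replay the difference. Here is a minimal sketch of that check, assuming the timestamps have been read by hand (for example from the output of `rs.printReplicationInfo()` on the primary). `backup_is_usable`, the one-hour safety margin and the sample dates are hypothetical.

```python
from datetime import datetime, timedelta


def backup_is_usable(backup_time: datetime,
                     oplog_first_event: datetime,
                     safety_margin: timedelta = timedelta(hours=1)) -> bool:
    """True if the backup is newer than the oldest oplog entry plus a margin."""
    return backup_time >= oplog_first_event + safety_margin


# Hypothetical values, for illustration only.
oplog_first_event = datetime(2023, 1, 10, 8, 0)
print(backup_is_usable(datetime(2023, 1, 11, 2, 0), oplog_first_event))  # True
print(backup_is_usable(datetime(2023, 1, 9, 2, 0), oplog_first_event))   # False
```

The margin is there because the oplog keeps rolling forward while the backup is being restored, so a backup that only just fits may fall out of the window before the member starts catching up.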

Then I added another replica, which synced normally (80 Mbit/s) over several hours.

Now production is “stable”, and I have started testing the migration to newer versions.