Hi @Timur_Pyatkov welcome to the community!
There is a lot of information to unpack here, but if I understand correctly, this is the situation:
- You are attempting to add a member to an existing single-node replica set. However, this fails because the existing member becomes overloaded or is OOM-killed while the initial sync is running.
- The deployment is being used in a production capacity while the initial sync is progressing.
- The deployment works well as a single-node replica set. The issues only start happening after adding a new replica set member.
Are those accurate descriptions of the issue you’re seeing?
If yes, it seems that your deployment’s resources are just enough to service your production load but cannot handle the combined load of production plus initial sync. The initial sync process is relatively resource intensive, since it requires the sync source to do a collection scan of every collection (i.e. read all of the data into memory) to send it to the syncing secondary.
The WiredTiger metrics you posted point to a heavily overloaded system. At these numbers, I would expect the database to stop responding to inputs, since it has such a large backlog of work and the underlying hardware cannot service that work (e.g. reads and writes to disk) fast enough. I believe you observed this stalling as well.
Since the disk cannot service the work fast enough, MongoDB was forced to hold onto an increasing amount of pending work in memory, resulting in increasing memory usage until it was killed by the OOMkiller.
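If you want to put rough numbers on this, here is a minimal Python sketch that reads the WiredTiger cache counters from a `serverStatus` document and reports how full and how dirty the cache is. The function names are made up for illustration, and the 95% / 20% thresholds are an assumption based on WiredTiger's commonly documented default eviction triggers, so adjust them if your configuration differs.

```python
def cache_pressure(server_status: dict) -> dict:
    """Compute cache fill and dirty ratios from a serverStatus document."""
    cache = server_status["wiredTiger"]["cache"]
    max_bytes = cache["maximum bytes configured"]
    used_bytes = cache["bytes currently in the cache"]
    dirty_bytes = cache["tracked dirty bytes in the cache"]
    return {
        "fill_ratio": used_bytes / max_bytes,
        "dirty_ratio": dirty_bytes / max_bytes,
    }


def looks_overloaded(pressure: dict,
                     fill_trigger: float = 0.95,
                     dirty_trigger: float = 0.20) -> bool:
    """Assumed thresholds: past these, application threads are typically
    pulled into eviction work, which shows up as stalls."""
    return (pressure["fill_ratio"] >= fill_trigger
            or pressure["dirty_ratio"] >= dirty_trigger)


def fetch_server_status(uri: str) -> dict:
    """Fetch serverStatus from a live server (requires pymongo)."""
    from pymongo import MongoClient  # imported lazily; only needed here
    return MongoClient(uri).admin.command("serverStatus")


if __name__ == "__main__":
    # Hand-written sample document, not real output from any server.
    sample = {"wiredTiger": {"cache": {
        "maximum bytes configured": 1000,
        "bytes currently in the cache": 970,
        "tracked dirty bytes in the cache": 100,
    }}}
    p = cache_pressure(sample)
    print(p, looks_overloaded(p))
```

Sampling this every few seconds on the sync source during an initial sync attempt should show whether the ratios sit at or above the triggers for long stretches, which would be consistent with the stalls described above.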
I can think of some possible ways to move forward:
- Increase the hardware provisioning for the server until it can handle the production+initial sync load without stalling. Typically this means much more memory and faster disks. This may be the simplest, most straightforward method.
- Stop production workload while the initial sync is progressing, which could take a while since you have a large amount of data.
- Follow the procedure to Sync by Copying Data Files from Another Member. Note that this requires downtime while file copying is happening, and is the most complicated method of the three.
On another note, the oplog can actually grow beyond its maxSize (see Replica Set Oplog) if that is required for the server to keep working correctly. This is one sign that the server is overloaded.
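To check whether the oplog has grown past its configured size, a small sketch like the following compares the `size` and `maxSize` fields from `collStats` on `local.oplog.rs`. The helper names are hypothetical, and the sample document is hand-written rather than taken from your deployment.

```python
def oplog_usage(coll_stats: dict) -> dict:
    """Compare the oplog's logical size with its configured maximum."""
    size = coll_stats["size"]         # current logical size in bytes
    max_size = coll_stats["maxSize"]  # configured cap in bytes
    return {
        "size_bytes": size,
        "max_bytes": max_size,
        "ratio": size / max_size,
        "over_max": size > max_size,
    }


def fetch_oplog_stats(uri: str) -> dict:
    """Fetch collStats for local.oplog.rs (requires pymongo)."""
    from pymongo import MongoClient  # imported lazily; only needed here
    return MongoClient(uri).local.command("collStats", "oplog.rs")


if __name__ == "__main__":
    # Hand-written sample: a 1 GiB cap with 1.5 GiB currently in use.
    sample = {"size": 3 * 2**29, "maxSize": 2**30}
    print(oplog_usage(sample))
```

If `over_max` stays true while the initial sync is running, that matches the overload picture described above.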
Best regards,
Kevin