Looks like you have 252 GB of RAM for 11.5 + 10.5 = 22 GB of data, which is complete overkill, so in theory it should work without any problem.
Your oplog is “oversized”: you have 18376h of history in it, which is more than 2 years! That’s VERY comfortable. Usually a few days is more than enough to let you resync a server that was down for a few hours.
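If you want to double-check that window yourself, `rs.printReplicationInfo()` from mongosh on any member shows the configured oplog size and the “log length start to end”, i.e. the replication window (presumably where your 18376h figure comes from):

```js
// From mongosh on any member: prints the configured oplog size and
// "log length start to end", i.e. the replication window.
rs.printReplicationInfo()

// The same data as a plain object, if you want to script a check on it.
db.getReplicationInfo()
```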
There is a procedure to resize the oplog but, to be honest, I wouldn’t bother. A large oplog is just more comfortable, and you clearly have the room, so it shouldn’t be an issue.
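For reference, on MongoDB 3.6+ with WiredTiger the resize is a single admin command that has to be run on each member individually. Purely as a sketch (the 10240 value below is an arbitrary example, not a recommendation):

```js
// Run against each member you want to resize; the size is in MB and the
// minimum allowed value is 990 MB. This only changes the cap, it doesn't
// touch the existing oplog entries.
db.adminCommand({ replSetResizeOplog: 1, size: 10240 })
```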
You mentioned that you have 3 servers (PSS). Are the 3 servers identical?
If one of your servers is completely out of sync, at this point I would completely reset that server and restart it from a backup. It’s probably easier, because the sync would only have to replay the difference between the primary and the backup, so hopefully that shouldn’t take more than a few minutes or seconds (if your backup is recent, and it should be).
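Just to be explicit about what I mean (the exact backup/restore method and the out-of-shell steps are placeholders here, adapt them to your setup), the “start over” approach would look roughly like this from mongosh:

```js
// 1. On the stale secondary: shut mongod down cleanly.
db.getSiblingDB("admin").shutdownServer()

// 2. Outside of mongosh: either empty that member's dbPath (which forces a
//    full initial sync from another member) or restore your most recent
//    backup into the dbPath, then start mongod again with the exact same
//    replica set configuration.

// 3. From any other member: watch the node move through STARTUP2 /
//    RECOVERING back to SECONDARY while it replays the remaining oplog.
rs.status().members.forEach(m => print(m.name, m.stateStr))
```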
It’s weird that your oplog is that big. By default, it’s supposed to be 5% of free disk space, capped at 50 GB… So I guess you either have a very large disk or a specific value in your config file.
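You can confirm which one it is directly from mongosh: `getCmdLineOpts` shows the parsed config (so `replication.oplogSizeMB` will show up there if it was set explicitly), and the stats on `local.oplog.rs` give the actual cap:

```js
// Shows the replication section of the parsed config, if any
// (oplogSizeMB will appear here when it's set explicitly).
db.adminCommand({ getCmdLineOpts: 1 }).parsed.replication

// Actual maximum size of the capped oplog collection, scaled to MB.
db.getSiblingDB("local").oplog.rs.stats(1024 * 1024).maxSize
```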
From this log, I guess your cluster has been failing since at least February 1st, and it’s now trying to catch up on all the write operations that happened since then.
So the question is: what is the fastest way to recover your secondary? Let it replay 4 months’ worth of write operations, or reset everything and restart from a backup that will only need to replay a few hours’ worth of write operations (now minus backup time)?
From what I see, it’s completely normal to have these operations logged. Operations slower than 100 ms are logged by default, and this kind of operation is expected with an oplog this large. It’s not an issue, and it will stop once your 3 nodes are in sync.
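For what it’s worth, that 100 ms threshold is the `slowms` setting; you can check it, and raise it temporarily if the log noise bothers you while the secondary catches up (500 below is just an arbitrary example):

```js
// Current threshold (slowms) and profiler state.
db.getProfilingStatus()

// Keep the profiler off (level 0) but only log operations slower than 500 ms.
db.setProfilingLevel(0, { slowms: 500 })
```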
Also, it’s apparently syncing from your other secondary (see the readPreference in that log line), so your primary shouldn’t be impacted at all and your client workload should be fine, I guess.
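If you want to confirm that from the replica set itself rather than from the log line, `rs.status()` exposes the sync source of each member, and `rs.printSecondaryReplicationInfo()` shows how far behind the lagging one still is (the field names below are from recent MongoDB versions; older ones use `syncingTo` instead of `syncSourceHost`):

```js
// Who each member is syncing from, and its current state.
rs.status().members.map(m => ({
  name: m.name,
  state: m.stateStr,
  syncSourceHost: m.syncSourceHost
}))

// Replication lag of each secondary relative to the primary.
rs.printSecondaryReplicationInfo()
```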