Hi @Bhuvanesh_R and welcome to the MongoDB Community!
First, let’s talk about RPO vs RTO:
What is RPO?
Recovery Point Objective (RPO) generally refers to the maximum amount of data a business can afford to lose before significant harm occurs, measured as the window of time between a critical event and the most recent preceding backup.
What is RTO?
Recovery Time Objective (RTO) often refers to the maximum amount of time an application, system, and/or process can be down without causing significant damage to the business, including the time spent restoring the application and its data.
I think you meant RPO instead of RTO, because restoring an entire 30 TB, 5-node cluster in 5 min… Good luck with that.
Now I’ll assume the goal is a 5 min RPO (i.e., at most 5 minutes of lost data).
But first of all, 30 TB of data in a single Replica Set (RS) is HUGE. MongoDB customers are usually advised to shard their cluster once it reaches 2 TB of data. Sometimes, depending on the use case and after some discussions with a Technical Services Engineer (TSE), they can push to 4 TB, but rarely beyond that.
Usually a healthy MongoDB cluster needs RAM equal to about 15 to 20% of its data size. So if you have 30 TB, I would recommend ~6 TB of RAM on each machine in your RS…
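To put numbers on that rule of thumb (the 15–20% ratio above is a sizing heuristic, not a hard guarantee; actual needs depend on your working set):

```python
# Back-of-the-envelope RAM sizing per replica set member, using the
# 15-20% working-set rule of thumb discussed above (an assumption,
# not an official formula).
def ram_needed_tb(data_tb: float, ratio: float = 0.20) -> float:
    """RAM recommended per node for a given data size, in TB."""
    return data_tb * ratio

single_rs_ram = ram_needed_tb(30)  # ~6 TB of RAM per node for 30 TB of data
sharded_ram = ram_needed_tb(2)     # ~0.4 TB (400 GB) per node at 2 TB per shard
```

That ~6 TB per node figure is exactly why sharding at 2 TB keeps the hardware sane.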
So to sum up, you should shard.
That being said, let’s get back to the backup problem.
I’m not super familiar with AWS and EBS snapshots, but just to be on the safe side, I would run db.fsyncLock() (doc) on the node you want to snapshot before taking the snapshot. This forces the node to flush all pending write operations to disk and locks the entire mongod instance. I think this is the better way to ensure the snapshot is consistent.
Let’s talk about the oplog now. If you want a 5 min RPO, you have to be able to replay the oplog from the time of the last snapshot up to the desired timestamp, which means you need to record the oplog continuously in another cluster elsewhere.
Which brings us to an important property:
The oplog is idempotent.
This means you can replay any portion of the oplog you have, whatever the snapshot time, and you will always end up in the same state.
Let’s say your snapshot was done at 10 am and you had a crash at 10:20 am. At 11 am you have restored your 10 am snapshot to 5 brand new machines (good luck with 30 TB… that’s why sharding is also important for the RTO strategy). You can now apply the oplog from 9 am up to the last oplog entry you have (probably around 10:19:55 am), and the final result will be in exactly the same state the collection was in at 10:19:55 am. You could also choose to replay ALL the oplog you have since 3 days ago, or just replay from 9:59 am; you would end up in the same state.
As long as you make sure you don’t start replaying the oplog from a point after the snapshot time (like 10:01 am), you are good to replay whatever you like.
It’s also the reason it’s important to keep a large oplog window (like 3 days): you can restore the cluster to any state within those 3 days, provided you have a snapshot taken before the target time that is still covered by the oplog.
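Here is a toy model of that idempotency argument (this is not the real oplog format; it just uses full-document "set" entries, which are idempotent by construction, to show that any replay start at or before the snapshot time converges to the same state):

```python
# Toy oplog: (timestamp, _id, full document state). Applying an entry
# once or many times yields the same result, i.e. it is idempotent.
oplog = [
    (1, "doc_a", {"x": 1}),
    (2, "doc_b", {"y": 2}),
    (3, "doc_a", {"x": 10}),
    (4, "doc_b", {"y": 20}),
]

def replay(snapshot: dict, entries, start_ts: int) -> dict:
    """Apply every entry with ts >= start_ts on top of a snapshot."""
    state = dict(snapshot)
    for ts, doc_id, doc in entries:
        if ts >= start_ts:
            state[doc_id] = doc
    return state

# Snapshot taken at ts=2 (contains everything up to and including ts=2).
snapshot_at_2 = {"doc_a": {"x": 1}, "doc_b": {"y": 2}}

# Replaying from ts=1 or ts=2 converges to the same final state:
assert replay(snapshot_at_2, oplog, 1) == replay(snapshot_at_2, oplog, 2)

# But starting AFTER the snapshot time (ts=4) skips entries and diverges,
# which is exactly the "don't start after the snapshot" rule above:
assert replay(snapshot_at_2, oplog, 4) != replay(snapshot_at_2, oplog, 1)
```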
I hope it makes sense.
Just a closing comment about why sharding helps the RTO: it’s easier to bring back up 15 shards (15 × 3 nodes) with 2 TB on each than to restore 30 TB on a single 3-node RS. When you are sharded, you can run all the data transfers in parallel, so your final RTO is (transfer time of 2 TB + 1 h of maintenance & machine provisioning + time to replay the oplog). If you are on a single RS, then your RTO starts with (transfer time of 30 TB).
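A quick back-of-the-envelope comparison of those two RTOs (the 1 TB/hour transfer rate and the replay/maintenance durations are assumptions for illustration; real numbers depend on your network and storage throughput):

```python
# Assumed figures, for illustration only.
RATE_TB_PER_HOUR = 1.0     # restore transfer throughput (assumption)
MAINTENANCE_HOURS = 1.0    # provisioning + maintenance, as in the text
OPLOG_REPLAY_HOURS = 0.5   # oplog replay time (assumption)

def rto_hours(largest_transfer_tb: float) -> float:
    # Shard restores run in parallel, so only the largest single
    # transfer is on the critical path.
    return (largest_transfer_tb / RATE_TB_PER_HOUR
            + MAINTENANCE_HOURS + OPLOG_REPLAY_HOURS)

single_rs_rto = rto_hours(30)  # one RS must move all 30 TB -> 31.5 h
sharded_rto = rto_hours(2)     # 15 shards, 2 TB each, in parallel -> 3.5 h
```

Even with generous assumptions, the single-RS restore is dominated by that one 30 TB transfer.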
Oh, and one last thing: everything we talked about here is entirely automated, coded, carefully designed, and implemented in MongoDB Atlas.
Restoring an entire cluster from a snapshot plus an oplog replay is about 10 mouse clicks, tops.