Backup strategy for the replica set with Filesystem backup

Im planning to create a backup policy for my mongo DB setup. It has 5 nodes in the cluster and 30TB of data on each node.

I want to meet the RTO for 5minutes. Im using AWS Cloud. So planning to use EBS snapshots for this. In my case, the journaling files are located in the same disk where data files are located. But I have one more cluster(5TB data) where journaling is on another disk.

Need your suggestions to implement a better backup solution.

This is what Im thinking. Take a snapshot every 1hr on the primary node(the script will check who is primary). Then keep archiving the oplog data into somewhere(continuos sync) . But don’t know how it is possible.

Recovery scenario:

Snapshot was taken: 10AM
Oplog: continuos sync
Cluster crash: 10:20 AM

And if I want to perform point in time recovery then I can restore the most recent AWS snapshot but on top of it I need to apply the oplog, But at which point do I need to start restoring the oplog? Because during the snapshot I don’t know the last committed transition. So Any pointers to know about this?

Hi @Bhuvanesh_R and welcome in the MongoDB Community :muscle: !

First, let’s talk about RPO vs RTO:

What is RPO?

Recovery Point Objective (RPO) generally refers to the amount of data that can be lost within a period most relevant to a business, before significant harm occurs, from the point of a critical event to the most preceding backup.

What is RTO?

Recovery Time Objective (RTO) often refers to the quantity of time that an application, system and/or process, can be down for without causing significant damage to the business as well as the time spent restoring the application and its data.

I think you meant RPO instead of RTO. Because restoring an entire 30 TB 5 nodes cluster in 5 min… Good luck with that.

Now I’ll assume the goal is a 5 min RPO (=maximum 5 minutes of lost data).

But first of all, 30 TB of data in a single Replica Set (RS) is HUGE. Usually MongoDB clients are recommended to shard their cluster when they reach 2TB of data. Sometimes, depending on the use case and after some discussions with the Technical Service Engineer (TSE), they can push to 4TB of data but not rarely over that.

Usually a healthy MongoDB Cluster needs about 15 to 20% of its storage amount in RAM. So if you have 30 TB, I would recommend you to have ~6 TB of RAM on each machine in your RS…

So to sum up, you should shard.

That being said, let’s get back to the backup problem.

I’m not super familiar with AWS and EBS snapshots. But just to be on the safe side, I would db.fsyncLock() (doc) the node you want to snapshot before the snapshot. This forces the node to flush to disk all the pending write operations and lock the entire mongod instance. I think this would be better to ensure consistency of the snapshot.

Let’s talk about the oplog now. If you want a 5 min RPO, you will have to be able to replay the oplog from the time of the last snapshot to the desired timestamp. Which means that you have to record the oplog in another cluster elsewhere.

Which now brings us to this question:

The oplog is idempotent.

This means you can replay the entirety of the oplog you have, whatever the snapshot time, you will always end up in the same state.

Let’s say your snapshot was done at 10am and you had a crash at 10:20am. At 11am you have restored your 10am snapshot to 5 brand new machines (good luck with 30TB… that’s why sharding is also important for the RTO strategy). You can now apply the oplog from 9am => the last oplog entry you got (so probably 10:19 and 55 sec am and the final result will be in the same state than the collection was at 10:19 and 55sec exactly. You could also choose to replay ALL the oplog you have since 3 days ago or just replay from 9:59am, you would be in the same state.

As long as you make sure that you don’t start to replay the oplog after the snapshot time (like 10:01am), you are good to replay whatever you like.

It’s also the reason why it’s important to keep a large amount of oplog (like 3 days). So you can restore the cluster in any state during these 3 days, given that you have a snapshot before that date and still covered by the oplog.

I hope it makes sense. :slight_smile:

Just a closing comment about why sharding helps the RTO: it’s easier to bring back up 2TB on 15 shards (15 * 3 nodes) with 2TB on each than restoring 30TB on 3 nodes. When you are sharded, you can start all the data transfer in parallel and your final RTO will be (transfer time of 2TB + 1h of maintenance & machine provisioning + time to replay oplog). If you are on a single RS, then your RTO starts with (transfer time of 30TB).

Oh and closing statement: Everything that we talked about here is entirely automated, coded, carefully designed and implemented in MongoDB Atlas.

Restoring an entire cluster with snapshot + replay the oplog is like 10 mouse clicks top.

Cheers,
Maxime.

Thanks a lot for your detailed answer. One last question about replaying oplog.

My backup process will be,

  1. Pause writes by issues fsynclock()
  2. Check everything flushed
  3. Get the max oplog position(i hope its a timestamp value)
  4. Take Snapshot of the data disk.
  5. Once snapshot done, resume the writing.

Lets say, the max oplog position from the above backup is 10:02AM.

Then the crash happened at 10:30AM, and i have oplog backup till 10:29AM.

Now restoring the oplog,I’ll set the start time as 10:00 AM.
What happens if any transaction at that particular time is already captured by snapshot(till 10:02AM is already covered in snapshot) and I’m replaying them again via oplog.

Will it cause any duplicates?

No because the oplog is idempotent. You could replay the oplog entirely 10 times, the databases and collection at the end will always end up in the same state.

If you want, the oplog only contains the result of the transaction, not the transaction itself. Each operation you run are transformed into an idempotent entry in the oplog.

For example.

insert doc {name: "Max"} in coll becomes doc {_id: ObjectId(xx), name: "Max"} inserted in coll. When you replay the oplog, it will replay this operation as an upsert with the unique ID. If the doc exists => does nothing. If it doesn’t, it’s created in that state.

Another example. Let’s say you want to $inc an age by 1 (birthday of someone in your person collection).
The result in the oplog is: in persons collection, SET age of doc with _id= ObjectId(xx) to 10.

When you executed that update query, age was 9 => $inc +1 to 10. Only the result of the query is stored in the oplog. Not the query itself. If you replay the oplog. The command is SET AGE 10. Not $inc age by 1.

You can replay this 10 times => Age will still be 10. If the oplog was storing the $inc query, it would be 19. which would not work.

Cheers,
Maxime.