Sometimes when MongoDB users evaluate Cloud Manager, they question whether they need a backup if they are already using replication. Doesn’t replication already protect their data sufficiently?
The answer can be reduced to a single distinction: replication is for availability and backups are for disaster recovery.
Availability is a measure of the percentage of time an application is working and accessible to end users at full capacity.^1 Over time, every infrastructure suffers predictable failures that reduce this availability. All drives eventually fail, often without warning. So do power supplies, network cards, and other individual machine components. Sometimes your server runs out of disk space, at which point you’ll need to take it offline to install a bigger disk. These issues affect cloud infrastructures as well, although in different ways. For example, you might decide you need to re-provision a machine with twice as much RAM.
All these issues would lead to your system being unavailable. To insulate infrastructures from these completely inevitable events, we build redundancy into our systems. If a component becomes unavailable, a standby system takes over immediately and transparently. The MongoDB feature that enables redundancy is called Replication.
For those not familiar, when you organize MongoDB into a replica set, all members of the set stay in sync. All database writes go to the primary member of the replica set and are quickly synced to all secondary members. In the event a primary node fails, an election takes place between the remaining members and a new primary is chosen. Members can also be removed from the set for maintenance, and reintegrated seamlessly.
Replication ensures that in the event of a node failure, your application will still be available. Failover occurs automatically within seconds and with few exceptions will be invisible to end users. As a result, individual machines can fail, or be serviced, without impacting your availability.
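To make the mechanics concrete, here is a minimal in-memory sketch of the behavior described above: writes go to the primary, are copied to the secondaries, and a surviving member takes over when the primary fails. The `Member` and `ReplicaSet` classes and the node names are hypothetical, and promoting the first healthy member is a stand-in for a real election, not MongoDB's actual protocol.

```python
# Toy in-memory model of a replica set. Illustrative only: this is not
# MongoDB's replication or election protocol.
class Member:
    def __init__(self, name):
        self.name = name
        self.data = {}        # stand-in for this member's copy of the data
        self.healthy = True


class ReplicaSet:
    def __init__(self, names):
        self.members = [Member(n) for n in names]
        self.primary = self.members[0]

    def write(self, key, value):
        # All writes go to the primary, then are copied to healthy secondaries.
        self.primary.data[key] = value
        for m in self.members:
            if m is not self.primary and m.healthy:
                m.data[key] = value

    def fail_primary(self):
        # Mark the primary as down and promote the first healthy secondary.
        # (A real election involves voting; this is a simplification.)
        self.primary.healthy = False
        candidates = [m for m in self.members if m.healthy]
        if not candidates:
            raise RuntimeError("no healthy members left")
        self.primary = candidates[0]

    def read(self, key):
        return self.primary.data[key]


rs = ReplicaSet(["node-a", "node-b", "node-c"])
rs.write("order:1", "paid")
rs.fail_primary()              # node-a goes down
print(rs.primary.name)         # node-b
print(rs.read("order:1"))      # paid
```

The write made before the failure is still readable afterward, which is the availability guarantee replication is meant to provide.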
Disaster recovery, in contrast, is about dealing with events that are far less predictable, and which by their nature defeat the protections offered by redundancy.^2 These events fall into two categories. The first category is human error and the second is catastrophic failure.
In the category of human error, you have application bugs, deliberate hacking and accidental deletion or corruption of all data on the primary node. In all those cases, the errors introduced to the primary will propagate automatically to all members of your replica set, often within seconds! Given that human error is just as guaranteed as disk failure, this alone is enough reason to have backups.
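The following sketch illustrates why replication cannot help here. Three plain dictionaries stand in for the members of a replica set, and a separate deep copy stands in for a backup taken before the mistake. The names (`replicate`, `snapshot`, the `users` list) are made up for the example.

```python
import copy

# Three dicts stand in for the members of a replica set (illustrative only).
primary = {"users": ["ann", "bob", "cy"]}
secondaries = [copy.deepcopy(primary), copy.deepcopy(primary)]


def replicate(op):
    """Apply the same operation on the primary and on every secondary."""
    op(primary)
    for s in secondaries:
        op(s)


# A backup is a frozen copy taken before the mistake and kept outside the set.
snapshot = copy.deepcopy(primary)

# The mistake: a buggy script empties the collection on the primary.
replicate(lambda db: db["users"].clear())

print(primary["users"])                     # []
print([s["users"] for s in secondaries])    # [[], []]
print(snapshot["users"])                    # ['ann', 'bob', 'cy']
```

Every member faithfully reproduces the bad write, so only the copy held outside the set still contains the data.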
The category of catastrophic failure includes any scenario that permanently destroys all members of your replica set. If you keep all replica set servers in the same data center, a fire that destroys the servers would qualify. So would a disgruntled employee who deliberately deletes the data.
For these lower probability events, you want a relatively inexpensive solution that is well isolated from your production system. The MongoDB feature that enables protection against disasters is Backup.
All backup systems share the characteristic that they offer a snapshot of your system at a past moment in time. That this snapshot is forever frozen is a critical feature of the backup. Backup snapshots should be stored far from your production system: physically, logically, and administratively. Restoring the backup snapshot rolls back the clock on the event that caused the loss of all data across your replica set.
Taking backups of MongoDB systems is incredibly easy, so you have no excuse for not developing a comprehensive backup strategy. MongoDB Cloud Manager, the hosted backup solution from MongoDB, Inc., provides a number of extra benefits.
Cloud Manager further reduces the window of data loss vulnerability by continually recording your replica set oplog and providing point-in-time recovery for the most recent 24 hours (in addition to keeping the point-in-time snapshots). This is the holy grail for backup: recovery to an arbitrary moment in time before the error occurred.
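Conceptually, point-in-time recovery combines a snapshot with a replay of the operations recorded after it, stopping just before the bad one. The sketch below shows that idea with a simplified, made-up oplog format (`ts`, `kind`, `key`, `value`) and hypothetical helpers `apply_op` and `restore_to`. Real MongoDB oplog entries look different, and this is not how Cloud Manager is implemented.

```python
import copy


def apply_op(state, op):
    """Apply one simplified oplog entry to a dict-based 'database'."""
    if op["kind"] == "set":
        state[op["key"]] = op["value"]
    elif op["kind"] == "delete":
        state.pop(op["key"], None)


def restore_to(snapshot, snapshot_ts, oplog, target_ts):
    """Rebuild the state as of target_ts from a snapshot plus later entries."""
    state = copy.deepcopy(snapshot)   # never mutate the frozen snapshot
    for op in oplog:
        if snapshot_ts < op["ts"] <= target_ts:
            apply_op(state, op)
    return state


snapshot = {"a": 1}                   # snapshot taken at ts=100
oplog = [
    {"ts": 101, "kind": "set", "key": "b", "value": 2},
    {"ts": 102, "kind": "set", "key": "c", "value": 3},
    {"ts": 103, "kind": "delete", "key": "a"},   # the accidental delete
]

# Recover to the moment just before the accidental delete.
restored = restore_to(snapshot, 100, oplog, target_ts=102)
print(restored)   # {'a': 1, 'b': 2, 'c': 3}
```

Restoring the snapshot alone would lose the writes at 101 and 102, while replaying the full log would repeat the delete. Stopping at a chosen timestamp keeps the good writes and discards the bad one.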
Cloud Manager is also separate from Amazon Web Services, a popular place to put production data. That means that failures that affect AWS will be uncorrelated with failures that might affect the Cloud Manager infrastructure (and we have our own levels of redundancy). That’s good for any insurance policy.
Restoring from backup (your own, or Cloud Manager’s) is not instantaneous. You incur some downtime. But because backup covers catastrophes, this is acceptable. If you are restoring from backup, your application is already so broken that the cost of the downtime is less than the cost of continuing to run with bad or missing data.
To sum up: you want backup to cover the events that you hope never to see happen. (If they are happening often, you need a new process.) Replication, on the other hand, offers fault tolerance against events that are fairly common and must be addressed without the user being aware of them.
Long story short, you need both backup and replication. They address different sets of risks.