Developing the MongoDB Backup Service, From Prototype to Production



In this series, the MMS team presents an in-depth look at how we developed the MongoDB backup capabilities in MMS. It should not surprise readers to learn that MMS Backup is a dogfood project… we use MongoDB in a few places, not just the obvious one. We thus have an opportunity to show in intimate detail how we ourselves implement the best practices we teach our community via our educational channels. 

In this opening post, we’ll look at how we approached the overall project, identifying the challenges we’d face and setting the roadmap from prototype to general release. In part two, we’ll cover the development of the client-side agent MMS Backup users drop into their infrastructure. In part three, we’ll look at the ingestion side at the MongoDB Mothership, and in parts four and five, we’ll cover how we evolved our storage technology from a simple file archive to a deduplicating blockstore using MongoDB as our underpinning technology.

Let’s get started with a quick overview of MMS Backup.

Defining the Service

MongoDB Management Service transparently replicates MongoDB instances to cloud storage, taking a snapshot of each database every six hours by default. It maintains these snapshots for 2 days, and then begins aging them out, keeping dailies for a week, weeklies for a month, and monthlies for a year (this retention policy is configurable). It lets clients restore from any of those snapshots, or to the state of the database at any point in time within the last 24 hours.
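One way to picture a point-in-time restore is as a snapshot plus a replay of the oplog entries recorded after it. The sketch below is a minimal illustration under stated assumptions, not the MMS Backup implementation: the entry layout (`ts`, `op`, `o`, `o2`) is a simplified stand-in for real oplog documents, updates are modeled as plain field merges, and `apply_op` and `restore_to` are hypothetical names.

```python
import copy


def apply_op(collection, op):
    """Apply one simplified oplog entry to a dict of documents keyed by _id."""
    kind = op["op"]
    if kind == "i":
        # Insert: store a copy of the full document.
        collection[op["o"]["_id"]] = dict(op["o"])
    elif kind == "u":
        # Update (simplified assumption): merge the changed fields into the document.
        collection[op["o2"]["_id"]].update(op["o"])
    elif kind == "d":
        # Delete: remove the document if it is present.
        collection.pop(op["o"]["_id"], None)
    else:
        raise ValueError(f"unknown op type: {kind!r}")


def restore_to(snapshot, oplog, target_ts):
    """Rebuild the state at target_ts from a snapshot and the entries after it."""
    state = copy.deepcopy(snapshot["data"])  # never mutate the stored snapshot
    for op in sorted(oplog, key=lambda entry: entry["ts"]):
        if snapshot["ts"] < op["ts"] <= target_ts:
            apply_op(state, op)
    return state


snapshot = {"ts": 100, "data": {1: {"_id": 1, "name": "a"}}}
oplog = [
    {"ts": 110, "op": "i", "o": {"_id": 2, "name": "b"}},
    {"ts": 120, "op": "u", "o2": {"_id": 1}, "o": {"name": "a2"}},
    {"ts": 130, "op": "d", "o": {"_id": 2}},
]

# State just after the update at ts=120 but before the delete at ts=130.
restored = restore_to(snapshot, oplog, 125)
```

Under this model, holding 24 hours of oplog entries alongside the snapshots is what would make any moment in that window reachable.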

The Challenges

Given the above design specs, we faced these core challenges:

  1. Capacity planning without prior knowledge of our clients’ workloads.
  2. Providing clients a lightweight agent that installs and configures easily, and that doesn’t impact client production environments.
  3. Storing client data efficiently.

Capacity Planning

Capacity planning is hard even when you know a lot about how your application is used and how resources are consumed. Building a service meant to handle arbitrary client workloads is harder. We did know that for each of the replica sets we would back up, we would need certain resources, but the extent of each need would be different for each client. For each replica set, we need:

  1. Disk space to hold the database
  2. Disk space to hold 24 hours of oplogs, as well as extra space to hold more while snapshots are being taken
  3. Disk space to hold one snapshot every six hours for X months
  4. Network bandwidth to ingest the initial data and a constant stream of oplogs
  5. Disk IO for applying oplogs
  6. CPU cycles and RAM to run MongoDB and apply oplogs to our copies of our clients’ data
  7. CPU cycles and RAM to compress data (an optimization)

A Lightweight Agent

By “transparently replicates”, we mean that MMS Backup is completely invisible to our clients’ infrastructure, requiring no configuration changes to their replica sets, and leaving no trace of the service in their status outputs. It is also fairly intangible, adding no more load than any other secondary replica set member would.

To accomplish this, we need to provide our clients with a software agent that is installed inside their infrastructure, with access to their MongoDB replica sets. The agent accesses their collections and oplogs directly, but conceptually works just like a hidden secondary. Installing this agent needs to be as easy as we can make it.

Efficient Data Storage

Our default snapshot retention policy is to maintain:

  • 6-hour interval snapshots for 2 days,
  • Daily snapshots stored for 1 week,
  • Weekly snapshots stored for 1 month,
  • and Monthly snapshots stored for 1 year.

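Under that policy, each client holds 8 + 7 + 4 + 12 = 31 snapshots at any one time. If we have 100 clients, each with a single 1 TB database, storing every snapshot as a full copy makes 3,100 TB, without oplog storage. Clearly, finding some way to optimize our use of disk space represents a special subset of challenges in our capacity planning.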
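The arithmetic behind that figure can be reproduced in a few lines. This is a back-of-the-envelope sketch, assuming each retention tier is counted separately with no overlap, a month is treated as four weeks, and every snapshot is stored as a full uncompressed copy; the names `RETENTION`, `snapshots_per_client`, and `naive_storage_tb` are illustrative only.

```python
# Default retention policy from the post, as (tier, snapshots retained).
RETENTION = [
    ("6-hourly for 2 days", 2 * 24 // 6),  # 8 snapshots
    ("daily for 1 week", 7),
    ("weekly for 1 month", 4),             # assumption: 4 weeks per month
    ("monthly for 1 year", 12),
]


def snapshots_per_client(policy=RETENTION):
    """Total snapshots held per client if every tier is counted separately."""
    return sum(count for _, count in policy)


def naive_storage_tb(clients, db_size_tb, policy=RETENTION):
    """Storage needed if every snapshot is a full copy of the database."""
    return clients * db_size_tb * snapshots_per_client(policy)


total_tb = naive_storage_tb(clients=100, db_size_tb=1)
print(snapshots_per_client(), total_tb)  # prints: 31 3100
```

Changing the policy tuples or the database size shows how quickly the naive approach grows, which is why storage efficiency appears as its own challenge.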

Working on The Right Stuff First

Every project runs the risk of delivery delays, performance problems, and outright failure. The agile methodologies that have been refined over the past decades all share one principle: these risks are directly proportional to our degree of ignorance. It can be ignorance of anything, such as the resources required to handle load, the time required to develop a feature, or how well a feature will suit the targeted need. We mitigate risk by building working systems as early as possible, focusing on the areas of highest ignorance, and using the feedback from rapid iterations to reduce that risk.

This suggests starting with the components that require the least effort to use and offer the greatest chance of delivering a functioning system. Once a system is working and feedback is flowing, you can begin to improve on the parts that are insufficient.

Defining Our Roadmap

With that framework in mind, we laid out what we needed to deliver our features and identified where these needs could not be met with existing components, and where the needs were least defined.

Network Bandwidth at Our Datacenter

While we could not know how much we would need, the solution – paying a network provider to bring capacity to you – is well known, and scaling up capacity can be done relatively quickly. This is therefore a low-risk area.

Overall Capacity Planning

There was great risk of harm to the project had this been improperly handled, but as described above, proper planning was impossible before we had some data from a running system. Thus, our approach here was to define the classes of missing knowledge, so we could gather that knowledge during the prototyping, alpha, and beta stages, and be ready to apply what we learned.

Efficient Storage

Deduplicating large quantities of highly redundant data is a well-known problem class, so ignorance of how exactly we would tackle it introduced only moderate risk, even though failing to address it properly would be very harmful to our infrastructure. Since skipping this step would still produce a working system, we could make do with directories of *.tar.gz files and work on higher-risk areas first.
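To give a feel for that problem class, here is a generic content-addressed store that keeps each unique block only once. It is a minimal sketch of the general technique, not the design MMS Backup eventually adopted (later parts of the series cover that); the `BlockStore` class, the fixed block size, and the in-memory dict are all assumptions made for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose, so the example is easy to follow


class BlockStore:
    """Stores fixed-size blocks keyed by the SHA-256 digest of their content."""

    def __init__(self):
        self.blocks = {}  # hex digest -> block bytes

    def put(self, data):
        """Split data into blocks, store unseen blocks, return the digest list."""
        digests = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            self.blocks.setdefault(digest, block)  # identical blocks stored once
            digests.append(digest)
        return digests

    def get(self, digests):
        """Reassemble the original bytes from a digest list."""
        return b"".join(self.blocks[d] for d in digests)


store = BlockStore()
snap1 = store.put(b"AAAABBBBCCCC")
snap2 = store.put(b"AAAABBBBDDDD")  # shares its first two blocks with snap1
```

Two snapshots of six blocks in total occupy only four stored blocks here, because the shared content is referenced rather than copied.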

The Agent

An improperly implemented agent would cause great harm to the backup system, and while the logic for replicating a client database was already completely described by the behavior of MongoDB secondaries, we could not produce a working system without this component. This meant that our very first task was to build the MMS Backup agent.

Stay Tuned

In our next piece, we’ll look at the path we took developing that agent and why we switched from Java to Go, and we’ll go into gritty detail regarding its workings.