Developing the MongoDB Backup Service: The Backup Agent
Welcome back to our series on the creation and workings of the MongoDB Backup Service in MMS. In part one, we reviewed how we assessed the overall challenges we would face in building the system, and how we drew up the roadmap. In part two, we cover the client-side agent responsible for transparently streaming the client’s replica set updates back to the MMS cloud for backup. We’ll also see how MongoDB’s use of the document model for its replication data, rather than a separate binary format, cut down on the code required to implement the agent.
Our First Need(s)
To get a prototype of the backup service up and running, we really only needed two things: the agent, and the service to which it hands off client data (“ingestion”). Everything could be run at testing scale locally, in our datacenters, or in the cloud, and as long as we had those components, we had a working prototype.
To separate responsibilities, we designed the backup agent to care only about the logic of obtaining the data from customer replica sets, and to be responsible for sensibly batching up the data for network transmission. Its responsibility would end when it transferred that data to the ingestion service in the MMS cloud, which would ensure the correctness of the backups, and handle any batching and pipelining beyond that point. To limit exposure, the ingestion service is the only component of the MMS cloud which extends beyond the firewall.
In practice, the ingestion service and agent were developed together, since the agent’s behavior could not be validated without having an endpoint for it to send its data to. However, this post will focus on the development of the agent.
Requirements Review
The agent could not require configuration within a replica set, nor be visible in its status. It had to install easily, minimizing administrative hassle. It had to make efficient use of resources, and put no more load on the replica set than a hidden secondary would.
Language
Our first decision was what language to use. We wanted an expressive, high-level language with lots of available libraries. Our choice also needed to have great support for concurrency. We had used Python for the MMS Monitoring agent, and it was great to develop with, but we had learned that it came with some disadvantages: clients had to manage Python versions and make sure they installed the right libraries, adding to their administrative burden.
What we needed was a language which minimized or eliminated dependencies on packages, and did not require client-side compilation.
Java
Our first choice to meet those needs was Java. Indeed, we had a working Java client by the time the MMS Backup project hit alpha. Java is a great choice for concurrency, runs cross platform, and would not require library management as all the needed libraries would be packaged into a JAR. However, it did require administrators to install and maintain a JVM, rather than being a single executable that clients could download and run.
Let’s Go!
MongoDB engineering had wanted for some time to find the right project to use Go on, and this really fit the bill. We’ve already posted about why Go was the right choice for this, and Sam Helman has done an in-depth talk about it (slides here). The top-level points are:
Go compiles to native binaries, including all libraries rather than linking against them. With no dependencies on runtimes or libraries, we could offer a simple executable download.
Go is strongly idiomatic
Go has an excellent MongoDB driver
Go has excellent linguistic support for concurrency
How the Agent Works
If you understand MongoDB’s replication process, you pretty much understand the data-reading portion of the agent. At a high level, it is very simple, and similar to replicating any running datastore: note the time, copy all the data in the database as of that time, then apply all the operations accumulated in the oplog since the data copy began, continuing indefinitely. The main difference between the backup agent and another replica set member is that the agent accesses the collections as a client, so as to be invisible to the replica set config.
Let’s have a quick look at that oplog.
Oplog Details
The oplog is a built-in capped collection in the local database, named oplog.rs, to which every data-altering operation is written. This is a document from the oplog collection:
{
    "ts" : Timestamp(1393004960, 1),
    "h" : NumberLong("6307784890218124205"),
    "v" : 2,
    "op" : "i",
    "ns" : "foo.docs",
    "o" : {
        "_id" : ObjectId("530791a09b493606b3000068"),
        "name" : "a document"
    }
}
The above record is for the insertion of {"name": "a document"} into the docs collection of the foo database. It specifies a timestamp, a hash of the operation, an oplog format version, the operation type, the namespace (either the database and collection, or <db>.$cmd for things like index creation), and the details of the operation. The oplog document is the fundamental unit of data replication, and the oplog contains one for every insert, update, and delete on documents and indices.
Having update operations stored in this way makes it very simple to write code to process them, without having to worry about parsing a binary format. We can also reuse most of the code that batches, compresses, and transmits the collection data for an initial sync (which we will get to in a moment) to process these updates.
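Here is a minimal sketch of how an oplog document like the one above could map onto a Go struct using the mgo driver of that era (gopkg.in/mgo.v2). The package, type, and field names are ours for illustration, not the agent’s actual code:

package agent

import "gopkg.in/mgo.v2/bson"

// Entry mirrors the fields of the sample oplog document above.
type Entry struct {
    Timestamp bson.MongoTimestamp `bson:"ts"` // when the operation happened
    Hash      int64               `bson:"h"`  // hash identifying the operation
    Version   int                 `bson:"v"`  // oplog format version
    Op        string              `bson:"op"` // "i" insert, "u" update, "d" delete, "c" command
    Namespace string              `bson:"ns"` // "<db>.<collection>" or "<db>.$cmd"
    Object    bson.M              `bson:"o"`  // the inserted document or operation details
}

Because each entry is just BSON, the same unmarshalling path can handle inserts, updates, and deletes alike.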
Instructions From the MMS Cloud
One of our design decisions was to keep as much of the higher-level logic for correct backups as possible in one place. To that end, the backup agent is only given responsibility for operations which require accessing the client’s data directly, under the remote guidance of the backup service. It does not store any state data. When it starts a new tailing routine (part of the agent’s processing pipeline, which we will get to shortly), and at regular intervals, it makes a conf call to the MMS Backup cloud asking what to do.
Ongoing Backups and Start Up
In the general case, the answer to an agent’s conf call contains no instructions for new behavior. The backup agent just continues sending over the documents from the oplog.
When the agent makes its conf call on startup, the response contains the timestamp (cloud_timestamp) of the last oplog document received by the MMS cloud, as well as a hash of that document. The agent requests from the primary mongod in the replica set a tailable cursor on the oplog collection for documents matching {'ts': {'$gte': cloud_timestamp}}. At that point it’s almost ready to start streaming the documents in that cursor back to the MMS cloud.
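As a rough sketch of that step (again using mgo, and again not the agent’s real code), opening and draining the tailable cursor might look like the following, where session is an already-dialed connection to the primary and Entry is the struct from the earlier sketch:

package agent

import (
    "time"

    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// TailFrom streams oplog entries at or after cloudTimestamp onto out.
func TailFrom(session *mgo.Session, cloudTimestamp bson.MongoTimestamp, out chan<- Entry) error {
    oplog := session.DB("local").C("oplog.rs")
    query := oplog.Find(bson.M{"ts": bson.M{"$gte": cloudTimestamp}})

    // Tail blocks for up to the given timeout waiting for new documents,
    // which is what makes the capped oplog collection usable as a stream.
    iter := query.Tail(5 * time.Second)
    defer iter.Close()

    var entry Entry
    for iter.Next(&entry) {
        // In the real agent the very first document is checked against the
        // cloud's hash rather than re-sent; see the error cases below.
        out <- entry
    }
    // A fuller implementation would check iter.Timeout() and resume tailing
    // rather than returning when the cursor simply goes quiet for a while.
    return iter.Err()
}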
Error Cases
The first document in this cursor should be the last document the MMS cloud has already received. It is not sent to the cloud for replication, since the cloud already has it; rather, it is used for error detection. Before it commences streaming, the agent verifies that the hash of the found document matches the hash returned by the conf call, to ensure they are the same document.
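Continuing the same illustrative sketches (the same assumed package, imports, and Entry type), that check might look something like this:

// verifyCommonPoint reports whether the first document on the cursor is the
// same one the MMS cloud last received, identified by timestamp and hash.
func verifyCommonPoint(iter *mgo.Iter, cloudTimestamp bson.MongoTimestamp, cloudHash int64) bool {
    var first Entry
    if !iter.Next(&first) {
        // Nothing at or after the cloud's timestamp: the oplog has moved on.
        return false
    }
    // A matching timestamp and hash mean the cloud's last entry is still in
    // the oplog; the agent skips it and streams from the next document.
    return first.Timestamp == cloudTimestamp && first.Hash == cloudHash
}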
If that document is not the same as the one in the MMS cloud, then it means one of two things. Either the agent has been down for so long that the replica set’s oplog has rolled over, in which case a full resync of the data is required, or the replica set had to perform a rollback following a failover, which can usually be recovered from without a resync.
Recovery can be done using the copy of the client’s oplogs that the MMS cloud maintains for 24 hours. In most cases, rollbacks are detected quickly, so the backup service will have largely valid data in its copy. The service just needs to find a point in its copy which agrees with the client’s oplog, return its replica to that point, and resume from there.
When a MongoDB replica set experiences failover, all clients are disconnected, including the backup agent’s tailing routine. The agent restarts its tailing routine and makes a conf call specifying the error condition. It will execute the MMS cloud’s requests to find earlier points in the oplog, and if it finds one of them, it will be asked to resume tailing from that point.
If no such common point can be found, the MMS cloud will request a full sync.
The Agent Pipeline
All of the functionality we have covered so far belongs to the oplog tailing routine, one of three asynchronously executing goroutines the agent uses to segment its use of resources. Oplog tailing reads from a remote mongod and keeps the results in memory, so it primarily makes use of network resources, with an endpoint at the customer’s database. The other two routines are slicing and compression, which is CPU intensive, and shipping the slices to ingestion, which also makes use of network resources, but with an endpoint at the MMS cloud.
The goroutines communicate with each other using channels in a producer-consumer pattern. The oplog tailing routine just reads documents from the oplog and puts them on a channel, where they are picked up by the slicing and compression routine. That routine stacks the oplog documents into slices, which accumulate until they reach 10MB or one minute of wall-clock time has elapsed. When a slice is complete, the routine compresses it and puts it onto another channel, where it is found by the network communication routine. That routine uses a RESTful protocol to transmit the compressed slices to the ingestion service.
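A simplified sketch of that middle stage follows. The 10MB and one-minute cutoffs come straight from the description above; the use of gzip and the exact payload layout are our own stand-ins for whatever the agent actually puts on the wire:

package agent

import (
    "bytes"
    "compress/gzip"
    "time"

    "gopkg.in/mgo.v2/bson"
)

// sliceAndCompress drains oplog entries from in, cuts a slice at roughly
// 10MB or after a minute of wall-clock time, compresses it, and hands the
// result to the shipping routine via out.
func sliceAndCompress(in <-chan Entry, out chan<- []byte) {
    const maxSliceBytes = 10 << 20 // ~10MB per slice
    ticker := time.NewTicker(time.Minute)
    defer ticker.Stop()

    var pending []Entry
    var pendingBytes int

    flush := func() {
        if len(pending) > 0 {
            out <- compress(pending)
            pending, pendingBytes = nil, 0
        }
    }

    for {
        select {
        case entry, ok := <-in:
            if !ok { // the tailing routine closed the channel
                flush()
                close(out)
                return
            }
            pending = append(pending, entry)
            if raw, err := bson.Marshal(entry); err == nil {
                pendingBytes += len(raw)
            }
            if pendingBytes >= maxSliceBytes {
                flush()
            }
        case <-ticker.C: // a minute of wall-clock time has elapsed
            flush()
        }
    }
}

// compress BSON-marshals the entries and gzips them into a single payload.
func compress(entries []Entry) []byte {
    var buf bytes.Buffer
    zw := gzip.NewWriter(&buf)
    for _, e := range entries {
        raw, _ := bson.Marshal(e) // error handling elided in this sketch
        zw.Write(raw)
    }
    zw.Close()
    return buf.Bytes()
}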
Initial Syncing
In some cases, the customer has started additional replica sets or clusters to be backed up in the service, in which case the MMS cloud’s answer to the agent’s conf call will also include a request to run an initial sync for those replica sets.
If the agent is asked to begin backing up new replica sets, it starts up a separate sync goroutine, which connects to a replica set member and captures the timestamp of the last oplog entry. (By default, this is a secondary member, and while oplog entries themselves are always read from the primary, the agent gets the window of critical operations from whichever instance it syncs from.) It then iterates through all the databases, gathering the collections and indices to back up, and sends the names of these back to MMS immediately.
For each of the collections, the sync routine gets a natural order cursor and begins reading the documents from the collection. Then, just like the tailing routine, it puts the documents into the channel that the slicing and compressing routine reads from. When the sync is done, the MMS cloud can begin playing the oplog, catching up on any changes made since the sync started.
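A sketch of that collection walk, once more with mgo rather than the agent’s real code, is below. In the real agent these documents feed the same slicing and compression pipeline; here the channel simply carries bson.M documents for brevity:

package agent

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// syncCollection streams every document of db.collection onto out.
func syncCollection(session *mgo.Session, db, collection string, out chan<- bson.M) error {
    // Sort("$natural") asks for the documents in on-disk (natural) order.
    iter := session.DB(db).C(collection).Find(nil).Sort("$natural").Iter()
    defer iter.Close()

    doc := bson.M{}
    for iter.Next(&doc) {
        out <- doc
        doc = bson.M{} // don't let the next decode reuse the previous map
    }
    return iter.Err()
}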
The initial sync requires high-level logic to cope with cases where a document moves position on disk while its collection is being synced. As with the rest of MMS Backup, the agent does not implement this logic; it simply fetches the documents the MMS cloud asks for after the initial sync pass. We will devote the next post to the ins and outs of that process.
The Clustershot
One of the great features of MMS Backup is its ability to easily capture a chunk-consistent snapshot of a sharded cluster. (Hence “clustershot”. This is not a type of ammunition.) Periodically, the MMS cloud answers an agent’s conf call with an added request to create a checkpoint in each replica set, which enables this feature.
To execute this request, the agent instructs a mongos in the cluster to halt the balancer. Without this step, snapshots of the individual replica sets within a cluster might be taken while a chunk is being migrated, resulting in duplicated or missing data. Once the balancer stops, the agent writes a token into the oplog of every replica set and config server in the cluster, after which the balancer is re-enabled. This token is the synchronization point, creating a logical “moment” which can be restored across all shards while remaining consistent with the chunk layout of the sharded cluster. The backup service uses these tokens to pause oplog application across the shards and take their snapshots.
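A very rough sketch of that sequence is below. Flipping the stopped flag in config.settings was one common way to pause the balancer at the time, and the token write is shown as an insert into a made-up, replicated collection, since any replicated write lands in the oplog. None of this is the agent’s actual code or namespaces:

package agent

import (
    "gopkg.in/mgo.v2"
    "gopkg.in/mgo.v2/bson"
)

// takeCheckpoint pauses the balancer via a mongos, writes a token document
// on every shard so that it appears in each shard's oplog, then re-enables
// the balancer. (The real agent also writes the token to the config servers.)
func takeCheckpoint(mongos *mgo.Session, shards []*mgo.Session, token bson.M) error {
    settings := mongos.DB("config").C("settings")

    // Pause the balancer so no chunk migration straddles the checkpoint. A
    // fuller implementation would also wait for any in-flight migration to
    // finish before writing the tokens.
    if _, err := settings.Upsert(bson.M{"_id": "balancer"}, bson.M{"$set": bson.M{"stopped": true}}); err != nil {
        return err
    }
    // Re-enable the balancer no matter what happens below.
    defer settings.Upsert(bson.M{"_id": "balancer"}, bson.M{"$set": bson.M{"stopped": false}})

    // Any replicated write produces an oplog entry, so inserting the same
    // token document on each shard marks one logical moment in every oplog.
    // The "backup.tokens" namespace here is invented for this sketch.
    for _, shard := range shards {
        if err := shard.DB("backup").C("tokens").Insert(token); err != nil {
            return err
        }
    }
    return nil
}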
Up Next: Syncing
In this post we’ve covered the overall behavior of the backup agent, but we’ve skipped over the really interesting logic required to perform an initial sync. Next time we’ll go into that in detail.
June 5, 2014