Archiving and Compression in MongoDB Tools

Wisdom Omuya

Introduction

My talk at MongoDB World 2015, "Putting the Go in MongoDB", focused on the rewrite of the MongoDB tools from C++ to Go and some of the usability and performance improvements that resulted. I only briefly touched on two new features planned for the 3.2 release - archiving and compression.

In this post, I’ll provide a more detailed explanation of the archiving and compression features in mongodump and mongorestore, as well as explore some of the use cases that are now possible with these features.

Overview

A general purpose archive is composed of one or more files. An example is the tape archive format (tar), which consists of one or more files written in sequence. Archiving is particularly useful in applications that perform interprocess communication. For example, you can archive a directory on a remote server and pipe the tarball, over SSH, to localhost, where it is extracted:

ssh source.server.com tar c sourceDirectory | tar x

Since archives are created in a sequential manner, the receiving end can consume data as the sending end produces it.
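To make the sequential property concrete, here is a minimal sketch using Python's standard tarfile module in its streaming mode. The in-memory buffer stands in for the SSH pipe, and the file names and contents are made up for illustration.

```python
import io
import tarfile

def write_stream(files, out):
    """Write (name, bytes) pairs to `out` as a sequential tar stream."""
    # Mode "w|" produces a non-seekable stream, which is what a pipe requires.
    with tarfile.open(fileobj=out, mode="w|") as archive:
        for name, payload in files:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))

def read_stream(source):
    """Yield (name, bytes) pairs in the order they were written."""
    # Mode "r|" reads members strictly in sequence, without seeking.
    with tarfile.open(fileobj=source, mode="r|") as archive:
        for member in archive:
            yield member.name, archive.extractfile(member).read()

# Hypothetical files; the buffer plays the role of the pipe between hosts.
files = [("sourceDirectory/a.txt", b"first"), ("sourceDirectory/b.txt", b"second")]
pipe = io.BytesIO()
write_stream(files, pipe)
pipe.seek(0)
received = list(read_stream(pipe))
```

Each file is written whole before the next one starts, so the reader never needs to look ahead or seek backwards.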

In 3.0, we added the ability to concurrently back up and restore multiple collections from and to MongoDB, which allows you to more fully utilize your disk I/O when performing backup operations. As a result, the data written to a mongodump archive is not necessarily received sequentially [1]. Similarly, mongorestore reads collections concurrently for restore operations, so its read order is also not necessarily sequential.

General purpose archive formats, like tar, only support contiguous file packing within the archive. Using these formats for mongodump and mongorestore would cause an unacceptable performance degradation, because data from all collections would have to be written and read in order, one collection at a time. To support the concurrent behavior of these tools, we developed a special purpose archive format that supports non-contiguous file writes. The new archiving feature provides major gains in the efficiency of backup and restore operations.
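The idea of non-contiguous writes can be sketched as a stream of tagged blocks, where each block names the collection it belongs to. This is not the actual mongodump archive format; the block layout, the function names, and the sample payloads below are assumptions chosen only to illustrate interleaving.

```python
import io
import struct
from collections import defaultdict

HEADER = "<HI"  # assumed layout: 2-byte name length, 4-byte payload length

def write_block(out, namespace, payload):
    """Append one tagged block: header, namespace, then payload."""
    name = namespace.encode("utf-8")
    out.write(struct.pack(HEADER, len(name), len(payload)))
    out.write(name)
    out.write(payload)

def read_blocks(source):
    """Yield (namespace, payload) for each block until the stream ends."""
    header_size = struct.calcsize(HEADER)
    while True:
        header = source.read(header_size)
        if not header:
            return
        name_len, payload_len = struct.unpack(HEADER, header)
        namespace = source.read(name_len).decode("utf-8")
        yield namespace, source.read(payload_len)

def demultiplex(source):
    """Reassemble each collection's bytes from interleaved blocks."""
    collections = defaultdict(bytearray)
    for namespace, payload in read_blocks(source):
        collections[namespace] += payload
    return {ns: bytes(data) for ns, data in collections.items()}

archive = io.BytesIO()
# Blocks land in whatever order the concurrent readers happen to produce them.
write_block(archive, "country.nigeria", b"doc1")
write_block(archive, "country.austria", b"docA")
write_block(archive, "country.nigeria", b"doc2")
archive.seek(0)
restored = demultiplex(archive)
```

Because every block carries its own tag, a writer never has to wait for one collection to finish before writing data from another, and a reader can route blocks to parallel restore workers as they arrive.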

Background

In order to put archiving in context, let’s consider how you might ordinarily create a backup. As an example, say you have a “country” database with two collections - “nigeria” and “austria”. You might run [2]:

mongodump --db country

The command above reads data from all collections in the “country” database and writes it to the “dump” directory [3]. It produces the following directory listing:

dump/
└── [4.3M]  country
    ├── [2.1M]  austria.bson
    ├── [  87]  austria.metadata.json
    ├── [2.1M]  nigeria.bson
    ├── [  87]  nigeria.metadata.json
    └── [ 140]  system.indexes.bson
1 directory, 5 files

You could also back up the entire server. Here, the server contains two databases (country and product):

mongodump
├── [5.4M]  dump
│   ├── [4.3M]  country
│   │   ├── [2.1M]  austria.bson
│   │   ├── [  87]  austria.metadata.json
│   │   ├── [2.1M]  nigeria.bson
│   │   ├── [  87]  nigeria.metadata.json
│   │   └── [ 140]  system.indexes.bson
│   └── [1.1M]  product
│       ├── [1.0M]  mongodump.bson
│       ├── [  89]  mongodump.metadata.json
│       └── [  72]  system.indexes.bson
2 directories, 8 files

Or you could choose to back up a single collection to stdout instead of a directory:

mongodump --db country --collection nigeria --out -

Archiving Support

In 3.2, we’re introducing an additional mode of creating backups - the “archive” mode, which writes all the dump data, even from disparate databases and collections, to a single output file. Creating an archive with mongodump is trivial - only one additional option is needed:

mongodump --db country --archive=country.archive
-rw-rw-r-- 1 wisdom wisdom 4.2M Jun 29 11:12 country.archive

The command above creates an archive of the “country” database in the “country.archive” file. If you pass “--archive” without a file name, the archive is written to stdout. Unlike the directory mode of performing backups, which creates a tree of directories, the result of running in archive mode is a single file that contains all the data from the “country” database - all the collections, indexes, etc.

You can also back up a single collection or the contents of an entire server:

Single Collection:

mongodump --db country --collection nigeria --archive=nga.archive 
-rw-rw-r-- 1 wisdom wisdom 2.1M Jun 29 11:15 nga.archive

Entire Server:

mongodump --archive=server.archive
-rw-rw-r-- 1 wisdom wisdom 5.3M Jun 29 11:26 server.archive

In mongodump, the archive mode allows multiple collections to be packed within the archive non-contiguously. In mongorestore, it allows multiple collections to be restored in parallel, as was already possible using the directory mode. As a result, you can perform data migration over networks, reduce your disk I/O footprint, and take full advantage of the concurrency gains both in the tools and in your underlying storage engine.

Data Migration

One of the use cases that archiving improves is interprocess communication between mongodump and mongorestore - specifically, being able to pipe data from one to the other. In prior versions, there was limited support for this - you could only pipe one collection at a time. With archiving, however, there are no such limits. This can be useful in environments where your database servers are firewalled for security purposes. In such cases, one common design is to allow database access from only one or a handful of servers. Using archives, data migration over SSH [4] is trivial:

ssh wisdom@proxy.server.com mongodump --host source.server.com --archive  | ssh wisdom@target.server.com mongorestore --archive

The command above will SSH into a proxy host (proxy.server.com) that has access to the source server (source.server.com) and run mongodump there. The contents of the source server are then sent, over SSH, to a target server (target.server.com), where mongorestore performs the final restore.

Without archiving, the only other way to accomplish these kinds of operations with mongodump is to first perform a backup to disk, copy the files over to the target server, and then run mongorestore. With archiving, this is all accomplished in one command - without any additional disk I/O overhead.

Compression Support

In addition to archiving, we’ve also added support for compression using gzip. This is exposed through a new command line option, “--gzip”, in both mongodump and mongorestore. Compression works for backups created in both the directory mode and the archive mode, and it reduces disk space usage.

mongodump --db country --gzip

produces:

dump/
└── [568K]  country
    ├── [254K]  austria.bson.gz
    ├── [ 100]  austria.metadata.json.gz
    ├── [254K]  nigeria.bson.gz
    ├── [ 100]  nigeria.metadata.json.gz
    └── [  91]  system.indexes.bson.gz
1 directory, 5 files

Note that the size of the directory mode’s compressed backup - 568KB - is significantly less than that of the uncompressed backup - 4.3MB.

To compress an archive:

mongodump --db country --gzip --archive=country.archive
-rw-rw-r-- 1 wisdom wisdom 509K Jun 29 11:23 country.archive

For archives, the data is compressed before it is written into the archive.
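The ordering of "compress first, then pack" can be sketched with Python's standard gzip module. This is only an illustration, not the tools' implementation: compressing each collection separately is an assumption, the dictionary is a stand-in for the archive, and the sample documents are invented.

```python
import gzip

# Hypothetical collection contents; real dumps hold BSON, not JSON lines.
collections = {
    "country.nigeria": b'{"name": "Lagos"}\n' * 1000,
    "country.austria": b'{"name": "Vienna"}\n' * 1000,
}

# Compress each collection's bytes first, then place the result in the archive.
archive = {ns: gzip.compress(data) for ns, data in collections.items()}

# Restoring reverses the steps: unpack from the archive, then decompress.
restored = {ns: gzip.decompress(blob) for ns, blob in archive.items()}

raw_size = sum(len(data) for data in collections.values())
packed_size = sum(len(blob) for blob in archive.values())
```

With data this repetitive, the packed size is far smaller than the raw size; how much real backups shrink depends on the documents themselves.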

To restore a compressed directory mode backup, you would run [5]:

mongorestore --gzip

A similar command can be used to restore a compressed archive mode backup:

mongorestore --gzip --archive=country.archive

You can also combine compression with archiving when migrating data over SSH:

ssh wisdom@proxy.server.com mongodump --host source.server.com --archive --gzip  | ssh wisdom@target.server.com mongorestore --archive --gzip

The data migration happens without any disk I/O overhead and, because of compression, uses less network bandwidth.

Summary

Archiving and compression open up a range of use cases for performing backup and restore operations. If you are using the MongoDB tools for these and other kinds of applications, we want to hear about your use case. You can try out these features in the current unstable release of the tools.

Note: As the only backup solutions that offer cluster-wide snapshots of sharded clusters, MongoDB Ops Manager and MongoDB Cloud Manager are recommended for larger MongoDB deployments.


[1] By default, mongodump will concurrently read data from various MongoDB collections and write it out.
[2] mongodump and mongorestore default to “localhost” as the host and “27017” as the port.
[3] The directory name can be changed using the “--out” option in mongodump.
[4] You can also set up local/remote port forwarding (SSH tunneling).
[5] By default, mongorestore looks for data in the “dump” directory.

About the Author - Wisdom

Wisdom Omuya is a Software Engineer at MongoDB and has been working on the MongoDB tools and a distributed continuous integration system, Evergreen. Before MongoDB, Wisdom worked as a research assistant at the Center for Computational Learning Systems and before that, at the Central Bank of Nigeria. He has a BS in Software Engineering from the American University of Nigeria and a Masters in Computer Science from Columbia University.