New Compression Options in MongoDB 3.0

Asya Kamsky
April 30, 2015 | Updated: December 2, 2025

MongoDB 3.0 introduces compression with the WiredTiger storage engine. In this post we will take a look at the different options, and show some examples of how the feature works. As always, YMMV, so we encourage you to test your own data and your own application.

Why compression?

Everyone knows storage is cheap, right?

But chances are you’re adding data faster than storage prices are declining, so your net spend on storage is rising. Your internal costs might also incorporate management and other factors, so they may be significantly higher than commodity market prices. In other words, it still pays to look for ways to reduce your storage needs.

Size is one factor, and there are others. Disk I/O latency is dominated by seek time on rotational storage. By decreasing the size of the data, fewer disk seeks will be necessary to retrieve a given quantity of data, and disk I/O throughput will improve. In terms of RAM, some compressed formats can be used without decompressing the data in memory. In these cases more data can fit in RAM, which improves performance.

Storage properties of MongoDB

There are two important features related to storage that affect how space is used in MongoDB: BSON and dynamic schema.

MongoDB stores data in BSON, a binary encoding of JSON-like documents (BSON supports additional data types, such as dates, different types of numbers, binary). BSON is efficient to encode and decode, and it is easily traversable. However, BSON does not compress data, and it is possible its representation of data is actually larger than the JSON equivalent.

One of the things users love about MongoDB’s document data model is dynamic schema. In most databases, the schema is described and maintained centrally in a catalog or system tables. Column names are stored once for all rows. This approach is efficient in terms of space, but it requires all data to be structured according to the schema. In MongoDB there is currently no central catalog: each document is self-describing. New fields can be added to a document without affecting other documents, and without registering the fields in a central catalog.

The tradeoff is that with greater flexibility comes greater use of space. Field names are defined in every document. It is a best practice to use shorter field names when possible. However, it is also important not to take this too far – single letter field names or codes can obscure the field names, making it more difficult to use the data.

Fortunately, traditional schema is not the only way to be space efficient. Compression is very effective for repeating values like field names, as well as much of the data stored in documents.

There is no Universal Compression

Compression is all around us: images (JPEG, GIF), audio (mp3), video (MPEG), and most web servers compress web pages before sending to your browser using gzip. Compression algorithms have been around for decades, and there are competitions that award innovation.

Compression libraries rely on CPU and RAM to compress and decompress data, and each makes different tradeoffs in terms of compression rate, speed, and resource utilization. For example, by one measure, today’s best compression library for text can compress 1GB of Wikipedia data to 124MB compared to 323MB for gzip, but it takes almost 3,000 times longer and 30,000 times more memory to do so. Depending on your data and your application, one library may be much more effective for your needs than others.

MongoDB 3.0 introduces WiredTiger, a new storage engine that supports compression. WiredTiger manages disk I/O using pages. Each page contains many BSON documents. As pages are written to disk they are compressed by default, and when they are read into the cache from disk they are decompressed.

One of the basic concepts of compression is that repeating values – exact values as well as patterns – can be stored once in compressed form, reducing the total amount of space. Larger units of data tend to compress more effectively as there tend to be more repeating values. By compressing at the page level – commonly called block compression – WiredTiger can more efficiently compress data.
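The effect of compressing larger units can be sketched with Node's built-in zlib module. This is only an illustration under stated assumptions: the documents are made-up JSON strings rather than BSON, the field values are hypothetical, and joining them into one buffer is a stand-in for a page, not WiredTiger's actual page format.

```javascript
const zlib = require("zlib");

// Hypothetical documents with repeated field names and low-cardinality values.
const docs = [];
for (let i = 0; i < 200; i++) {
  docs.push(JSON.stringify({
    carrier: i % 2 === 0 ? "AA" : "DL",
    origin: "JFK",
    dest: "LAX",
    fl_num: i,
  }));
}

// Compress each document on its own and add up the compressed sizes.
const perDocumentBytes = docs
  .map((doc) => zlib.deflateSync(Buffer.from(doc)).length)
  .reduce((sum, n) => sum + n, 0);

// Compress all documents together as one block (a stand-in for a page).
const block = docs.join("\n");
const rawBytes = Buffer.byteLength(block);
const blockBytes = zlib.deflateSync(Buffer.from(block)).length;

console.log({ rawBytes, perDocumentBytes, blockBytes });
```

Because the field names repeat across documents, the single block has far more redundancy for the compressor to exploit than any one document does, so the block total should come out well below the per-document total.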

WiredTiger supports multiple compression libraries. You can decide which option is best for you at the collection level. This is an important option – your access patterns and your data could be quite different across collections. For example, if you’re using GridFS to store large documents such as images and videos, MongoDB automatically breaks the large files into many smaller “chunks” and reassembles them when requested. The implementation of GridFS maintains two collections: fs.files, which contains the metadata for the large files and their associated chunks, and fs.chunks, which contains the large data broken into 255KB chunks. With images and videos, compression will probably be beneficial for the fs.files collection, but the data contained in fs.chunks is probably already compressed, and so it may make sense to disable compression for this collection.

Compression options in MongoDB 3.0

In MongoDB 3.0, WiredTiger provides three compression options for collections:

No compression
Snappy (enabled by default) – very good compression, efficient use of resources
zlib (similar to gzip) – excellent compression, but more resource intensive

There are two compression options for indexes:

No compression
Prefix (enabled by default) – good compression, efficient use of resources

You may wonder why the compression options for indexes are different from those for collections. Prefix compression is fairly simple – the “prefix” of values is deduplicated from the data set. This is especially effective for some data sets, like those with low cardinality (e.g., country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of the second field. Prefix compression also provides one very important advantage over Snappy or zlib – queries operate directly on the compressed indexes, including covering queries.

When compressed collection data is accessed from disk, it is decompressed in cache. With prefix compression, indexes can remain compressed in RAM. We tend to see very good compression with indexes using prefix compression, which means that in most cases you can fit more of your indexes in RAM without sacrificing performance for reads, and with very modest impact to writes. The compression rate will vary significantly depending on the cardinality of your data and whether you use compound indexes.
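The idea behind prefix compression can be sketched in a few lines of JavaScript. This is a simplified illustration, not WiredTiger's on-disk format: the functions prefixEncode and prefixDecode, the "|" separator, and the sample keys are all hypothetical. Each sorted key is stored as the number of leading characters it shares with the previous key, plus the remaining suffix.

```javascript
// Encode sorted keys as [sharedPrefixLength, suffix] pairs.
function prefixEncode(sortedKeys) {
  const encoded = [];
  let previous = "";
  for (const key of sortedKeys) {
    let shared = 0;
    const limit = Math.min(previous.length, key.length);
    while (shared < limit && previous[shared] === key[shared]) shared++;
    encoded.push([shared, key.slice(shared)]);
    previous = key;
  }
  return encoded;
}

// Rebuild the original keys by reusing the prefix of the previous key.
function prefixDecode(encoded) {
  const keys = [];
  let previous = "";
  for (const [shared, suffix] of encoded) {
    const key = previous.slice(0, shared) + suffix;
    keys.push(key);
    previous = key;
  }
  return keys;
}

// Hypothetical compound-index keys: country followed by city.
const keys = ["US|Austin", "US|Boston", "US|Chicago", "UK|Leeds", "UK|London"].sort();
const encoded = prefixEncode(keys);
console.log(encoded);
console.log(prefixDecode(encoded));
```

In a compound key like this, the leading field repeats for every value of the second field, so most entries shrink to a small count plus a short suffix.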

Some things to keep in mind that apply to all the compression options in MongoDB 3.0:

Random data does not compress well
Binary data does not compress well (it may already be compressed)
Text compresses especially well
Field names compress well in documents (the additional benefits of short field names are modest)

Compression is enabled by default for collections and indexes in the WiredTiger storage engine. To explicitly set the compression for the replica at startup, specify the appropriate options in the YAML config file or use the command line option --wiredTigerCollectionBlockCompressor. Because WiredTiger is not the default storage engine in MongoDB 3.0, you’ll also need to specify the --storageEngine option to use WiredTiger and take advantage of these compression features.

To specify compression for specific collections, you’ll need to override the defaults by passing the appropriate options in the db.createCollection() command. For example, to create a collection called email using the zlib compression library:

db.createCollection( "email", { storageEngine: {
                       wiredTiger: { configString: "block_compressor=zlib" }}})
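The same pattern can be used to turn compression off, for example for GridFS chunks that already hold compressed data. The following is a sketch, assuming the WiredTiger configuration keys block_compressor (for collections) and prefix_compression (for indexes); the helper wiredTigerOptions, the collection events, and the field payload_hash are hypothetical names used only for illustration. Verify the option strings against your server version before relying on them.

```javascript
// Build the options document that carries a WiredTiger configuration string.
function wiredTigerOptions(configString) {
  return { storageEngine: { wiredTiger: { configString: configString } } };
}

var noBlockCompression = wiredTigerOptions("block_compressor=none");
var noPrefixCompression = wiredTigerOptions("prefix_compression=false");

// `db` only exists inside the mongo shell; the guard lets this file load elsewhere.
if (typeof db !== "undefined") {
  // Create fs.chunks uncompressed before any files are stored in GridFS.
  db.createCollection("fs.chunks", noBlockCompression);

  // Hypothetical index on near-random values, where prefixes rarely repeat.
  db.events.createIndex({ payload_hash: 1 }, noPrefixCompression);
}
```

Creating the collection explicitly before loading data matters here, because a collection created implicitly by the first insert picks up the server-wide defaults.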

How to measure compression

The best way to measure compression is to separately load the data with and without compression enabled, then compare the two sizes. The db.stats() command returns many different storage statistics, but the two that matter for this comparison are storageSize and indexSize. Values are returned in bytes, but you can convert to MB by passing in 1024*1024:

> db.stats(1024*1024).storageSize + db.stats(1024*1024).indexSize
1406.9201011657715

This is the method we used for the comparisons provided below.

Testing compression on different data sets

Let’s take a look at some different data sets to see how some of the compression options perform. We have five databases:

Enron
This is the infamous Enron email corpus. It includes about a half million emails. There’s a great deal of text in the email body fields, and some of the metadata has low cardinality, which means that they’re both likely to compress well. Here’s an example (the email body is truncated):

{
	"_id" : ObjectId("4f16fc97d1e2d32371003e27"),
	"body" : "",
	"subFolder" : "notes_inbox",
	"mailbox" : "bass-e",
	"filename" : "450.",
	"headers" : {
		"X-cc" : "",
		"From" : "michael.simmons@enron.com",
		"Subject" : "Re: Plays and other information",
		"X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox",
		"Content-Transfer-Encoding" : "7bit",
		"X-bcc" : "",
		"To" : "eric.bass@enron.com",
		"X-Origin" : "Bass-E",
		"X-FileName" : "ebass.nsf",
		"X-From" : "Michael Simmons",
		"Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)",
		"X-To" : "Eric Bass",
		"Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>",
		"Content-Type" : "text/plain; charset=us-ascii",
		"Mime-Version" : "1.0"
	}
}

Here’s how the different options performed with the Enron database:

Flights
The US Federal Aviation Administration (FAA) provides data about on-time performance of airlines. Each flight is represented as a document. Many of the fields have low cardinality, so we expect this data set to compress well:

{
	"_id" : ObjectId("53d81b734aaa3856391da1fb"),
	"origin" : {
		"airport_seq_id" : 1247802,
		"name" : "JFK",
		"wac" : 22,
		"state_fips" : 36,
		"airport_id" : 12478,
		"state_abr" : "NY",
		"city_name" : "New York, NY",
		"city_market_id" : 31703,
		"state_nm" : "New York"
	},
	"arr" : {
		"delay_group" : 0,
		"time" : ISODate("2014-01-01T12:38:00Z"),
		"del15" : 0,
		"delay" : 13,
		"delay_new" : 13,
		"time_blk" : "1200-1259"
	},
	"crs_arr_time" : ISODate("2014-01-01T12:25:00Z"),
	"delays" : {
		"dep" : 14,
		"arr" : 13
	},
	"taxi_in" : 5,
	"distance_group" : 10,
	"fl_date" : ISODate("2014-01-01T00:00:00Z"),
	"actual_elapsed_time" : 384,
	"wheels_off" : ISODate("2014-01-01T09:34:00Z"),
	"fl_num" : 1,
	"div_airport_landings" : 0,
	"diverted" : 0,
	"wheels_on" : ISODate("2014-01-01T12:33:00Z"),
	"crs_elapsed_time" : 385,
	"dest" : {
		"airport_seq_id" : 1289203,
		"state_nm" : "California",
		"wac" : 91,
		"state_fips" : 6,
		"airport_id" : 12892,
		"state_abr" : "CA",
		"city_name" : "Los Angeles, CA",
		"city_market_id" : 32575
	},
	"crs_dep_time" : ISODate("2014-01-01T09:00:00Z"),
	"cancelled" : 0,
	"unique_carrier" : "AA",
	"taxi_out" : 20,
	"tail_num" : "N338AA",
	"air_time" : 359,
	"carrier" : "AA",
	"airline_id" : 19805,
	"dep" : {
		"delay_group" : 0,
		"time" : ISODate("2014-01-01T09:14:00Z"),
		"del15" : 0,
		"delay" : 14,
		"delay_new" : 14,
		"time_blk" : "0900-0959"
	},
	"distance" : 2475
}

Here’s how the different options performed with the Flights database:

MongoDB Config Database
This is the metadata MongoDB stores in the config database for sharded clusters. The manual describes the various collections in that database. Here’s an example from the chunks collection, which stores a document for each chunk in the cluster:

{
   "_id" : "mydb.foo-a_\"cat\"",
   "lastmod" : Timestamp(1000, 3),
   "lastmodEpoch" : ObjectId("5078407bd58b175c5c225fdc"),
   "ns" : "mydb.foo",
   "min" : {
         "animal" : "cat"
   },
   "max" : {
         "animal" : "dog"
   },
   "shard" : "shard0004"
}

Here’s how the different options performed with the config database:

TPC-H
TPC-H is a classic benchmark used for testing relational analytical DBMS. The schema has been modified to use MongoDB’s document model. Here’s an example from the orders table with only the first of many line items displayed for this order:

{
	"_id" : 1,
	"cname" : "Customer#000036901",
	"status" : "O",
	"totalprice" : 173665.47,
	"orderdate" : ISODate("1996-01-02T00:00:00Z"),
	"comment" : "instructions sleep furiously among ",
	"lineitems" : [
		{
			"lineitem" : 1,
			"mfgr" : "Manufacturer#4",
			"brand" : "Brand#44",
			"type" : "PROMO BRUSHED NICKEL",
			"container" : "JUMBO JAR",
			"quantity" : 17,
			"returnflag" : "N",
			"linestatus" : "O",
			"extprice" : 21168.23,
			"discount" : 0.04,
			"shipinstr" : "DELIVER IN PERSON",
			"realPrice" : 20321.5008,
			"shipmode" : "TRUCK",
			"commitDate" : ISODate("1996-02-12T00:00:00Z"),
			"shipDate" : ISODate("1996-03-13T00:00:00Z"),
			"receiptDate" : ISODate("1996-03-22T00:00:00Z"),
			"tax" : 0.02,
			"size" : 9,
			"nation" : "UNITED KINGDOM",
			"region" : "EUROPE"
		}
	]
}

Here’s how the different options performed with the TPC-H database:

Twitter
This is a database of 200K tweets. Here’s a simulated tweet introducing our Java 3.0 driver:

{
  "coordinates": null,
  "created_at": "Fri April 25 16:02:46 +0000 2010",
  "favorited": false,
  "truncated": false,
  "id_str": "",
  "entities": {
    "urls": [
      {
        "expanded_url": null,
        "url": "http://mongodb.com",
        "indices": [
          69,
          100
        ]
      }
    ],
    "hashtags": [ ],
    "user_mentions": [
      {
        "name": "MongoDB",
        "id_str": "",
        "id": null,
        "indices": [
          25,
          30
        ],
        "screen_name": "MongoDB"
      }
    ]
  },
  "in_reply_to_user_id_str": null,
  "text": "Introducing the #Java 3.0 driver for #MongoDB http://buff.ly/1DmMTKu",
  "contributors": null,
  "id": null,
  "retweet_count": 12,
  "in_reply_to_status_id_str": null,
  "geo": null,
  "retweeted": true,
  "in_reply_to_user_id": null,
  "user": {
    "profile_sidebar_border_color": "C0DEED",
    "name": "MongoDB",
    "profile_sidebar_fill_color": "DDEEF6",
    "profile_background_tile": false,
    "profile_image_url": "",
    "location": "New York, NY",
    "created_at": "Fri April 25 23:22:09 +0000 2008",
    "id_str": "",
    "follow_request_sent": false,
    "profile_link_color": "",
    "favourites_count": 1,
    "url": "http://mongodb.com",
    "contributors_enabled": false,
    "utc_offset": -25200,
    "id": null,
    "profile_use_background_image": true,
    "listed_count": null,
    "protected": false,
    "lang": "en",
    "profile_text_color": "",
    "followers_count": 159678,
    "time_zone": "Eastern Time (US & Canada)",
    "verified": false,
    "geo_enabled": true,
    "profile_background_color": "",
    "notifications": false,
    "description": "Community conversation around the MongoDB software. For official company news, follow @mongodbinc.",
    "friends_count": ,
    "profile_background_image_url": "",
    "statuses_count": 7311,
    "screen_name": "MongoDB",
    "following": false,
    "show_all_inline_media": false
  },
  "in_reply_to_screen_name": null,
  "source": "web",
  "place": null,
  "in_reply_to_status_id": null
}

Here’s how the different options performed with the Twitter database:

Comparing compression rates

The varying sizes of these databases make them difficult to compare side by side in terms of absolute size. We can take a closer look at the benefits by comparing the storage savings provided by each option. To do this, we compare the size of each database using Snappy and zlib to the uncompressed size in WiredTiger. As above, we’re adding the value of storageSize and indexSize.

Another way some people describe the benefits of compression is in terms of the ratio of the uncompressed size to the compressed size. Here’s how Snappy and zlib perform across the five databases.
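Both measures are simple arithmetic on the uncompressed and compressed sizes. As a minimal sketch (the function names and the sizes below are hypothetical placeholders, not measurements from this post):

```javascript
// Percentage of space saved relative to the uncompressed size.
function storageSavingsPercent(uncompressed, compressed) {
  return (1 - compressed / uncompressed) * 100;
}

// Ratio of the uncompressed size to the compressed size.
function compressionRatio(uncompressed, compressed) {
  return uncompressed / compressed;
}

// Hypothetical totals in MB (storageSize + indexSize) for one database.
const sizes = { none: 1000, snappy: 400, zlib: 250 };

for (const option of ["snappy", "zlib"]) {
  const savings = storageSavingsPercent(sizes.none, sizes[option]);
  const ratio = compressionRatio(sizes.none, sizes[option]);
  console.log(option, savings.toFixed(1) + "% saved", ratio.toFixed(2) + ":1");
}
```

The two numbers describe the same result from different angles: a database that shrinks to a quarter of its uncompressed size shows 75% savings and a 4:1 ratio.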

How to test your own data

There are two simple ways for you to test compression with your data in MongoDB 3.0.

If you’ve already upgraded to MongoDB 3.0, you can simply add a new secondary to your replica set with the option to use the WiredTiger storage engine specified at startup. While you’re at it, make this replica hidden with 0 votes so that it won’t affect your deployment. This new replica set member will perform an initial sync with one of your existing secondaries. After the initial sync is complete, you can remove the WiredTiger replica from your replica set then connect to that standalone to compare the size of your databases as described above. For each compression option you want to test, you can repeat this process.

Another option is to take a mongodump of your data and use that to restore it into a standalone MongoDB 3.0 instance. By default your collections will use the Snappy compression option, but you can specify different options by first creating the collections with the appropriate setting before running mongorestore, or by starting mongod with different compression options. This approach has the advantage of being able to dump/restore only specific databases, collections, or subsets of collections to perform your testing.

For examples of syntax for setting compression options, see the section “Compression options in MongoDB 3.0” above.

A note on capped collections

Capped collections are implemented very differently in the MMAP storage engine as compared to WiredTiger (and RocksDB). In MMAP, space is allocated for the capped collection at creation time, whereas for WiredTiger and RocksDB space is only allocated as data is added to the capped collection. If you have many empty or mostly-empty capped collections, comparisons between the different storage engines may be somewhat misleading for this reason.

About the Author - Asya

Asya is Lead Product Manager at MongoDB. She joined MongoDB as one of the company's first Solutions Architects. Prior to MongoDB, Asya spent seven years in similar positions at Coverity, a leading development testing company. Before that she spent twelve years working with databases as a developer, DBA, data architect and data warehousing specialist.
