Asya Kamsky


Scaling Your Replica Set: Non-Blocking Secondary Reads in MongoDB 4.0

MongoDB 4.0 adds the ability to read from secondaries while replication is simultaneously processing writes. To see why this is new and important, let's look at secondary read behavior in versions prior to 4.0.

Background

From the outset, MongoDB has been designed so that when you have sequences of writes on the primary, each of the secondary nodes must show the writes in the same order. If you change field "A" in a document and then change field "B", it is not possible to see that document with field "B" changed but field "A" unchanged. Eventually consistent systems allow you to see it, but MongoDB does not, and never has.

On secondary nodes, we apply writes in batches, because applying them sequentially would likely cause secondaries to fall behind the primary. When writes are applied in batches, we must block reads so that applications cannot see data applied in the "wrong" order. This is why, when reading from secondaries, readers periodically have to wait for replication batches to be applied. The heavier the write load, the more likely your secondary reads will hit these occasional "pauses", impacting your latency metrics. Given that applications frequently use secondary reads to reduce query latency (for example when they use "nearest" readPreference), having to wait for replication batches to be applied defeats the goal of getting the lowest possible read latency.

In addition to readers having to wait for replication batch writes to finish, writing a batch needs a lock that requires all reads to complete before it can be taken. That means that in the presence of a high number of reads, replication writes can start lagging – an issue that is compounded when chain replication is enabled.

What was our goal in MongoDB 4.0?

Our goal was to allow reads during oplog application, to decrease read latency and secondary lag, and to increase the maximum throughput of the replica set. For replica sets with a high write load, not having to wait for readers between applying oplog batches allows for lower lag and quicker confirmation of majority writes, resulting in less cache pressure on the primary and better performance overall.

How did we do it?

Starting with MongoDB 4.0, we took advantage of the fact that we implemented support for timestamps in the storage engine, which allows transactions to get a consistent view of data at a specific "cluster time". For more details, see the video: WiredTiger timestamps.

Secondary reads can now also take advantage of these snapshots, by reading from the latest consistent snapshot prior to the current replication batch that's being applied. Reading from that snapshot guarantees a consistent view of the data, and since applying the current replication batch doesn't change these earlier records, we can now relax the replication lock and allow all these secondary reads at the same time the writes are happening.

How much difference does this make?

A lot! The throughput improvement ranges from none (if the replication lock was not a bottleneck for you – that is, your write load is relatively low) to 2x. Most importantly, this improves latency for secondary reads – for those who use readPreference "nearest" because they want to reduce latency from the application to the database, this feature means their latency in the database will also be as low as possible. We saw significant improvement in 95th and 99th percentile latency in these tests.
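Before looking at the numbers, here is a minimal mongo shell sketch of a secondary read using readPreference "nearest" (the collection and filter are hypothetical; no new syntax is needed to benefit from the 4.0 change):

// "nearest" routes reads to the lowest-latency member, which is often a secondary.
db.getMongo().setReadPref("nearest")
// In 4.0 this read no longer waits for the current oplog batch to finish applying.
db.orders.find({ status: "shipped" })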
95th percentile read latency (ms):

Thread levels    8    16    32    64
Feature off      1     2     3     5
Feature on       0     1     1     0

The best part of this new feature? You don't need to do anything to enable it or opt into it. All secondary reads in 4.0 will read from a snapshot without waiting for replication writes.

This is just one of a number of great new features coming in MongoDB 4.0. Take a look at our blog on the 4.0 release candidate to learn more. And don't forget, you've still got time to register for MongoDB World, where you can meet the engineers who are building all of these great new features.

June 14, 2018

New Compression Options in MongoDB 3.0

MongoDB 3.0 introduces compression with the WiredTiger storage engine. In this post we will take a look at the different options and show some examples of how the feature works. As always, YMMV, so we encourage you to test your own data and your own application.

Why compression?

Everyone knows storage is cheap, right? But chances are you're adding data faster than storage prices are declining, so your net spend on storage is rising. Your internal costs might also incorporate management and other factors, so they may be significantly higher than commodity market prices. In other words, it still pays to look for ways to reduce your storage needs.

Size is one factor, and there are others. Disk I/O latency is dominated by seek time on rotational storage. By decreasing the size of the data, fewer disk seeks are necessary to retrieve a given quantity of data, and disk I/O throughput improves. In terms of RAM, some compressed formats can be used without decompressing the data in memory. In these cases more data can fit in RAM, which improves performance.

Storage properties of MongoDB

There are two important features related to storage that affect how space is used in MongoDB: BSON and dynamic schema.

MongoDB stores data in BSON, a binary encoding of JSON-like documents (BSON supports additional data types, such as dates, different types of numbers, and binary data). BSON is efficient to encode and decode, and it is easily traversable. However, BSON does not compress data, and it is possible its representation of data is actually larger than the JSON equivalent.

One of the things users love about MongoDB's document data model is dynamic schema. In most databases, the schema is described and maintained centrally in a catalog or system tables. Column names are stored once for all rows. This approach is efficient in terms of space, but it requires all data to be structured according to the schema. In MongoDB there is currently no central catalog: each document is self-describing. New fields can be added to a document without affecting other documents and without registering the fields in a central catalog.

The tradeoff is that with greater flexibility comes greater use of space. Field names are defined in every document. It is a best practice to use shorter field names when possible. However, it is also important not to take this too far – single-letter field names or codes can obscure the meaning of the fields, making it more difficult to use the data. Fortunately, a traditional schema is not the only way to be space efficient. Compression is very effective for repeating values like field names, as well as for much of the data stored in documents.

There is no Universal Compression

Compression is all around us: images (JPEG, GIF), audio (MP3), video (MPEG), and most web servers compress web pages before sending them to your browser, using gzip. Compression algorithms have been around for decades, and there are competitions that award innovation. Compression libraries rely on CPU and RAM to compress and decompress data, and each makes different tradeoffs in terms of compression rate, speed, and resource utilization. For example, one measure of today's best compression library for text shows it can compress 1GB of Wikipedia data to 124MB, compared to 323MB for gzip, but it takes almost 3,000 times longer and 30,000 times more memory to do so. Depending on your data and your application, one library may be much more effective for your needs than others.
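Since field names are stored in every document, a quick way to see their cost is the mongo shell's Object.bsonsize() helper. The documents below are made up purely for illustration:

// BSON size of the same values with long vs. short field names
Object.bsonsize({ customer_first_name: "Asya", customer_last_name: "Kamsky" })  // larger
Object.bsonsize({ fname: "Asya", lname: "Kamsky" })                             // smaller
// The difference is roughly the extra field-name bytes repeated per document;
// block compression (discussed below) largely removes this overhead on disk.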
MongoDB 3.0 introduces WiredTiger, a new storage engine that supports compression. WiredTiger manages disk I/O using pages. Each page contains many BSON documents. As pages are written to disk they are compressed by default, and when they are read into the cache from disk they are decompressed.

One of the basic concepts of compression is that repeating values – exact values as well as patterns – can be stored once in compressed form, reducing the total amount of space. Larger units of data tend to compress more effectively, as there tend to be more repeating values. By compressing at the page level – commonly called block compression – WiredTiger can compress data more efficiently.

WiredTiger supports multiple compression libraries. You can decide which option is best for you at the collection level. This is an important option – your access patterns and your data could be quite different across collections. For example, if you're using GridFS to store large documents such as images and videos, MongoDB automatically breaks the large files into many smaller "chunks" and reassembles them when requested. The implementation of GridFS maintains two collections: fs.files, which contains the metadata for the large files and their associated chunks, and fs.chunks, which contains the large data broken into 255KB chunks. With images and videos, compression will probably be beneficial for the fs.files collection, but the data contained in fs.chunks is probably already compressed, so it may make sense to disable compression for this collection (see the sketch at the end of this section).

Compression options in MongoDB 3.0

In MongoDB 3.0, WiredTiger provides three compression options for collections:

- No compression
- Snappy (enabled by default) – very good compression, efficient use of resources
- zlib (similar to gzip) – excellent compression, but more resource intensive

There are two compression options for indexes:

- No compression
- Prefix (enabled by default) – good compression, efficient use of resources

You may wonder why the compression options for indexes are different than those for collections. Prefix compression is fairly simple – the "prefix" of values is deduplicated from the data set. This is especially effective for some data sets, like those with low cardinality (e.g., country), or those with repeating values, like phone numbers, social security codes, and geo-coordinates. It is especially effective for compound indexes, where the first field is repeated with all the unique values of the second field.

Prefix compression also provides one very important advantage over Snappy or zlib – queries operate directly on the compressed indexes, including covered queries. When compressed collection data is accessed from disk, it is decompressed in cache. With prefix compression, indexes can remain compressed in RAM. We tend to see very good compression with indexes using prefix compression, which means that in most cases you can fit more of your indexes in RAM without sacrificing performance for reads, and with very modest impact to writes. The compression rate will vary significantly depending on the cardinality of your data and whether you use compound indexes.
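As a concrete illustration of the GridFS case above, here is a hedged sketch of pre-creating the chunks collection with block compression turned off before loading data (it assumes the default fs bucket name; the configString form is explained in the next section):

// fs.chunks usually holds already-compressed media, so block compression buys little.
db.createCollection("fs.chunks", { storageEngine: { wiredTiger: { configString: "block_compressor=none" }}})
// fs.files holds metadata and can keep the default snappy compressor.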
Some things to keep in mind that apply to all the compression options in MongoDB 3.0:

- Random data does not compress well
- Binary data does not compress well (it may already be compressed)
- Text compresses especially well
- Field names compress well in documents (so the additional benefit of short field names is modest)

Compression is enabled by default for collections and indexes in the WiredTiger storage engine. To explicitly set the compression for a replica at startup, specify the appropriate options in the YAML config file, or use the command-line option --wiredTigerCollectionBlockCompressor. Because WiredTiger is not the default storage engine in MongoDB 3.0, you'll also need to specify the --storageEngine option to use WiredTiger and take advantage of these compression features.

To specify compression for specific collections, you'll need to override the defaults by passing the appropriate options in the db.createCollection() command. For example, to create a collection called email using the zlib compression library:

db.createCollection( "email", { storageEngine: { wiredTiger: { configString: "block_compressor=zlib" }}})

How to measure compression

The best way to measure compression is to separately load the data with and without compression enabled, then compare the two sizes. The db.stats() command returns many different storage statistics, but the two that matter for this comparison are storageSize and indexSize. Values are returned in bytes, but you can convert to MB by passing in a scale of 1024*1024:

> db.stats(1024*1024).storageSize + db.stats(1024*1024).indexSize
1406.9201011657715

This is the method we used for the comparisons provided below.

Testing compression on different data sets

Let's take a look at some different data sets to see how some of the compression options perform. We have five databases:

Enron

This is the infamous Enron email corpus. It includes about half a million emails. There's a great deal of text in the email body fields, and some of the metadata has low cardinality, which means that both are likely to compress well. Here's an example (the email body is truncated):

{ "_id" : ObjectId("4f16fc97d1e2d32371003e27"), "body" : "", "subFolder" : "notes_inbox", "mailbox" : "bass-e", "filename" : "450.", "headers" : { "X-cc" : "", "From" : "michael.simmons@enron.com", "Subject" : "Re: Plays and other information", "X-Folder" : "\\Eric_Bass_Dec2000\\Notes Folders\\Notes inbox", "Content-Transfer-Encoding" : "7bit", "X-bcc" : "", "To" : "eric.bass@enron.com", "X-Origin" : "Bass-E", "X-FileName" : "ebass.nsf", "X-From" : "Michael Simmons", "Date" : "Tue, 14 Nov 2000 08:22:00 -0800 (PST)", "X-To" : "Eric Bass", "Message-ID" : "<6884142.1075854677416.JavaMail.evans@thyme>", "Content-Type" : "text/plain; charset=us-ascii", "Mime-Version" : "1.0" } }

Here's how the different options performed with the Enron database:

Flights

The US Federal Aviation Administration (FAA) provides data about on-time performance of airlines. Each flight is represented as a document.
Many of the fields have low cardinality, so we expect this data set to compress well:

{ "_id" : ObjectId("53d81b734aaa3856391da1fb"), "origin" : { "airport_seq_id" : 1247802, "name" : "JFK", "wac" : 22, "state_fips" : 36, "airport_id" : 12478, "state_abr" : "NY", "city_name" : "New York, NY", "city_market_id" : 31703, "state_nm" : "New York" }, "arr" : { "delay_group" : 0, "time" : ISODate("2014-01-01T12:38:00Z"), "del15" : 0, "delay" : 13, "delay_new" : 13, "time_blk" : "1200-1259" }, "crs_arr_time" : ISODate("2014-01-01T12:25:00Z"), "delays" : { "dep" : 14, "arr" : 13 }, "taxi_in" : 5, "distance_group" : 10, "fl_date" : ISODate("2014-01-01T00:00:00Z"), "actual_elapsed_time" : 384, "wheels_off" : ISODate("2014-01-01T09:34:00Z"), "fl_num" : 1, "div_airport_landings" : 0, "diverted" : 0, "wheels_on" : ISODate("2014-01-01T12:33:00Z"), "crs_elapsed_time" : 385, "dest" : { "airport_seq_id" : 1289203, "state_nm" : "California", "wac" : 91, "state_fips" : 6, "airport_id" : 12892, "state_abr" : "CA", "city_name" : "Los Angeles, CA", "city_market_id" : 32575 }, "crs_dep_time" : ISODate("2014-01-01T09:00:00Z"), "cancelled" : 0, "unique_carrier" : "AA", "taxi_out" : 20, "tail_num" : "N338AA", "air_time" : 359, "carrier" : "AA", "airline_id" : 19805, "dep" : { "delay_group" : 0, "time" : ISODate("2014-01-01T09:14:00Z"), "del15" : 0, "delay" : 14, "delay_new" : 14, "time_blk" : "0900-0959" }, "distance" : 2475 }

Here's how the different options performed with the Flights database:

MongoDB Config Database

This is the metadata MongoDB stores in the config database for sharded clusters. The manual describes the various collections in that database. Here's an example from the chunks collection, which stores a document for each chunk in the cluster:

{ "_id" : "mydb.foo-a_\"cat\"", "lastmod" : Timestamp(1000, 3), "lastmodEpoch" : ObjectId("5078407bd58b175c5c225fdc"), "ns" : "mydb.foo", "min" : { "animal" : "cat" }, "max" : { "animal" : "dog" }, "shard" : "shard0004" }

Here's how the different options performed with the config database:

TPC-H

TPC-H is a classic benchmark used for testing relational analytical DBMSs. The schema has been modified to use MongoDB's document model. Here's an example from the orders table, with only the first of many line items displayed for this order:

{ "_id" : 1, "cname" : "Customer#000036901", "status" : "O", "totalprice" : 173665.47, "orderdate" : ISODate("1996-01-02T00:00:00Z"), "comment" : "instructions sleep furiously among ", "lineitems" : [ { "lineitem" : 1, "mfgr" : "Manufacturer#4", "brand" : "Brand#44", "type" : "PROMO BRUSHED NICKEL", "container" : "JUMBO JAR", "quantity" : 17, "returnflag" : "N", "linestatus" : "O", "extprice" : 21168.23, "discount" : 0.04, "shipinstr" : "DELIVER IN PERSON", "realPrice" : 20321.5008, "shipmode" : "TRUCK", "commitDate" : ISODate("1996-02-12T00:00:00Z"), "shipDate" : ISODate("1996-03-13T00:00:00Z"), "receiptDate" : ISODate("1996-03-22T00:00:00Z"), "tax" : 0.02, "size" : 9, "nation" : "UNITED KINGDOM", "region" : "EUROPE" } ] }

Here's how the different options performed with the TPC-H database:

Twitter

This is a database of 200K tweets.
Here's a simulated tweet introducing our Java 3.0 driver:

{ "coordinates": null, "created_at": "Fri April 25 16:02:46 +0000 2010", "favorited": false, "truncated": false, "id_str": "", "entities": { "urls": [ { "expanded_url": null, "url": "http://mongodb.com", "indices": [ 69, 100 ] } ], "hashtags": [ ], "user_mentions": [ { "name": "MongoDB", "id_str": "", "id": null, "indices": [ 25, 30 ], "screen_name": "MongoDB" } ] }, "in_reply_to_user_id_str": null, "text": "Introducing the #Java 3.0 driver for #MongoDB http://buff.ly/1DmMTKu", "contributors": null, "id": null, "retweet_count": 12, "in_reply_to_status_id_str": null, "geo": null, "retweeted": true, "in_reply_to_user_id": null, "user": { "profile_sidebar_border_color": "C0DEED", "name": "MongoDB", "profile_sidebar_fill_color": "DDEEF6", "profile_background_tile": false, "profile_image_url": "", "location": "New York, NY", "created_at": "Fri April 25 23:22:09 +0000 2008", "id_str": "", "follow_request_sent": false, "profile_link_color": "", "favourites_count": 1, "url": "http://mongodb.com", "contributors_enabled": false, "utc_offset": -25200, "id": null, "profile_use_background_image": true, "listed_count": null, "protected": false, "lang": "en", "profile_text_color": "", "followers_count": 159678, "time_zone": "Eastern Time (US & Canada)", "verified": false, "geo_enabled": true, "profile_background_color": "", "notifications": false, "description": "Community conversation around the MongoDB software. For official company news, follow @mongodbinc.", "friends_count": , "profile_background_image_url": "", "statuses_count": 7311, "screen_name": "MongoDB", "following": false, "show_all_inline_media": false }, "in_reply_to_screen_name": null, "source": "web", "place": null, "in_reply_to_status_id": null }

Here's how the different options performed with the Twitter database:

Comparing compression rates

The varying sizes of these databases make them difficult to compare side by side in terms of absolute size. We can take a closer look at the benefits by comparing the storage savings provided by each option. To do this, we compare the size of each database using Snappy and zlib to the uncompressed size in WiredTiger. As above, we're adding the values of storageSize and indexSize. Another way some people describe the benefits of compression is in terms of the ratio of the uncompressed size to the compressed size. Here's how Snappy and zlib perform across the five databases.

How to test your own data

There are two simple ways for you to test compression with your data in MongoDB 3.0.

If you've already upgraded to MongoDB 3.0, you can simply add a new secondary to your replica set with the option to use the WiredTiger storage engine specified at startup. While you're at it, make this replica hidden with 0 votes so that it won't affect your deployment. This new replica set member will perform an initial sync with one of your existing secondaries. After the initial sync is complete, you can remove the WiredTiger replica from your replica set, then connect to it as a standalone to compare the size of your databases as described above. For each compression option you want to test, you can repeat this process.

Another option is to take a mongodump of your data and use that to restore it into a standalone MongoDB 3.0 instance.
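A minimal sketch of that dump-and-restore approach, with hypothetical paths, ports, and database name:

# Dump a database from your existing instance
mongodump --host localhost --port 27017 --db mydb --out /tmp/dump

# Restore into a standalone 3.0 mongod started with WiredTiger, for example:
#   mongod --storageEngine wiredTiger --dbpath /data/wt-test --port 27018
mongorestore --port 27018 /tmp/dump

# Then, in the mongo shell connected to port 27018, compare sizes as above:
#   db.stats(1024*1024).storageSize + db.stats(1024*1024).indexSize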
By default your collections will use the Snappy compression option, but you can specify different options by first creating the collections with the appropriate setting before running mongorestore, or by starting mongod with different compression options. This approach has the advantage of being able to dump/restore only specific databases, collections, or subsets of collections to perform your testing. For examples of the syntax for setting compression options, see the db.createCollection() example above.

A note on capped collections

Capped collections are implemented very differently in the MMAP storage engines compared to WiredTiger (and RocksDB). In MMAP, space is allocated for the capped collection at creation time, whereas for WiredTiger and RocksDB space is only allocated as data is added to the capped collection. If you have many empty or mostly-empty capped collections, comparisons between the different storage engines may be somewhat misleading for this reason.

If you're considering updating your version of MongoDB, take a look at our Major Version Upgrade consulting services.

About the Author - Asya

Asya is Lead Product Manager at MongoDB. She joined MongoDB as one of the company's first Solutions Architects. Prior to MongoDB, Asya spent seven years in similar positions at Coverity, a leading development testing company. Before that she spent twelve years working with databases as a developer, DBA, data architect, and data warehousing specialist.

April 30, 2015

Performance Testing MongoDB 3.0 Part 1: Throughput Improvements Measured with YCSB

Intro

A major focus for MongoDB 3.0 has been improving performance, especially write performance and hardware utilization. To help illustrate what we've done and how to take advantage of the changes we've made, we will be publishing a series of blog posts comparing MongoDB 2.6 and 3.0 performance.

As with any benchmarking, the applicability of the results to your application is not always clear. Use cases for MongoDB are diverse, and it is critical to use performance tests that reflect the needs of your application and the hardware you will use for your deployment. As such, there's really no "standard" benchmark that will inform you about the best technology to use for your application. Only your requirements, your data, and your infrastructure can tell you what you need to know.

To help us measure performance, we use hundreds of different tests that we have developed working with the community. These tests reflect the diverse applications users build and the ever-evolving environments in which they are deployed.

YCSB is used by some organizations as part of their performance testing for several different database technologies. YCSB is fairly basic and probably does not tell you everything you need to know about the performance of your application. However, it is fairly popular and well understood by users of MongoDB and other systems. In this post we'll compare YCSB results for MongoDB 2.6 and 3.0.

Throughput

In YCSB tests, MongoDB 3.0 provides around 7x the throughput of MongoDB 2.6 for multi-threaded, batch inserts. We should expect to see the biggest improvement for this workload because it is 100% writes, and WiredTiger's document-level concurrency control is most beneficial to multi-threaded write workloads on servers with many processor cores.

The second test compares the two systems for a workload that is 95% reads and 5% updates. Here we see approximately 4x better throughput with WiredTiger. This is a smaller improvement than observed for the load phase because writes are only 5% of all operations. In MongoDB 2.6, concurrency control is managed at the database level, and writes can block reads, reducing overall throughput. Looking at this test, the more fine-grained concurrency control of MongoDB 3.0 clearly improves overall throughput.

Finally, for the balanced workload we see over 6x better throughput with MongoDB 3.0. This is better than the 4x improvement we see with the 95% read workload because there are more writes.

Latency

Measuring throughput isn't enough – it is also important to consider the latency of operations. Average latency measured across many operations is not the best metric. Developers who want to ensure a consistently great, low-latency experience worry about the worst-performing queries in their deployment. High-latency queries are measured at the 95th and 99th percentiles – where observed latency is worse than 95% or 99% of all other latencies. (One could argue these are insufficiently precise – most web sessions involve hundreds of requests, so it is very likely that most users will experience latency at the 99th percentile during their session.)

We see very little difference between MongoDB 2.6 and MongoDB 3.0 in terms of read latency: reads are consistently 1 ms or less across workloads. For update latency, however, the results are more interesting. Here we compare the update latency at the 95th and 99th percentiles using the read-intensive workload.
Update latency is significantly improved in MongoDB 3.0: it has been reduced by almost 90% at both the 95th and 99th percentiles. This is important – improving throughput should not come at the cost of greater latency, as this will ultimately degrade the experience for users of the application.

In the balanced workload, update latency is lower still. At the 95th percentile, update latency for MongoDB 3.0 is almost 90% lower than MongoDB 2.6, and over 80% lower at the 99th percentile. As a result of these improvements, users should experience better, more predictable performance.

We believe these tests for throughput and latency demonstrate a major improvement in the write performance for MongoDB.

Small Changes That Make A Big Impact

In future posts we will describe a number of small changes that can make a big impact on MongoDB performance. As a preview, let's take a look at one of the factors we see people overlook frequently.

Providing Sufficient Client Capacity

The default configuration for YCSB uses one thread. With a single thread you will likely observe fairly poor throughput with any database. Don't use a single-threaded benchmark unless your application runs single threaded. Single-threaded tests really only measure latency, not throughput, and capacity planning should consider both factors.

Most databases work best with multiple client threads. Determine the optimal number by adding threads until the throughput stops increasing and/or the latency increases.

Consider running multiple client servers for YCSB. A single client may not be able to generate sufficient load to determine the capacity of the system. Unfortunately, YCSB does not make it easy to use more than one client – you have to coordinate starting and stopping the individual clients, and you have to manually aggregate their results.

When sharding, start by allocating one mongos for every 1-2 shards, and one YCSB client per mongos.

Too many clients can overwhelm the system, initially adding latency and eventually starving the CPU. In some cases it may be necessary to throttle client requests. Finding the right balance of latency and throughput should be a part of any performance tuning exercise. By monitoring both and increasing the number of threads through a series of tests, you can determine a clear relationship between latency and throughput, and the optimal number of threads for a given workload.

We can make two observations based on these results:

- The 99th percentile for all operations is less than 1ms up to 16 threads. With more than 16 threads, latency begins to rise.
- Throughput rises from 1 to 64 threads. After 64 threads, increasing the thread count does not increase throughput, yet it does increase latency.

Based on these results, the optimal thread count for the application is somewhere between 16 and 64 threads, depending on whether we favor latency or throughput. At 64 threads, latency still looks quite good: the 99th percentile for reads is less than 1ms, and the 99th percentile for writes is less than 4ms. Meanwhile, throughput is over 130,000 ops/sec.

YCSB Test Configurations

We tested many different configurations to determine the optimal balance of throughput and latency. For these tests we used 30 million documents and 30 million operations. Documents included 1 field of 100 bytes (151 bytes total). Records were selected using the Zipfian distribution. A sketch of a matching workload configuration appears below.
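Here is a hedged sketch of a YCSB workload file and invocation consistent with those parameters (the connection URL, thread count, and file name are illustrative; check property names against the YCSB build you use):

# workload-mongodb: CoreWorkload properties matching the tests described above
# (30 million records and operations, 1 field of 100 bytes, zipfian distribution)
workload=com.yahoo.ycsb.workloads.CoreWorkload
recordcount=30000000
operationcount=30000000
fieldcount=1
fieldlength=100
# 95% read / 5% update workload; adjust proportions for the load and balanced tests
readproportion=0.95
updateproportion=0.05
requestdistribution=zipfian

# Load phase, then run phase, varying -threads across test runs
./bin/ycsb load mongodb -s -P workload-mongodb -p mongodb.url=mongodb://host1:27017 -threads 32
./bin/ycsb run mongodb -s -P workload-mongodb -p mongodb.url=mongodb://host1:27017 -threads 32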
Results reflect the optimal number of threads, which was determined by increasing the number of threads until the 95th and 99th percentile latency values began to rise and the throughput stopped increasing.

All tests use a replica set with journaling enabled, and environments were configured following our best practices. Always use replica sets for production deployments. The YCSB client ran on a dedicated server. Each replica set member also ran on a dedicated server. All servers were SoftLayer bare metal machines with the following specifications:

- CPU: 2x Deca Core Xeon 2690 V2 - 3.00GHz (Ivy Bridge) - 2x 25MB cache
- RAM: 128 GB Registered DDR3 1333
- Storage: 2x 960GB SSD drives, SATA Disk Controller
- Network: 10 Gbps
- OS: Ubuntu 14.10 (64-bit)
- MongoDB Versions: MongoDB 2.6.7; MongoDB 3.0.1
- MongoDB's YCSB client code

Learn more

Register for our free online courses: Learn MongoDB for free

About the Author - Asya

Asya is Lead Product Manager at MongoDB. She joined MongoDB as one of the company's first Solutions Architects. Prior to MongoDB, Asya spent seven years in similar positions at Coverity, a leading development testing company. Before that she spent twelve years working with databases as a developer, DBA, data architect, and data warehousing specialist.

March 16, 2015