Chris Biow


Building an Inexpensive Petabyte Database with MongoDB and Amazon Web Services: Part 1

Ah, but a man's reach should exceed his grasp,
Or what's a heaven for?
- Robert Browning, "Andrea del Sarto"

Preface

I've always been fascinated by questions of scalability. Working for then-leading enterprise search vendor Verity around 2005, we developed a sub-specialty implementing text search applications for our largest customers. Anything over 30 million documents stretched the already obsolescent Verity engine to its limits. Such implementations were clearly a level beyond mere "enterprise scale," and there was no common term for larger problems, so we started referring to them as "empire scale." Today, standard IT terminology has converged on "Big Data" to describe problems whose very scale dictates the software and techniques used to process them. It also describes a good part of what we at MongoDB address with our customers. Thousands of organizations have adopted MongoDB, primarily because it has enabled fast deployment and agility at all dimensions of scalability, such as databases providing subsecond response over billions of documents, or hosting hundreds of thousands of mobile applications on a single platform.

Challenge

With customers having explored so many dimensions of scalability, we wanted to explore scalability in terms of a data size that would be instantly recognizable, but on a practical budget. We settled on a petabyte of data, with the goal of creating the database as inexpensively as possible. How much is a petabyte? Under the SI system, it's abbreviated PB and equal to 10^15 bytes, not to be confused with the IEC system's "pebibyte," abbreviated PiB and equal to 2^50 bytes. Since the petabyte is the more common standard, we chose that as our target.

Strategy

Amazon Web Services (AWS) was our natural choice of deployment target because it is a very popular environment for MongoDB, and its pricing fits well, with hourly billing for disk and server resources. The key to MongoDB write performance, and indeed that of most databases, is storage performance. Specifically, since MongoDB uses extent-based storage of documents and B-tree indexes, random seeks (IOPS) tend to be the storage attribute that drives performance and cost at scale. This is in contrast with Log-Structured Merge (LSM) tree storage, which uses sequential writes of DB segments but later pays an overhead to merge segments. It is worth noting that with MongoDB 3.0, users will be able to choose which of these approaches is best for their application. All of the tests in this post are based on MongoDB 2.6.

AWS provides three fundamental types of storage, each with implicit per-instance limits on total disk space, IOPS, and sequential throughput:

- Elastic Block Storage (EBS). The Provisioned IOPS (PIOPS) option is best for write-intensive MongoDB applications, as ours would be. At the time we ran these tests, EBS volumes were limited to a maximum of 1TB and 4,000 PIOPS, with at most 24 volumes mounted on each machine instance. Thus, per-instance limits are 24TB and 96K IOPS. Block size per PIOP was 16KB, for a maximum sequential throughput of 1.5GB/s. In practice, throughput is limited by the physical 10Gbps cap on all network traffic to an instance, including EBS.
- Ephemeral SSD. i2.8xlarge instances feature 8 SSDs of 800GB each, for a total of 6.4TB and on the order of 300K IOPS at a 4KB block size, for 1.2GB/s sequential throughput.
- Ephemeral spinning disk. AWS's hs1.8xlarge provides an array of 24 internal SATA drives of 2TB each. Net of some round-off errors, they can be configured as a single RAID 0 array of 46TB, in principle producing around 2.5K IOPS and up to 2.5GB/s sequential throughput.

In tabular form, with costs added (based on the us-west-2/Oregon region, as of October 2014), we get:

    Instance type      Capacity (TB)   IOPS (K)   Sequential (GB/s)   Cost, instance + storage ($/hr)   Cost, PB of storage ($/hr)
    r3.8xlarge / EBS   24              96         1.5                 15.63                             657
    i2.8xlarge         6.4             300        1.2                 6.82                              1,070
    hs1.8xlarge        46              2.5?       2.5                 4.60                              101

The far-right column is the cost to maintain a petabyte of storage online. It doesn't address how long it takes to actually write a petabyte database with MongoDB. As we mentioned above, it's mostly IOPS that tend to limit write performance, though all the measures of performance need to be kept in mind. The IOPS requirement isn't entirely an independent variable: it depends quite a bit on the block size of EBS and SSD, the kinematics of spinning disk, document size, and the details of synchronizing memory-mapped files from RAM to disk.
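For concreteness, here is a quick sketch, not part of the original analysis, of the arithmetic behind the two cost columns: divide a petabyte by per-instance capacity, round up, and multiply by the hourly instance-plus-storage price. It runs in the mongo shell (or any JavaScript environment with a print function); small differences from the table come down to rounding.

    // Rough arithmetic behind the cost table above (illustrative, not from the original post).
    var PETABYTE_TB = 1000;  // 1 PB = 10^15 bytes, roughly 1,000 TB

    var options = [
        { name: "r3.8xlarge / EBS PIOPS", capacityTB: 24,  costPerHr: 15.63 },
        { name: "i2.8xlarge (SSD)",       capacityTB: 6.4, costPerHr: 6.82  },
        { name: "hs1.8xlarge (SATA)",     capacityTB: 46,  costPerHr: 4.60  }
    ];

    options.forEach(function (o) {
        var instances = Math.ceil(PETABYTE_TB / o.capacityTB);  // instances needed for ~1 PB of raw capacity
        var pbPerHr   = instances * o.costPerHr;                // hourly cost to keep that petabyte online
        print(o.name + ": " + instances + " instances, ~$" + pbPerHr.toFixed(0) + "/hr");
    });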
Exploration

So the next step was to experiment with each of our options. Aside from sheer cost, there are practicalities of access to these AWS resources.

EBS PIOPS are a limited resource. By default, each AWS account is currently limited to 40K PIOPS. If we went with the maximum of 24TB and 96K PIOPS per instance, we'd need 42 instances for our petabyte, at 4M total PIOPS. While Amazon would be happy to sell that level of PIOPS usage, it represents a significant capital investment on their part, so they'd want some guarantee it would be used for more than a brief test. As it turned out, thanks to our close corporate partnership with AWS, we had a few days' availability last year of 1M PIOPS, while Amazon was stocking up for their own customers' Christmas rush. It let us get comfortable with petascale usage of EBS PIOPS, but the window wasn't long enough to complete a full petabyte load.

SSD instances are also limited in availability. We did test smaller clusters of i2.8xlarge instances and found that we could fully load them in about two hours. So for $2.2K, we could build a petabyte using a cluster of 160 of these servers. Amazon was cooperative in raising our account limit, but as they warned us, there physically weren't enough of these servers available at any given time in a single region.

We had no problem raising our default limit for hs1.8xlarge instances, or in provisioning up to 30 of them at a time. They could be bulk-loaded in 24 hours, at about $2.5K in cost.

Approach

Having chosen our (virtual) hardware architecture, we had to decide what sort of data and loading system we'd use, along with all the usual configuration choices for hardware and virtualization.

Data Source

The Yahoo Cloud Serving Benchmark (YCSB) has been widely used for testing different kinds of databases. Having to take something of a least-common-denominator approach across all databases, it's almost entirely key-value oriented, using none of the differentiators that have so dramatically set MongoDB apart from other databases. For that reason, it's of little value in making comparisons outside of the bare-bones key-value stores for which it was written. For all that, though, it's readily available, standard, and provides good parallel threading when configured appropriately, so we tend to use it for load generation. However, we have fixed and customized the base build to work better with MongoDB. First, if you are going to use YCSB with MongoDB, you'll at least want to use Achille Brighton's fork of YCSB.
Achille has continued maintenance of YCSB over the last couple of years, including MongoDB Java driver updates, corrections to connection pool sizing, elimination of superfluous error checks, explicit cursor closing, standard write concern levels, and other key updates. Please also contact us at MongoDB for assistance, so that we can provide guidance on configurations appropriate to your requirements. For this project we configured the setup as follows:

- Client setup: for our purpose of parallel bulk loading, on instances that had ample spare CPU, it made the most sense to run YCSB on the shard servers, pointed directly at the local mongod processes. This saves a network hop, circumventing the typical AWS network limitations.
- Server setup: as summarized above, we used Amazon machine instances with a large array of spinning disks, the most cost-effective choice for petascale storage. More on that in a minute...
- Network setup: although our parallel loading approach minimized network overhead, we took advantage of AWS options wherever appropriate.

We also added some project-specific enhancements to YCSB. First, even using large documents, we were going to need more than 2 billion of them to add up to a petabyte, so we had to revise the primary document IDs to use long integers. While we were at it, for efficiency, we changed the type of the IDs from the form "user1234" to the long integer itself. We also wanted document field values that would be more useful than the random byte sequences in YCSB's data fields. We therefore added, for each of the 8 random data fields, a parallel field with an integer value, randomly generated with a zipfian, or long-tail, distribution. These 8 integer fields would support queries with meaningful selection criteria, allowing specification of result sets of any size, according to the zipfian distribution. Moreover, those result sets could use MongoDB Aggregation over the other integer fields, retrieving sums, averages, and so on.

Cloud and System Specifics

For network optimization, we constrained the servers to a single AWS Availability Zone, keeping latency low. Launching all servers in a single AWS Placement Group would have provided even better network locality, but we weren't able to provision 30 of our servers in a single Placement Group. Since our objective was sheer data volume and speed, for purposes of a proof of concept, we configured the 24 disks as a single RAID 0 (striped) array. We used the XFS filesystem with a 128MB log size, mounted as recommended in our docs. If this were for production purposes, you'd probably use RAID 10 (mirrored and striped), which would double the server requirement. For an operating system, we chose Ubuntu 13.10 for its combination of XFS filesystem support and working memory cgroups. From the stock Amazon Machine Image (AMI), we created our own AMI incorporating security updates, Oracle Java 7, recommended ulimit settings, the munin-node monitoring software, Global Shell to run commands across all servers in the cluster, and our modified YCSB. Upon boot, a cloud-init script took care of configuring the disks and launching MongoDB. To monitor the entire cluster, our natural choice was MongoDB Management Service (MMS), which we supplemented with the machine-specific graphing capabilities of AWS CloudWatch.

MongoDB Specifics

Stereotypically, spinning SATA disks have rather low random IOPS performance, around 100 per disk. But when you have 24 of them in RAID 0, you get, well, 24 times the write performance: a perfectly respectable 2.4K IOPS. To provide optimal loading of that array, we used four MongoDB shards on each server; this approach may be warranted when dealing with large or fast disk arrays. To ensure that the mongod server processes did not contend for memory with each other or with the YCSB Java processes, we segregated each of them in its own section of RAM using Linux cgroups. The shards themselves were pre-split with one chunk per shard, evenly distributed over the range of primary IDs that we would load. A script then turned on the balancer process and monitored the migration of these empty chunks, turning the balancer back off once chunks and shards were 1:1. As there would be no benefit to splitting these chunks further, we ran the mongos router processes with auto-splitting disabled.
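To make that concrete, here is a minimal mongo shell sketch of the pre-splitting idea, so the bulk load never waits on chunk splits or migrations. It is an illustration rather than our actual load script: the database name (ycsb), the shard count, and the top of the _id range shown here are assumptions.

    // Illustrative only: pre-split one chunk per shard over the long-integer _id range,
    // then let the balancer distribute the empty chunks before any data is loaded.
    var N_SHARDS = 216;          // total shards (the cluster ran four mongod shards per server)
    var MAX_ID   = 7200000000;   // assumed top of the planned _id range

    sh.enableSharding("ycsb");
    sh.shardCollection("ycsb.usertable", { _id: 1 });

    // One split point per shard boundary, evenly spaced over the _id range.
    for (var i = 1; i < N_SHARDS; i++) {
        var splitPoint = Math.floor(i * (MAX_ID / N_SHARDS));
        sh.splitAt("ycsb.usertable", { _id: NumberLong("" + splitPoint) });
    }

    // Let the balancer spread the empty chunks, then switch it off for the load;
    // mongos itself also ran with auto-splitting disabled.
    sh.startBalancer();
    // ... poll sh.status() until each shard owns exactly one chunk ...
    sh.stopBalancer();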
Since this was intended to be a bulk load, and we wanted to avoid network overhead, we launched a YCSB process locally alongside each shard, writing directly to the mongod server process and using a range of IDs to match the chunk on that shard. We created indexes on the primary key, as required for sharding, and also on one of the zipfian-distributed integer fields, so that we could test queries more "interesting" than the simple key-value retrievals offered by YCSB.

Ready to Run

That's it for the background and decision-making! At this point, we were ready to launch the cluster and see how it would progress toward the petabyte goal. In the next post we'll examine what happened.

Learn more about operations best practices here: DOWNLOAD OPS BEST PRACTICES

About Chris Biow

Chris Biow is Principal Technologist and Technical Director at MongoDB, providing technical leadership for MongoDB field activities, with particular interest in enterprise database Platform-as-a-Service offerings and in the practical frontier of scaling capability. Chris has worked in Big Data and NoSQL since well before the terms were coined, as Principal Architect at Verity/Autonomy, Federal CTO/VP at MarkLogic, and with Naval Research Laboratory as a Reservist. His work includes Big Data projects with Fortune 10 companies, US Department of Defense, National Archives, and Bloomberg News. His career also includes 8 years as a Radar Intercept Officer, flying the aircraft-carrier-based F-14 Tomcat. Chris is a graduate of the United States Naval Academy and earned his Master's degree from the University of Maryland.

November 19, 2014

Building an Inexpensive Petabyte Database with MongoDB and Amazon Web Services: Part 2

Picking Back Up

When we left off in the previous post, we'd chosen the goal of building a petabyte MongoDB database, evaluated the options for achieving it, and selected an approach based on Amazon Web Services (AWS) instances using spinning disk. At least for this pair of posts, we'll skip the details of exactly how we scripted things and look at what happened.

Result

In short, it worked:

    mongos> db.usertable.stats()
    {
        "objects" : 7170648489,
        "avgObjSize" : 147438.99952658816,
        "dataSize" : NumberLong("1057240224818640"),
        "storageSize" : NumberLong("1057709186089744"),
        "numExtents" : 496951,
        "indexes" : 432,
        "indexSize" : 649064661728,
        "fileSize" : NumberLong("1059468228952064"),
        "ok" : 1
    }

Yeah, integer notation doesn't parse well visually once we get past about nine digits. Commifying helps: 1,057,240,224,818,640 bytes. But logarithms are really our friend, so we can express it as bytes = 1.057 x 10^15, or log10(bytes) = 15.024: just over a petabyte. log2(bytes) = 49.9 is a little disappointing, though: we just barely missed 2^50, a pebibyte. I'd actually planned to clear that number, but a few of the YCSB processes aborted early.

It's fun to watch that much data flying through the cluster. We used AWS CloudWatch, which can be configured to provide per-minute reporting and will then graph up to ten servers at a time. Looking at the first ten nodes of the cluster during the load, you can see a degree of asymmetry as the loads progressed, as would be expected across 240 physical SATA drives. Servers took anywhere from 17 to 22 hours to complete their loads, partly due to some YCSB processes (silently) terminating early. Within each node, you can see the completion of the node's four YCSB processes in the step-down pattern from the full, parallel load rate.

The initial load rate per server is just about 50GB/minute, which works out to 844MB/s. iostat indicated sustained disk utilization around 50%, with CPU as the bottleneck, mostly due to system load (i.e., RAID overhead). Alternatively, we can look at operations per minute. Over the 24-drive RAID array on each instance, AWS reports 1.5M IOPM, which works out to 25K IOPS, or about 1K IOPS per drive. That's roughly an order of magnitude more seeks than a 7,200 RPM SATA drive can actually deliver, so AWS is giving credit for OS-level operations that are successfully being coalesced into sequential writes before they reach the physical disks. Our bottleneck actually turns out to be CPU, given that the YCSB clients were running locally, generating a petabyte of data.

At around $100/hr, we didn't want to keep the cluster up for very long, and while we'd modified YCSB to handle more than 2 billion documents for the load, the "run" side hadn't been updated yet. So we limited our read testing to a few relatively simple queries. Here's where the zipfian distribution of the indexed integer field let us tune result-set sizes. This one nicely spreads the load over all 216 shards, with the total response time coming in at just under three seconds:

    mongos> db.usertable.find({fieldInt0: 17000}, {fieldInt0: 1, fieldInt1: 1}).explain()
    ...
        "cursor" : "BtreeCursor fieldInt0_1",
        "n" : 30747,
        "nChunkSkips" : 11116,
        "nYields" : 0,
        "nscanned" : 41863,
        "nscannedAllPlans" : 41863,
        "nscannedObjects" : 41863,
        "nscannedObjectsAllPlans" : 41863,
        "millisShardTotal" : 395210,
        "millisShardAvg" : 1829,
        "numQueries" : 216,
        "numShards" : 216,
        "millis" : 2702
    }

We can also use the MongoDB Aggregation framework to execute more sophisticated distributed queries. Here we used indexed selection on one zipfian integer field to find the top five values of a second integer field, ranked by the sum of a third, showing a (somewhat odd) zipfian profile. The absence of values 2 and 5 among the top five suggests that YCSB's zipfian generator may not be working quite correctly.

    mongos> db.usertable.aggregate(
    ...     {$match: {fieldInt0: 16000}},
    ...     {$group: {_id: "$fieldInt1", total: {$sum: "$fieldInt2"}}},
    ...     {$sort: {total: -1}},
    ...     {$limit: 5}
    ... )
    {
        "result" : [
            { "_id" : 0, "total" : 5160596 },
            { "_id" : 1, "total" : 2367352 },
            { "_id" : 3, "total" : 2243450 },
            { "_id" : 4, "total" : 1402324 },
            { "_id" : 6, "total" : 1061036 }
        ],
        "ok" : 1
    }

We also used the benchRun utility to test parallel execution across the entire cluster, showing that we could achieve a little over 800 queries per second using a findOne() operation on a random integer value against our indexed zipfian integer field.
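For reference, a benchRun() call for that kind of test looks roughly like the sketch below. It is an illustration rather than our exact harness: the database name (ycsb), mongos host, thread count, duration, and the range of random values are all placeholders.

    // Rough sketch of a parallel findOne() test using the mongo shell's benchRun() helper.
    // Point it at a mongos router so the queries fan out across the whole cluster.
    var res = benchRun({
        host:     "mongos-host:27017",   // placeholder for one of the cluster's mongos routers
        parallel: 32,                    // concurrent client threads (placeholder)
        seconds:  60,
        ops: [{
            ns:    "ycsb.usertable",
            op:    "findOne",
            query: { fieldInt0: { "#RAND_INT": [0, 25000] } }  // random value against the indexed zipfian field
        }]
    });
    printjson(res);                      // reports per-op counts and throughput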
Further Work

More petabyte-scale MongoDB

As we saw during the cost analysis, it is, not surprisingly, storage that determines our costs at petascale. hs1 ephemeral storage is the cheapest, but it's still a little over $100/hr to keep a petabyte of it spinning. Being ephemeral, it goes away as soon as we shut the instances down. But it takes less than 24 hours to recreate, so we've got a petabyte on tap any time we want to spend the roughly $3K to rebuild it.

The obvious next step will be to measure more on the full, loaded petabyte, including various mixes of read and update load. That means extending our changes from YCSB's "load" mode to its "run" mode. We can also measure more of the key features by which MongoDB departs from the simple key-value store model, such as secondary indexes and aggregation. That would also give a broad indication of how query rate scales when properly spread over multiple mongos processes. Document size is another dimension to explore, closely tied to the optimal block size of the storage back end. We could also go back to ephemeral or shared SSD storage, which continues to progress quickly in pricing and availability on both AWS and Google Compute Engine. Limited experiments since the petabyte run have shown that we can achieve near-optimal load rates with 10KB documents on spinning disk, 16KB with PIOPS EBS, and as low as 4KB on instance SSD.

More of your requirements

The real key is what you, dear reader, can do with MongoDB. We've collected a number of examples of highly scaled MongoDB implementations from among our customers, but every dimension of scalability can be extended further: cluster scale, data size, read and write rates, and varieties of data and application. Please let me know how you've been able to push the boundaries and where you'd like to go further. I'm now leading a group of Performance Champions within MongoDB, and we'd love to help with your tough challenges.

If you'd like to learn more, try starting with our Operations Best Practices white paper: Download Ops Best Practices

About the Author

Chris Biow is Principal Technologist and Technical Director at MongoDB, providing technical leadership for MongoDB field activities, with particular interest in enterprise database Platform-as-a-Service offerings and in the practical frontier of scaling capability. Chris has worked in Big Data and NoSQL since well before the terms were coined, as Principal Architect at Verity/Autonomy, Federal CTO/VP at MarkLogic, and with Naval Research Laboratory as a Reservist.
His work includes Big Data projects with Fortune 10 companies, US Department of Defense, National Archives, and Bloomberg News. His career also includes 8 years as a Radar Intercept Officer, flying the aircraft-carrier-based F-14 Tomcat. Chris is a graduate of the United States Naval Academy and earned his Master’s degree from the University of Maryland.

November 19, 2014