Building an Inexpensive Petabyte Database with MongoDB and Amazon Web Services: Part 2

Chris Biow


Picking Back Up

When we left off in the previous post, we’d chosen a goal of building a petabyte MongoDB database, evaluated the options for achieving it, and selected an approach based on Amazon Web Services (AWS) instances using spinning disk.

At least for this pair of posts, we’ll skip details on exactly how we scripted things, and look at what happened.


In short, it worked:

mongos> db.usertable.stats()
{
	"objects" : 7170648489,
	"avgObjSize" : 147438.99952658816,
	"dataSize" : NumberLong("1057240224818640"),
	"storageSize" : NumberLong("1057709186089744"),
	"numExtents" : 496951,
	"indexes" : 432,
	"indexSize" : 649064661728,
	"fileSize" : NumberLong("1059468228952064"),
	"ok" : 1
}

Yeah, integer notation doesn’t parse well visually when we get past about nine digits. Commifying helps: 1,057,240,224,818,640 bytes. But logarithms are really our friend, so we can express it as

  • bytes = 1.057 × 10^15
  • log10(bytes) = 15.024, just over a petabyte.
  • log2(bytes) = 49.9. Disappointing: we just barely missed 2^50, or a pebibyte. I’d actually planned to get over that number, but a few of the YCSB processes aborted early.

It’s fun to watch that much data flying through the cluster. We used AWS CloudWatch, which can be configured to provide per-minute reporting and then will graph up to ten servers at a time. Here’s a look over the first ten nodes of the cluster during the load.

You can see a degree of asymmetry as the loads progressed, as would be expected across 240 physical SATA drives. Servers took anywhere from 17 to 22 hours to complete their loads, partly due to some YCSB processes (silently) terminating early. Within each node, you can see the completion of the node’s four YCSB processes in the step-down pattern from the full, parallel load rate. The initial load rate per server is just about 50GB/minute, which works out to 844MB/s. iostat indicated sustained disk utilization around 50%, with CPU as the bottleneck, mostly due to system load (i.e., RAID overhead).

Alternatively, we can look at operations per minute.

Over the 24-drive RAID array on each instance, AWS is reporting 1.5M IOPM, which works out to 25K IOPS, or about 1K IOPS per drive. That’s roughly an order of magnitude more seeks than a 7,200 RPM SATA drive can actually deliver (on the order of 100 random IOPS), so AWS is giving credit for OS-level operations that are successfully being optimized into sequential writes before they reach the physical disks.

Our bottleneck actually turns out to be CPU, given that the YCSB clients were running locally, generating a petabyte of data:

At around $100/hr, we didn’t want to keep the cluster up for very long. While we’d modified YCSB to handle >2B documents for load, the “run” side hadn’t been updated yet, so we limited our read testing to a few relatively simple queries. The zipfian distribution of the indexed integer field let us tune result-set sizes. Here’s a query that nicely spreads the load over all 216 shards, with the total response time coming in at just under three seconds:

db.usertable.find({fieldInt0: 17000}, {fieldInt0: 1, fieldInt1: 1}).explain()
{
	"cursor" : "BtreeCursor fieldInt0_1",
	"n" : 30747,
	"nChunkSkips" : 11116,
	"nYields" : 0,
	"nscanned" : 41863,
	"nscannedAllPlans" : 41863,
	"nscannedObjects" : 41863,
	"nscannedObjectsAllPlans" : 41863,
	"millisShardTotal" : 395210,
	"millisShardAvg" : 1829,
	"numQueries" : 216,
	"numShards" : 216,
	"millis" : 2702
}

We can also use the MongoDB Aggregation Framework to execute more sophisticated distributed queries. Here we used indexed selection on one zipfian integer field, then summed a second field grouped by the five most common values of a third, showing a (somewhat odd) zipfian profile. The absence of values 2 and 5 suggests that YCSB’s zipfian generator may not be working quite correctly.

mongos> db.usertable.aggregate(
...   {$match: {fieldInt0: 16000}},
...   {$group: {_id: "$fieldInt1", total: {$sum: "$fieldInt2"}}},
...   {$sort: {total: -1}},
...   {$limit: 5}
... )
{
	"result" : [
		{ "_id" : 0, "total" : 5160596 },
		{ "_id" : 1, "total" : 2367352 },
		{ "_id" : 3, "total" : 2243450 },
		{ "_id" : 4, "total" : 1402324 },
		{ "_id" : 6, "total" : 1061036 }
	],
	"ok" : 1
}

We also used the benchRun utility to test parallel execution across the entire cluster, showing that we could achieve a little over 800 queries per second using a findOne() operation on a random integer value against our indexed zipfian integer field.

Further Work

More petabyte scale MongoDB

As we saw during the cost analysis, it is, not surprisingly, storage that dominates our costs at petascale. hs1 ephemeral storage is the cheapest, but it’s still a little over $100/hr to keep a petabyte of it spinning. Being ephemeral, it goes away as soon as we shut the instances down. But it takes less than 24 hours to recreate, so we’ve got a petabyte on tap any time we want to spend the roughly $3K to rebuild it.

The obvious next step will be to measure more on the full, loaded petabyte, including various mixes of read and update load. That means extending our changes from YCSB’s “load” mode to its “run” mode. We can also measure more about the key features of MongoDB by which we depart from the simple key/value store model, such as secondary indexes and aggregation. This would also give a broad indication of how query rate scales, when properly spread over multiple mongos processes.

Document size is another dimension to explore, which is closely tied to the optimal block size of the storage back-end. We could go back to ephemeral or shared SSD storage, which continues to progress quickly in pricing and availability on both AWS and Google Compute Engine. Limited experiments since the petabyte run have shown that we can achieve near-optimal load rates with 10KB documents on spinning disk, 16KB with PIOPS EBS, and as low as 4KB for instance SSD.

More of your requirements

The real key is what you, dear reader, can do with MongoDB. We’ve collected a number of examples of highly-scaled MongoDB implementations from among our customers. But every dimension of scalability can be extended further: cluster scale, data size, read and write rates, and varieties of data and application. Please let me know how you’ve been able to push the boundaries and where you’d like to go further. I’m now leading a group of Performance Champions within MongoDB, and we’d love to help with your tough challenges.

If you'd like to learn more, try starting with our Operations Best Practices white paper.


About the Author

Chris Biow is Principal Technologist and Technical Director at MongoDB, providing technical leadership for MongoDB field activities, with particular interest in enterprise database Platform-as-a-Service offerings and in the practical frontier of scaling capability. Chris has worked in Big Data and NoSQL since well before the terms were coined, as Principal Architect at Verity/Autonomy, Federal CTO/VP at MarkLogic, and with Naval Research Laboratory as a Reservist. His work includes Big Data projects with Fortune 10 companies, US Department of Defense, National Archives, and Bloomberg News. His career also includes 8 years as a Radar Intercept Officer, flying the aircraft-carrier-based F-14 Tomcat. Chris is a graduate of the United States Naval Academy and earned his Master’s degree from the University of Maryland.