Building your business is hard. Scaling your business data should not be.
That's the message Sailthru CTO and co-founder Ian White relayed recently in New York. Over the course of a half-hour, White explained how Sailthru first did application-level sharding of its data out of necessity, but later moved to MongoDB's auto-sharding to massively simplify development.
Success in the Billions
Sailthru makes it easy for ecommerce and media brands to personalize content across a variety of channels, including email, onsite, mobile, social and more. As the company's customer base has swelled to over a billion users, 125 million content documents (e.g., URLs and products relevant to particular users) and 5 billion messages per month, Sailthru has come to store over 40 terabytes of data in MongoDB across 120 nodes on mostly physical infrastructure.
As White suggests, "You can’t store this volume of data on just one node. We had to shard."
Application-level Sharding at Sailthru
When Sailthru first started, it didn't need sharding. But within two years Sailthru's customer count and data volumes were high enough that the company needed to partition its data. The question was: How?
While some applications are either read heavy (online media site) or write heavy (logging and clickstream), Sailthru is both. As White explains, "We have to be able to read data and write personalized recommendations in real-time. MongoDB is a great database for this."
Sailthru adopted MongoDB in the early days, over four years ago. Prior to MongoDB 1.6, Sailthru partitioned much of its infrastructure using in-app sharding logic, as MongoDB didn't yet support auto-sharding. Sailthru partitioned data by client. Its application would examine each query and dispatch it to the appropriate replica set and collections based on a mapping configuration. This approach worked fine for a time at Sailthru.
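To make the idea concrete, here is a minimal Python sketch of what client-based routing against a mapping configuration might look like. It is an illustration only, not Sailthru's actual code: the names SHARD_MAP, Target and route, and the client, replica set and collection names, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Target:
    """Where a client's data lives (hypothetical structure)."""
    replica_set: str
    collection: str


# Hypothetical mapping configuration: one entry per client.
# Onboarding a large client means adding a line here and redeploying.
SHARD_MAP = {
    "client_a": Target("rs0", "profiles_a"),
    "client_b": Target("rs0", "profiles_b"),
    "client_c": Target("rs1", "profiles_c"),
}


def route(query: dict) -> Target:
    """Inspect a query and pick the replica set and collection for its client."""
    client_id = query["client_id"]
    try:
        return SHARD_MAP[client_id]
    except KeyError:
        raise LookupError(
            f"no shard mapping for client {client_id!r}; add one and redeploy"
        ) from None
```

Every query path in the application has to call something like route() before touching the database, which is where the code complexity and the manual config changes come from.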
However, as Sailthru's data grew, application-level sharding introduced significant code complexity and administration overhead. Application-level sharding also contributed to uneven load distribution, something Sailthru was able to patch over by scaling up with more expensive servers. But the database team still had to manually rebalance and reallocate resources: every time Sailthru onboarded a sizable client that required a new shard, the database team would have to go in, add another line to the config file and redeploy. It was painful and demanding.
Since the introduction of automatic sharding in 2010's 1.6 release, the database itself distributes and balances data across shards automatically. Sharding is transparent to applications: whether there is 1 shard or 100, the application code is the same.
Setting up a sharded cluster involves making a critical decision: choosing a shard key. The shard key is the value the database uses to determine placement of the document within shards. The Sailthru team considered several options, including sharding on client ID, MongoDB ID, or email. MongoDB supports multiple sharding strategies, and each is appropriate for different use cases. Ultimately, they opted to use hash-based sharding and MongoDB's ObjectId as the shard key. With this approach, MongoDB does the work of ensuring a uniform distribution of reads and writes by randomizing the placement of documents across shards.
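The effect of hashing the shard key can be illustrated with a small, self-contained Python simulation. This is a simplified sketch, not MongoDB's internal algorithm: it assumes four shards, uses a modulo over an MD5 digest in place of MongoDB's real chunk assignment, and generates stand-in 12-byte IDs that only mimic the time-ordered shape of an ObjectId. All function names are hypothetical.

```python
import bisect
import hashlib
from collections import Counter

NUM_SHARDS = 4  # assumed cluster size for the illustration


def fake_object_id(i: int) -> bytes:
    """Stand-in for an ObjectId: a 4-byte timestamp followed by an 8-byte counter."""
    return (1_400_000_000 + i // 1000).to_bytes(4, "big") + i.to_bytes(8, "big")


def hashed_shard(key: bytes, num_shards: int = NUM_SHARDS) -> int:
    """Place a document by hashing its key (simplified: digest modulo shard count)."""
    digest = hashlib.md5(key).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def ranged_shard(key: bytes, boundaries: list) -> int:
    """Place a document by key range, given sorted split points between shards."""
    return bisect.bisect_right(boundaries, key)


existing_ids = [fake_object_id(i) for i in range(10_000)]
new_ids = [fake_object_id(i) for i in range(10_000, 11_000)]

# Hashed placement spreads the existing documents across all shards.
hashed_counts = Counter(hashed_shard(k) for k in existing_ids)

# Range placement with split points taken from the existing data:
# every newly generated (larger) ID lands on the last shard.
boundaries = [existing_ids[2_500], existing_ids[5_000], existing_ids[7_500]]
ranged_new_counts = Counter(ranged_shard(k, boundaries) for k in new_ids)

print("hashed:", dict(hashed_counts))
print("ranged, new writes only:", dict(ranged_new_counts))
```

Because ObjectId values increase over time, range-based placement on the raw ID would send new writes to a single shard, while hashing the same key spreads them roughly evenly. In a real cluster none of this logic lives in the application; the database handles it once the collection is sharded on a hashed key.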
To make the actual migration from application-level sharding to auto-sharding, the team used an open source tool created by MongoDB called MongoConnector. In the process of the migration, Sailthru forked the project, making significant contributions specific to their use case.
With this change, it’s now possible for Sailthru to add shards without making any change to the code base. This meant that during a critical ramp-up time of tight resources and tight cash, Sailthru was able to focus their engineering efforts on improving their service and building new features, ensuring their phenomenal success.
Build the Next Big Thing on MongoDB
Thousands of organizations use MongoDB to build high-performance systems at scale.
If you're interested in reading up on your own, download our Operations Best Practices white paper for additional information on operating and deploying a MongoDB system:
Ops Best Practices
About Kelly Stirman
Kelly Stirman is Director of Products at MongoDB. Kelly works closely with customers, partners and the open-source community to articulate how MongoDB is quickly becoming the world's most popular database. For over 15 years he has worked at the forefront of database technologies. Prior to MongoDB, Kelly served in executive and leadership roles at Hadapt, MarkLogic, Oracle, GE, and PricewaterhouseCoopers.