Why Open Source Is Essential To Big Data
Gartner analyst Merv Adrian recently highlighted some of the latest movements in Hadoop Land, with several companies introducing products "intended to improve Hadoop speed." This seems odd, as speed wouldn't be my top pick for how to improve Hadoop or, really, most of the Big Data technologies out there. By many accounts, the biggest need in Hadoop is improved ease of use, not improved performance, something Adrian himself confirms: Hadoop already delivers exceptional performance on commodity hardware, compared to its stodgy proprietary competition. Where it's still lacking is in ease of use.

Not that Hadoop is alone in this. As Mare Lucas asserts,

"Today, despite the information deluge, enterprise decision makers are often unable to access the data in a useful way. The tools are designed for those who speak the language of algorithms and statistical analysis. It’s simply too hard for the everyday user to “ask” the data any questions – from the routine to the insightful. The end result? The speed of big data moves at a slower pace … and the power is locked in the hands of the few."

Lucas goes on to argue that the solution to the data scientist shortage is to take the science out of data science; that is, consumerize Big Data technology such that non-PhD-wielding business people can query their data and get back meaningful results.

The Value Of Open Source To Deciphering Big Data

Perhaps. But there's actually an intermediate step before we reach the Promised Land of full consumerization of Big Data. It's called open source. Even though an open-source technology like Hadoop is still too complex, the benefits of using it far outweigh the costs (financial and productivity-wise) associated with licensing an expensive data warehousing or analytics platform. As Alex Popescu writes, Hadoop "allows experimenting and trying out new ideas, while continuing to accumulate and storing your data. It removes the pressure from the developers. That’s agility."
But these benefits aren't unique to Hadoop. They're inherent in any open-source project. Now imagine we could get open-source software that fits our Big Data needs, is exceptionally easy to use, and is almost certainly already being used within our enterprises. That is the promise of MongoDB, consistently cited as one of the industry's top-two Big Data technologies. MongoDB makes it easy to get started with a Big Data project.

Using MongoDB To Innovate

Consider the City of Chicago. The Economist wrote recently about the City of Chicago's predictive analytics platform, WindyGrid. What The Economist didn't mention is that WindyGrid started as a pet project on chief data officer Brett Goldstein's laptop. Goldstein started with a single MongoDB node and iterated from there, turning it into one of the most exciting data-driven applications in the industry today.

Given that we often don't know exactly which data to query, or how to query it, or how to put data to work in our applications, this is precisely how a Big Data project should work. Start small, then iterate toward something big.

This kind of tinkering is difficult, if not impossible, with a relational database, as The Economist's Kenneth Cukier points out in his book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

"Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them - and only them - efficiently."

But with a flexible document database like MongoDB, it suddenly becomes much easier to iterate toward Big Data insights. We don't need to go out and hire data scientists.
Rather, we simply need to apply existing, open-source technology like MongoDB to our Big Data problems, which jibes perfectly with Gartner analyst Svetlana Sicular's mantra that it's easier to train existing employees on Big Data technologies than it is to train data scientists on one's business. Except, in the case of MongoDB, odds are that enterprises are already filled with people who understand MongoDB, as 451 Research's LinkedIn analysis suggests.

In sum, Big Data needn't be daunting or difficult. It's a download away.
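To make the "start small, then iterate" point concrete, here is a minimal sketch in plain Python. It assumes no MongoDB server or driver; the records, the field names, and the `find` helper are all hypothetical and are not drawn from WindyGrid. It only illustrates the idea that documents of differing shapes can sit side by side and still be queried, so a new field can appear later without redesigning anything up front.

```python
from typing import Any

# Hypothetical records, made up for illustration only.
# Newer records carry fields that older ones lack.
records: list[dict[str, Any]] = [
    {"type": "pothole", "ward": 12},
    {"type": "pothole", "ward": 12, "reported_via": "phone"},
    {"type": "streetlight", "ward": 7, "location": {"lat": 41.88, "lon": -87.63}},
]


def find(docs: list[dict[str, Any]], **criteria: Any) -> list[dict[str, Any]]:
    """Return the documents whose fields equal every given criterion.

    A document that lacks a requested field simply does not match.
    """
    return [
        doc
        for doc in docs
        if all(key in doc and doc[key] == value for key, value in criteria.items())
    ]


# First iteration: a question that only needs the original fields.
potholes = find(records, type="pothole")

# Later iteration: query a field that only newer records have.
# Older records needed no migration to make this possible.
by_phone = find(records, reported_via="phone")
```

A document database applies the same idea at scale, with persistence and indexing; the sketch only shows why the questions do not have to be fixed before the data starts accumulating.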
MongoDB powers Mappy Health's tweet-based disease tracking
Pearson / OpenClass Uses MongoDB for Social Learning Platform
We recently spoke with Brian Carpio of Pearson about OpenClass, a new project from Pearson with deep Google integration.

What is OpenClass?

OpenClass is a dynamic, scalable, fully cloud-based learning environment that goes beyond the LMS. OpenClass stimulates social learning and the exchange of content, coursework, and ideas, all from one integrated platform. OpenClass has all the LMS functionality needed to manage courses, but that's just the beginning.

Why did you decide to adopt MongoDB for OpenClass?

OpenClass leverages MongoDB as one of its primary databases because it offers serious scalability and improved productivity for our developers. With MongoDB, our developers can start working on applications immediately, rather than slogging through the upfront planning and DBA time that relational database systems require. Also, given that a big part of the OpenClass story will be how we integrate with both public and private cloud technologies, MongoDB's support for scale-out on commodity hardware is a better fit than traditional scale-up relational database systems that generally must run on big iron hardware.

Can you tell us about how you’ve deployed MongoDB?

Currently we deploy MongoDB in our world-class datacenters and in Amazon's EC2 cloud environment, with future plans to move to private cloud technologies such as OpenStack. We leverage both Puppet and Fabric for deployment automation and rolling upgrades. We also leverage Zabbix and the mikoomi plugin for monitoring our MongoDB production servers. Currently each OpenClass feature / application leverages its own MongoDB replica set, and we expect to need MongoDB’s sharding features given the expected growth trajectory for OpenClass.

What recommendations would you give to other operations teams deploying MongoDB for the first time?

Automate everything!
Also, work closely with your development teams as they begin to design an application that leverages MongoDB, which is good advice for any new application that will be rolled into production. I would also say to look at Zabbix, as it has some amazing features for monitoring MongoDB in a single replica set or in a sharded configuration that can help you easily identify bottlenecks and recognize when it’s time to scale out your MongoDB deployment. Finally, I would suggest joining the #mongodb IRC channel, as well as the MongoDB Google Group, and don't be afraid to ask questions. I personally ask a lot of questions in the MongoDB Google Group and receive great answers not only from 10gen CTO Eliot Horowitz (although he does seem to answer a lot of my questions) but also from many other 10gen folks.

What is in store for the future with MongoDB at Pearson?

Our MongoDB footprint is only going to continue to grow. More and more development teams are playing with MongoDB as the foundation of their new application or OpenClass feature. We are working on migrating functionality out of both Oracle and Microsoft SQL Server to MongoDB where it makes sense, to relieve the current stress on those incumbent database technologies.

Thanks to Brian for telling us about OpenClass! Brian also blogs at www.briancarpio.com; be sure to check out his posts on MongoDB there.
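The "one replica set per feature" layout Brian describes can be sketched as a small helper that builds a standard MongoDB connection string for each feature. This is only an illustration: the feature names, hostnames, replica set names, and the `replica_set_uri` helper are hypothetical, not Pearson's actual configuration. The one real convention it relies on is the `mongodb://` URI format, with a comma-separated host list and the `replicaSet` option.

```python
from urllib.parse import quote


def replica_set_uri(hosts: list[str], database: str, replica_set: str) -> str:
    """Build a MongoDB connection URI that lists the members of one replica set."""
    if not hosts:
        raise ValueError("at least one host is required")
    seed_list = ",".join(hosts)
    return f"mongodb://{seed_list}/{quote(database)}?replicaSet={quote(replica_set)}"


# Hypothetical features and hosts: each feature gets its own replica set.
FEATURE_HOSTS = {
    "gradebook": [
        "gb-db1.example.com:27017",
        "gb-db2.example.com:27017",
        "gb-db3.example.com:27017",
    ],
    "discussions": [
        "disc-db1.example.com:27017",
        "disc-db2.example.com:27017",
        "disc-db3.example.com:27017",
    ],
}

# One URI per feature, so each application talks only to its own replica set.
uris = {
    feature: replica_set_uri(hosts, database=feature, replica_set=f"rs-{feature}")
    for feature, hosts in FEATURE_HOSTS.items()
}
```

Keeping the mapping in one place like this also fits the "automate everything" advice, since a tool such as Puppet or Fabric could generate each application's settings from the same data.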