Why Open Source Is Essential To Big Data
Gartner analyst Merv Adrian recently highlighted some of the latest movements in Hadoop Land, with several companies introducing products "intended to improve Hadoop speed." This seems odd, as speed wouldn't be my top pick for how to improve Hadoop or, really, most of the Big Data technologies out there. By many accounts, the biggest need in Hadoop is improved ease of use, not improved performance, something Adrian himself confirms: Hadoop already delivers exceptional performance on commodity hardware, compared to its stodgy proprietary competition. Where it's still lacking is in ease of use.
Not that Hadoop is alone in this. As Mare Lucas asserts, "Today, despite the information deluge, enterprise decision makers are often unable to access the data in a useful way. The tools are designed for those who speak the language of algorithms and statistical analysis. It’s simply too hard for the everyday user to 'ask' the data any questions – from the routine to the insightful. The end result? The speed of big data moves at a slower pace … and the power is locked in the hands of the few."
Lucas goes on to argue that the solution to the data scientist shortage is to take the science out of data science; that is, to consumerize Big Data technology so that non-PhD-wielding business people can query their data and get back meaningful results.
The Value Of Open Source To Deciphering Big Data
Perhaps. But there's actually an intermediate step before we reach the Promised Land of full consumerization of Big Data. It's called open source. Even with technology like Hadoop that is open source yet still too complex, the benefits of using Hadoop far outweigh the costs (financial and productivity-wise) associated with licensing an expensive data warehousing or analytics platform. As Alex Popescu writes, Hadoop "allows experimenting and trying out new ideas, while continuing to accumulate and storing your data. It removes the pressure from the developers. That’s agility."
But these benefits aren't unique to Hadoop. They're inherent in any open-source project. Now imagine we could get open-source software that fits our Big Data needs, is exceptionally easy to use, and is almost certainly already being used within our enterprises. That is the promise of MongoDB, consistently cited as one of the industry's top-two Big Data technologies. MongoDB makes it easy to get started with a Big Data project.
Using MongoDB To Innovate
Consider the City of Chicago. The Economist wrote recently about the City of Chicago's predictive analytics platform, WindyGrid. What The Economist didn't mention is that WindyGrid started as a pet project on chief data officer Brett Goldstein's laptop. Goldstein started with a single MongoDB node and iterated from there, turning it into one of the most exciting data-driven applications in the industry today. Given that we often don't know exactly which data to query, how to query it, or how to put data to work in our applications, this is precisely how a Big Data project should work. Start small, then iterate toward something big.
This kind of tinkering is difficult, if not impossible, with a relational database, as The Economist's Kenneth Cukier points out in his book, Big Data: A Revolution That Will Transform How We Live, Work, and Think: "Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them - and only them - efficiently."
But with a flexible document database like MongoDB, it suddenly becomes much easier to iterate toward Big Data insights. We don't need to go out and hire data scientists. Rather, we simply need to apply existing, open-source technology like MongoDB to our Big Data problems, which jibes perfectly with Gartner analyst Svetlana Sicular's mantra that it's easier to train existing employees on Big Data technologies than it is to train data scientists on one's business. Except, in the case of MongoDB, odds are that enterprises are already filled with people who understand MongoDB, as 451 Research's LinkedIn analysis suggests.
In sum, Big Data needn't be daunting or difficult. It's a download away.
NoSQL is the new normal (the video!)
As enterprises become more data-driven, NoSQL is increasingly the "new normal" for data infrastructure. 10gen vice president Matt Asay presented on this topic at QCon in London (video and slides available here), and will be presenting a webinar on this topic on Tuesday, April 23. Please register today to learn why NoSQL is the new normal, including within your enterprise.
Making sense of increased database choice
Gartner estimates that by 2015, 25 percent of new databases deployed will be of technologies supporting alternative data types and non-traditional data structures. This is great news, as these new database choices, many of them NoSQL, are generally better tuned to modern application requirements.
The downside to this end to the “30-year old freeze,” to quote RedMonk analyst James Governor, is that with all these new options comes the risk of complicating a hitherto somewhat simple choice: which database to use? DB-Engines, after all, lists and ranks 92 different database systems, and that list doesn’t even include all of the NoSQL variants. Good luck to the CIO who tries to deploy all of those within her enterprise.
The key, then, is to figure out how to standardize on a core of database technologies. Most companies will want to retain their legacy relational database for applications that are tuned to an RDBMS or that require complex transactions. But for most new applications, NoSQL databases like MongoDB will be the optimal solution.
But which one? There are currently at least 150 different NoSQL databases, split into different camps: document, columnar, key-value, graph, and others. One of my favorite guides for differentiating between these options is Pramod Sadalage and Martin Fowler’s NoSQL Distilled. It does a great job of making NoSQL approachable, and also offers some guidance on which type of database to apply to specific types of problems. This is critical: which database is best largely depends on the particular use case.
There is no shortage of guidance as to whether an enterprise should use NoSQL or stick with RDBMS or, if NoSQL, which to use (here’s just one of many sites offering guidance). Unfortunately, this still doesn’t cut down on the number of choices presented to a developer interested in selecting a database for her application. I’m sure much of the advice is good, but it could end up solving a point problem (which database to use for a particular application) while exacerbating the meta problem (which databases to standardize on throughout the enterprise).
This should be top-of-mind for every CIO, as shadow IT is already bringing NoSQL databases into the enterprise. This trend is only going to accelerate, as InfoWorld’s Bob Lewis notes. The reasons NoSQL technologies are being adopted into the enterprise are somewhat similar to the reasons shadow IT is embracing the public cloud: speed of development, ease of development, and suitability for modern applications, as a recent Forrester survey found.
Hence, savvy CIOs will select a few broadly applicable databases that can tackle the vast majority of enterprise needs, while simultaneously satisfying developers’ needs for databases that help them get their work done. But, again, which ones?
Most enterprises already have RDBMS preferences, standardizing on two and possibly three SQL databases. Part of the reason that these databases have served so many for so long is that they are general-purpose databases. They might not be the absolute perfect solution to a particular application requirement, but they do the job well enough and help the enterprise focus its resources.
When choosing a NoSQL database, and every enterprise is going to need to do this, it’s important to opt for NoSQL databases that solve a wide variety of problems, rather than addressing niche requirements with a narrowly applicable database. Document data stores like MongoDB tend to be the most broadly applicable, able to tackle a wide array of workloads. But there are other NoSQL databases that, while not as generally useful, do a few things really well and should be considered.
Other things to consider in settling on database standards are political and cultural issues, compatibility with existing applications or applications on the near- and long-term roadmap, and the momentum behind a particular NoSQL database. With 150-plus NoSQL databases to choose from, picking a fashionable but ephemeral database is a recipe for frustration and failure. As I’ve written, MongoDB’s community size and momentum, among other things, suggest it will be around for a long, long time. But there are other NoSQL communities that also demonstrate staying power.
No enterprise wants to be managing dozens of databases, or even 10. Ideally, enterprises will settle on a few, perhaps five at most. In so doing, they should look to augment their RDBMS standards with NoSQL databases that are general purpose in nature and broadly adopted. Considered in this light, NoSQL database standardization becomes much more manageable.
Posted by Matt Asay, vice president of Corporate Strategy.