
Why Open Source Is Essential To Big Data

Gartner analyst Merv Adrian recently highlighted some of the recent movements in Hadoop Land, with several companies introducing products "intended to improve Hadoop speed." This seems odd, as that wouldn't be my top pick for how to improve Hadoop or, really, most of the Big Data technologies out there. By many accounts, the biggest need in Hadoop is improved ease of use, not improved performance, something Adrian himself confirms: Hadoop already delivers exceptional performance on commodity hardware compared to its stodgy proprietary competition. Where it's still lacking is in ease of use.

Not that Hadoop is alone in this. As Mare Lucas asserts:

"Today, despite the information deluge, enterprise decision makers are often unable to access the data in a useful way. The tools are designed for those who speak the language of algorithms and statistical analysis. It's simply too hard for the everyday user to 'ask' the data any questions – from the routine to the insightful. The end result? The speed of big data moves at a slower pace … and the power is locked in the hands of the few."

Lucas goes on to argue that the solution to the data scientist shortage is to take the science out of data science; that is, to consumerize Big Data technology such that non-PhD-wielding business people can query their data and get back meaningful results.

The Value Of Open Source To Deciphering Big Data

Perhaps. But there's actually an intermediate step before we reach the Promised Land of full consumerization of Big Data. It's called open source. Even with technology like Hadoop, which is open source yet still too complex, the benefits of using Hadoop far outweigh the costs (financial and productivity-wise) associated with licensing an expensive data warehousing or analytics platform. As Alex Popescu writes, Hadoop "allows experimenting and trying out new ideas, while continuing to accumulate and storing your data. It removes the pressure from the developers. That's agility."
But these benefits aren't unique to Hadoop. They're inherent in any open-source project. Now imagine open-source software that fits our Big Data needs, is exceptionally easy to use, and is almost certainly already being used within our enterprises. That is the promise of MongoDB, consistently cited as one of the industry's top-two Big Data technologies. MongoDB makes it easy to get started with a Big Data project.

Using MongoDB To Innovate

Consider the City of Chicago. The Economist recently wrote about the city's predictive analytics platform, WindyGrid. What The Economist didn't mention is that WindyGrid started as a pet project on chief data officer Brett Goldstein's laptop. Goldstein started with a single MongoDB node and iterated from there, turning it into one of the most exciting data-driven applications in the industry today. Given that we often don't know exactly which data to query, or how to query it, or how to put it to work in our applications, this is precisely how a Big Data project should work: start small, then iterate toward something big. This kind of tinkering is difficult, if not impossible, with a relational database, as The Economist's Kenneth Cukier points out in his book, Big Data: A Revolution That Will Transform How We Live, Work, and Think:

"Conventional, so-called relational, databases are designed for a world in which data is sparse, and thus can be and will be curated carefully. It is a world in which the questions one wants to answer using the data have to be clear at the outset, so that the database is designed to answer them - and only them - efficiently."

But with a flexible document database like MongoDB, it suddenly becomes much easier to iterate toward Big Data insights. We don't need to go out and hire data scientists.
Rather, we simply need to apply existing, open-source technology like MongoDB to our Big Data problems, which jibes perfectly with Gartner analyst Svetlana Sicular's mantra that it's easier to train existing employees on Big Data technologies than it is to train data scientists on one's business. Except, in the case of MongoDB, odds are that enterprises are already filled with people who understand MongoDB, as 451 Research's LinkedIn analysis suggests. In sum, Big Data needn't be daunting or difficult. It's a download away.
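The "start small, then iterate" pattern Cukier's quote motivates can be sketched without any server at all. This is an illustration only, not WindyGrid's code: plain Python dicts stand in for a MongoDB collection, and the field names (`ward`, `priority`) are hypothetical. The point is that a document model lets new questions add new fields without an upfront schema migration.

```python
# Illustrative sketch: Python dicts stand in for a MongoDB collection,
# so no live server is assumed. Field names here are invented examples.

def find(collection, **criteria):
    """Return documents matching all given field/value pairs."""
    return [doc for doc in collection
            if all(doc.get(k) == v for k, v in criteria.items())]

incidents = []

# Week 1: we only thought to record a type and a location.
incidents.append({"type": "pothole", "ward": 42})

# Week 3: a new question arises, so newer documents simply carry
# more fields -- no ALTER TABLE, no migration of older records.
incidents.append({"type": "streetlight", "ward": 42, "priority": "high"})

# Old and new documents remain queryable side by side.
assert len(find(incidents, ward=42)) == 2
assert len(find(incidents, priority="high")) == 1
```

In a relational store, the week-3 question would have forced a schema change before the first new record could land; here the collection simply absorbs it.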

May 2, 2013

Missing the trees for the forest in Big Data

Yes, Big Data is a big deal. No, it’s not the be-all, end-all. And it’s certainly not going to replace real people anytime soon. Two articles serve to remind us of this.

The first is David Brooks’ excellent New York Times op-ed piece, entitled “What Data Can’t Do.” As cool as it is that the City of Chicago uses Big Data (and MongoDB) to improve services and lower crime, Brooks notes plenty of areas where data simply can’t solve our problems. As he notes:

- Data falls down on social analysis (“Computer-driven data analysis, on the other hand, excels at measuring the quantity of social interactions but not the quality”)
- Data struggles with context (“Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel”)
- Data creates bigger haystacks (Most “statistically significant correlations…are spurious and deceive us when we're trying to understand a situation”)
- Big data has trouble with big problems (“we've had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides”)
- Data favors memes over masterpieces (A product that starts unpopular because it’s misunderstood won’t be divined through Big Data analysis)
- Data obscures values (“data is never raw; it's always structured according to somebody's predispositions and values”)

In sum, data analysis doesn’t replace the rough and messy human reading of the world. Some data is helpful, but too much data simply blinds us to our human response to the world. We become paralyzed by the growing data hoard, and forget to follow the hunches we have based on imperfect data.

Which brings us to the second article, Dave Mandl’s “Big Data: Why it’s not always that big nor even that clever” in The Register. Mandl’s basic point is that most businesses, most of the time, don’t need gargantuan quantities of net new data to make informed decisions.
He writes:

“In the era of Big Data, larger-than-ever datasets are often cited as an issue that nearly everyone has to contend with, and for which the previous generation of tools is practically useless. But for the most part, [my Big Data peers] use Python scripts and C++. It's true that many huge data-consumers now make use of massively parallel architecture, clusters, and the cloud, but this move has been going on for more than a decade and, as my quant friend points out, …people confuse doing things in the cloud with what you do in the cloud. Just because the data is in the cloud doesn't mean you're doing something different. Using distributed databases for speed and redundancy makes sense no matter what kind of work you're doing, given the ever-plummeting cost of hardware.”

As he concludes, “‘Data’ hasn't become the root of all evil overnight, any more than it's become the only thing that matters.”

Even for so-called “Big Data” projects, we generally aren’t talking about massive quantities of data. SAP polled its customer base (which skews toward large enterprises) and found that more than half have data set sizes (for Big Data projects) of between 100TB and 500TB, and 24% average only 10TB. Big? Sure. But really big? Not really.

That word, “big,” arguably causes more problems than it solves. Despite Gartner including volume, variety, and velocity of data in its definition of Big Data, we fixate on volume, when the size of our data set may often be the least interesting part of an enterprise’s Big Data strategy, and may blind us to inferences that emerge from more thoughtful, human analysis of smaller sets of data. Which is perhaps Gartner analyst Svetlana Sicular’s point when she argues that “Organizations already have people who know their own data better than mystical data scientists,” and that those people simply need to be trained on the right tools, which increasingly means Hadoop and MongoDB. In other words, you’re in the right place.
Image courtesy of Wim Vandenbussche under a Creative Commons license. Tagged with: Big Data, Gartner, Data Variety, Data velocity

February 20, 2013

The 'middle class' of Big Data

So much is written about Big Data that we tend to overlook a simple fact: most data isn’t big at all. As Bruno Aziza writes in Forbes, “it isn’t so” that “you have to be Big to be in the Big Data game,” echoing a similar sentiment from ReadWrite’s Brian Proffitt. Large enterprise adoption of Big Data technologies may steal the headlines, but it’s the “middle class” of enterprise data where the vast majority of data, and money, is.

There’s a lot of talk about zettabytes and petabytes of data, but as EMA Research highlights in a new study, “Big Data’s sweet spot starts at 110GB and the most common customer data situation is between 10 to 30TB.” Small? Not exactly. But big? No, not really. Couple this with the fact that most businesses fall into the 20-500-employee range, as Intuit CEO Brad Smith points out, and it’s clear that the biggest market opportunity for Big Data is within the big pool of relatively small enterprises with relatively small data sets. Call it the vast middle class of enterprise Big Data. Call it whatever you want. But it’s where most enterprise data sits.

The trick is to first gather that data, and then to put it to work. A new breed of “data-science-as-a-service” companies like Metamarkets and Infochimps has arisen to lower the bar to culling insights from one’s data. While these tools can be used by enterprises of any size, I suspect they’ll be particularly appetizing to small-to-medium-sized enterprises, those that don’t have the budget or inclination to hire a data scientist. (This might be the right way to go, anyway, as Gartner highlights: “Organizations already have people who know their own data better than mystical data scientists.” What they really need is access to the data and tools to process it.)

Intriguingly, here at 10gen we’ve seen a wide range of companies, large and small, adopt MongoDB as they build out data-centric applications, but not always with Big Data in mind.
In fact, while MongoDB and Hadoop are top-of-mind for data scientists and other IT professionals, as Wikibon has illustrated, many of 10gen’s smaller customers and users aren’t thinking about Big Data at all. Such users are looking for an easy-to-use, highly flexible data store for their applications. The fact that MongoDB also has their scalability needs covered is a bonus, one that many will unlock later into their deployment when they discover they’ve been storing data that could be put to use.

In the RDBMS world, scale is a burden, particularly in terms of cost (bigger scale = bigger hardware = bigger license fees). Today, with NoSQL, scale is a given, allowing NoSQL vendors like 10gen to accentuate scalability with other benefits. It’s a remarkable turn of events for technology that emerged from the needs of the web giants to manage distributed systems at scale. We’re all the beneficiaries, including SMBs.

We don’t normally think about small-to-medium-sized businesses when we think of Big Data, but we should. SMBs are the workhorse of the world’s economies, and they’re quietly, collectively storing massive quantities of data. The race is on to help these companies put their comparatively small quantities of data to big use. It’s a race that NoSQL technologies like MongoDB are very well-positioned to win. Tagged with: MongoDB, big data, SMB, Hadoop, rdbms, Infochimps, Metamarkets, Gartner, Wikibon, data scientist

January 15, 2013

The data boom and NoSQL

Reading through Mary Meeker’s excellent 2012 KPCB Internet Trends Year-End Update, I was reminded of how critical NoSQL databases are to the present and future of application development. Even the most casual perusal of Meeker’s data indicates a critical need for new data stores that can handle Gartner’s 3 V’s of Big Data: velocity, volume, and variety. Importantly, as noted previously, such “V’s” aren’t restricted to some niche category of application. Going forward into the post-transactional future, the vast majority of applications will be better suited to a NoSQL database like MongoDB than to a legacy RDBMS. A few selections from Meeker’s slide deck explain why.

First, the types of devices churning out copious quantities of data are proliferating. Mobile devices powered by Apple’s iOS and Google’s Android now surpass Windows personal computers. This translates into huge new data sources, all of which must be stored somewhere.

Speaking of legacy, look at what has happened to communications, and how fast it happened. I remember back when it was cool to own a Motorola brick. Very few did, given the expense. But the point is not really that mobile phones eventually got to a price point at which they could compete with and then dominate the lowly landline, but rather how fast it happened. Rest assured, if mobile phones could unseat the landline in under 20 years, after the landline dominated for 125 years, the next wave will almost certainly take considerably less than 20 years to trounce the mobile phone. In this shift to mobile, and in subsequent shifts to other communication media, the variety, velocity, and volume of data will change dramatically. An RDBMS is simply incapable of handling such changes. How much data are we talking about? Meeker’s deck gives an answer.

Smart enterprises are turning to MongoDB to future-proof their applications.
Rather than relying on the rigid, fixed schema the RDBMS world requires, savvy developers are turning to über-flexible document databases. This is what The Guardian learned. The venerable UK news organization couldn’t adapt its business to embrace rich, dynamic content with its old-world relational database. By embracing MongoDB, The Guardian was able to embed user engagement into its services, and MongoDB also allows The Guardian to easily change its data model over time as business needs shift.

The European Organisation for Nuclear Research, or CERN, for its part, relies on MongoDB to aggregate data from a complex array of different sources. Because it depends on dynamic typing of stored metadata, CERN couldn’t rely on an RDBMS with a fixed schema:

“Given the number of different data sources, types and providers that DAS connects to, it is imperative that the system itself is data agnostic and allows us to query and aggregate the meta-data information in customisable way.”

As data sources proliferate for all organizations, the volume, velocity, and variety of data will increase, sometimes exponentially. An RDBMS will likely prove useful for some tasks, but for the applications that really drive one’s business? Those are going to be NoSQL and, more often than not, MongoDB.

- Posted by Matt Asay, vice president of Corporate Strategy. Tagged with: Big Data, Mary Meeker, mobile, landline, smartphone, NoSQL, MongoDB, Gartner, 3 V's, data velocity, data volume, data variety
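The data-agnostic aggregation CERN describes can be illustrated with a minimal sketch. To be clear, this is not DAS code and assumes nothing about CERN's actual systems; the source names and fields (`run-1042`, `events`, `site`) are invented. It merely shows the shape of the idea: metadata records from different providers, each with a different shape, merged per key without a fixed schema.

```python
# Hedged illustration only: merging schemaless metadata records from
# multiple hypothetical sources, keyed on a shared identifier.
from collections import defaultdict

def aggregate(records, key):
    """Merge metadata records that share the same value for `key`."""
    merged = defaultdict(dict)
    for rec in records:
        merged[rec[key]].update(rec)  # later sources add/override fields
    return dict(merged)

# Two hypothetical providers describing the same dataset differently.
source_a = [{"dataset": "run-1042", "events": 50000}]
source_b = [{"dataset": "run-1042", "site": "CERN-T0", "status": "valid"}]

combined = aggregate(source_a + source_b, key="dataset")
assert combined["run-1042"]["events"] == 50000
assert combined["run-1042"]["status"] == "valid"
```

Neither source had to agree on a schema in advance; the aggregator only assumes a shared key, which is the property that makes this kind of pipeline resilient as new providers appear.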

December 5, 2012

10gen's NoSQL Leadership Continues to Capture Influencer Attention

We're excited to announce that 10gen has been profiled in Gartner's “Cool Vendors in Information Infrastructure and Big Data, 2012” report (April 2012). This year's report, authored by analysts Merv Adrian, Donald Feinberg and W. Roy Schulte, draws attention to the key emerging vendors in information infrastructure who are tackling various aspects of “big data.” Gartner identifies MongoDB's high-performance features as the right solution for “... CIOs, IT architects, and developers looking for agile development and flexible schema design that enhances developer productivity.”

10gen President Max Schireson comments: “...We consider our inclusion in the Cool Vendor report by Gartner confirmation of our mission to provide cutting-edge NoSQL database technology to customers developing innovative large-scale applications and performing real-time 'big data analytics.'”

According to Gartner, NoSQL databases continue to be valuable for organizations looking to scale out cloud and on-premises uses of numerous content types. Merv Adrian praises MongoDB for its unique feature set, including flexible schema, full consistency, 'no-downtime auto-sharding', replica sets, and more.

For more information read the full press release. We will be posting Gartner's Cool Vendor report within the next week. Tagged with: gartner, Big Data, leadership, MongoDB, Mongo, NoSQL, Polyglot persistence, 10gen

April 27, 2012