Data Variety


Missing the trees for the forest in Big Data

Yes, Big Data is a big deal. No, it’s not the be-all, end-all. And it’s certainly not going to replace real people anytime soon. Two articles serve to remind us of this.

The first is David Brooks’ excellent New York Times op-ed, “What Data Can’t Do.” As cool as it is that the City of Chicago uses Big Data (and MongoDB) to improve services and lower crime, Brooks notes plenty of areas where data simply can’t solve our problems. As he notes:

- Data falls down on social analysis (“Computer-driven data analysis, on the other hand, excels at measuring the quantity of social interactions but not the quality”)
- Data struggles with context (“Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel”)
- Data creates bigger haystacks (Most “statistically significant correlations…are spurious and deceive us when we're trying to understand a situation”)
- Big data has trouble with big problems (“we've had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides”)
- Data favors memes over masterpieces (A product that starts out unpopular because it’s misunderstood won’t be divined through Big Data analysis)
- Data obscures values (“data is never raw; it's always structured according to somebody's predispositions and values”)

In sum, data analysis doesn’t replace the rough and messy human reading of the world. Some data is helpful, but too much data simply blinds us to our human response to the world. We become paralyzed by the growing data hoard and forget to follow the hunches we form from imperfect data.

Which brings us to the second article, Dave Mandl’s “Big Data: Why it’s not always that big nor even that clever” in The Register. Mandl’s basic point is that most businesses, most of the time, don’t need gargantuan quantities of net new data to make informed decisions. He writes:

In the era of Big Data, larger-than-ever datasets are often cited as an issue that nearly everyone has to contend with, and for which the previous generation of tools is practically useless. But for the most part, [my Big Data peers] use Python scripts and C++. It's true that many huge data-consumers now make use of massively parallel architecture, clusters, and the cloud, but this move has been going on for more than a decade and, as my quant friend points out, ...people confuse doing things in the cloud with what you do in the cloud. Just because the data is in the cloud doesn't mean you're doing something different. Using distributed databases for speed and redundancy makes sense no matter what kind of work you're doing, given the ever-plummeting cost of hardware.

As he concludes, “‘Data’ hasn't become the root of all evil overnight, any more than it's become the only thing that matters.”

Even for so-called “Big Data” projects, we generally aren’t talking about massive quantities of data. SAP polled its customer base (which skews toward large enterprises) and found that more than half have data set sizes (for Big Data projects) of between 100TB and 500TB, and 24% average only 10TB. Big? Sure. But really big? Not really. That word, “big,” arguably causes more problems than it solves.
Despite Gartner including volume, variety, and velocity in its definition of Big Data, we fixate on volume. Yet the size of our data set is often the least interesting part of an enterprise’s Big Data strategy, and fixating on it can blind us to inferences that emerge from more thoughtful, human analysis of smaller sets of data. That is perhaps Gartner analyst Svetlana Sicular’s point when she argues that “Organizations already have people who know their own data better than mystical data scientists”; they simply need to be trained on the right tools, which increasingly means Hadoop and MongoDB. In other words, you’re in the right place.

Image courtesy of Wim Vandenbussche under a Creative Commons license.

Tagged with: Big Data, Gartner, Data Variety, Data velocity

February 20, 2013

The data boom and NoSQL

Reading through Mary Meeker’s excellent 2012 KPCB Internet Trends Year-End Update, I was reminded of how critical NoSQL databases are to the present and future of application development. Even the most casual perusal of Meeker’s data indicates a critical need for new data stores that can handle Gartner’s 3 V’s of Big Data: velocity, volume, and variety. Importantly, as noted previously, such “V’s” aren’t restricted to some niche category of application. Going forward into the post-transactional future, the vast majority of applications will be better suited to a NoSQL database like MongoDB than to a legacy RDBMS. A few selections from Meeker’s slide deck explain why.

First, the types of devices churning out copious quantities of data are proliferating. Mobile devices powered by Apple’s iOS and Google’s Android now surpass Windows personal computers. This translates into huge new data sources, all of which must be stored somewhere.

Speaking of legacy, look at what has happened to communications, and how fast it happened. I remember back when it was cool to own a Motorola brick. Very few did, given the expense. But the point is not really that mobile phones eventually got to a price point where they could compete with, and then dominate, the lowly landline, but rather how fast it happened. Rest assured, if mobile phones could unseat the landline in under 20 years, after the landline dominated for 125 years, the next wave will almost certainly take considerably less than 20 years to trounce the mobile phone. In this shift to mobile, and in subsequent shifts to other communication media, the variety, velocity, and volume of data will change dramatically. An RDBMS is simply incapable of handling such changes.

How much data are we talking about? Meeker’s numbers say: far more than ever, and growing fast. Smart enterprises are turning to MongoDB to future-proof their applications. Rather than relying on the rigid, fixed schemas the RDBMS world requires, savvy developers are turning to document databases, whose flexible schemas can evolve with the application (as the short sketch below illustrates). This is what The Guardian learned. The venerable UK news organization couldn’t adapt its business to embrace rich, dynamic content with its old-world relational database. By embracing MongoDB, The Guardian was able to embed user engagement into its services, and it can now easily change its data model over time as business needs shift.

The European Organisation for Nuclear Research, or CERN, for its part, relies on MongoDB to aggregate data from a complex array of different sources. Because it depends on dynamic typing of stored metadata, CERN couldn’t rely on an RDBMS with a fixed schema:

Given the number of different data sources, types and providers that DAS connects to, it is imperative that the system itself is data agnostic and allows us to query and aggregate the meta-data information in customisable way.

As data sources proliferate for all organizations, the volume, velocity, and variety of data will increase, sometimes exponentially. An RDBMS will likely prove useful for some tasks, but for the applications that really drive one’s business? Those are going to be NoSQL and, more often than not, MongoDB.

- Posted by Matt Asay, vice president of Corporate Strategy.

Tagged with: Big Data, Mary Meeker, mobile, landline, smartphone, NoSQL, MongoDB, Gartner, 3 V's, data velocity, data volume, data variety
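The flexible-schema point above is easier to see in code. What follows is a minimal sketch, not drawn from The Guardian’s or CERN’s actual systems: it assumes only a MongoDB server on localhost and the pymongo driver, and the database, collection, and field names are hypothetical. It simply shows documents with different shapes living in one collection and being aggregated together.

```python
# Minimal sketch of MongoDB's flexible document model (hypothetical names),
# assuming a MongoDB server on localhost and the pymongo driver.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")
content = client["newsroom"]["content"]  # made-up database and collection

# Documents in the same collection can have different shapes, so the data
# model can evolve without an ALTER TABLE-style migration.
content.insert_one({"source": "cms", "type": "article", "title": "Budget report"})
content.insert_one({
    "source": "liveblog",
    "type": "article",
    "title": "Election night",
    # A new field appears without any schema change on the server side.
    "comments": [{"user": "reader42", "text": "Great coverage"}],
})

# Heterogeneous documents can still be queried and rolled up together,
# e.g. counting documents per source with an aggregation pipeline.
pipeline = [{"$group": {"_id": "$source", "count": {"$sum": 1}}}]
for row in content.aggregate(pipeline):
    print(row)
```

That same property is roughly what the DAS quote above describes: documents from different providers can carry different metadata fields and still be queried and aggregated with a single pipeline.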

December 5, 2012