Missing the trees for the forest in Big Data
Yes, Big Data is a big deal. No, it’s not the be-all, end-all. And it’s certainly not going to replace real people anytime soon.
Two articles serve to remind us of this.
The first is David Brooks’ excellent New York Times op-ed piece, entitled “What Data Can’t Do.” As cool as it is that the City of Chicago uses Big Data (and MongoDB) to improve services and lower crime, Brooks notes plenty of areas where data simply can’t solve our problems.
Among them:
- Data falls down on social analysis (“Computer-driven data analysis, on the other hand, excels at measuring the quantity of social interactions but not the quality”)
- Data struggles with context (“Data analysis is pretty bad at narrative and emergent thinking, and it cannot match the explanatory suppleness of even a mediocre novel”)
- Data creates bigger haystacks (Most “statistically significant correlations…are spurious and deceive us when we're trying to understand a situation”)
- Big data has trouble with big problems (“we've had huge debates over the best economic stimulus, with mountains of data, and as far as I know not a single major player in this debate has been persuaded by data to switch sides”)
- Data favors memes over masterpieces (A product that starts unpopular because it’s misunderstood won’t be divined through Big Data analysis)
- Data obscures values (“data is never raw; it's always structured according to somebody's predispositions and values”)
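The "bigger haystacks" point can be made concrete with a small simulation. What follows is only a rough sketch, not anything taken from Brooks' piece: the function names `pearson` and `count_spurious`, the 30-row sample size, and the 0.36 cutoff (roughly the conventional p < 0.05 threshold for a sample that small) are all assumptions chosen for illustration. Every column is pure random noise, so any "correlation" it finds is spurious by construction.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def count_spurious(n_columns, n_rows=30, threshold=0.36, seed=0):
    """Count noise columns whose correlation with a noise target exceeds the threshold.

    Hypothetical helper: the defaults are illustrative assumptions, not
    values from the article.
    """
    rng = random.Random(seed)
    # The "outcome" we are trying to explain is itself random noise.
    target = [rng.gauss(0, 1) for _ in range(n_rows)]
    hits = 0
    for _ in range(n_columns):
        # Each candidate "explanatory" column is also random noise.
        column = [rng.gauss(0, 1) for _ in range(n_rows)]
        if abs(pearson(column, target)) > threshold:
            hits += 1
    return hits


if __name__ == "__main__":
    for n_columns in (10, 100, 1000):
        hits = count_spurious(n_columns)
        print(n_columns, "noise columns ->", hits, "apparent correlations")
```

Because the share of false positives stays roughly constant, collecting more columns tends to produce more apparent findings, none of which mean anything. A bigger haystack, in other words, with no more needles in it.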
In sum, data analysis doesn’t replace the rough and messy human reading of the world. Some data is helpful, but too much data simply blinds us to our human response to the world. We become paralyzed by the growing data hoard, and forget to follow the hunches we have based on imperfect data.
Which brings us to the second article, Dave Mandl’s “Big Data: Why it’s not always that big nor even that clever” in The Register. Mandl’s basic point is that most businesses, most of the time, don’t need gargantuan quantities of net new data to make informed decisions.
“In the era of Big Data, larger-than-ever datasets are often cited as an issue that nearly everyone has to contend with, and for which the previous generation of tools is practically useless.
“But for the most part, [my Big Data peers] use Python scripts and C++. It's true that many huge data-consumers now make use of massively parallel architecture, clusters, and the cloud, but this move has been going on for more than a decade and, as my quant friend points out, ...people confuse doing things in the cloud with what you do in the cloud. Just because the data is in the cloud doesn't mean you're doing something different. Using distributed databases for speed and redundancy makes sense no matter what kind of work you're doing, given the ever-plummeting cost of hardware.”
As he concludes, “‘Data’ hasn't become the root of all evil overnight, any more than it's become the only thing that matters.”
Even for so-called “Big Data” projects, we generally aren’t talking about massive quantities of data. SAP polled its customer base (which skews toward large enterprises) and found that more than half have Big Data project data sets of between 100TB and 500TB, and 24% average only 10TB. Big? Sure. But really big? Not really.
That word, “big,” arguably causes more problems than it solves. Gartner’s definition of Big Data includes volume, variety, and velocity, yet we fixate on volume. The size of a data set may often be the least interesting part of an enterprise’s Big Data strategy, and fixating on it may blind us to inferences that emerge from more thoughtful, human analysis of smaller sets of data.
Which is perhaps Gartner analyst Svetlana Sicular’s point when she argues that “Organizations already have people who know their own data better than mystical data scientists,” and simply need to be trained on the right tools, which increasingly means Hadoop and MongoDB.
In other words, you’re in the right place.
Image courtesy of Wim Vandenbussche under a Creative Commons license.