Mike Olson On The Past And Future Of Data
Data today is very big, but not because any particular individual or company is creating lots and lots of data. Instead, we live in a new machine age, with a vast proliferation of machines emitting data in volumes and variety the world has never seen. As such, no single company will be big enough to tackle Big Data alone, declared Cloudera co-founder and chief strategy officer Mike Olson in his MongoDB World keynote this week in New York City.

There Is No Big Bang

Big Data isn't about Big Companies or other single sources of data. U.S. homes now hold over 500 million Internet-connected devices, an average of 5.7 per household, according to NPD. By 2017 each person will have 5 Internet-connected devices, with each one contributing to a torrent of data.

In the past, Olson indicated, we built big, centralized databases, which were good at managing data created at human scale. They were awesome for their generation. But they're simply not good enough for the world of machine-generated data, i.e., the world we live in now. These relational databases were designed for a world that didn't need to account for the incredible variety and petabyte scale of machine-generated data.

Google introduced us to a new, small, flexible, incremental architecture, which gave us a new way to think about hardware and software and, really, a new way to think about data. Google also gave us a new way of thinking about how to capture, store and analyze data. That "new way" is the cloud. As Olson stressed, data tends to remain where it was generated. Given that the vast majority of new data is created in or for the cloud, modern databases must also live in the cloud.

One Database To Rule Them All?

While it's tempting to think history will repeat itself and one company will dominate Big Data, such a zero-sum, winner-takes-all outcome is unlikely. The reason?
The power of a data hub or platform derives from its ability to collect data from small, disparate, loosely coupled systems, rather than from owning the Big Data "stack." Hence, while Olson at one time thought Cloudera, MongoDB and Teradata would fiercely compete to manage the same data, the reality is that the three companies now work closely together to take care of data at every point in the data lifecycle.

Big Data is not zero sum. It's not created by any single entity, and it can't be controlled by any single entity. A community is required. As both Olson and MongoDB CEO Max Schireson insisted in their respective keynotes, that community includes Cloudera and MongoDB working together to solve customers' biggest Big Data problems.

To see all MongoDB World presentations, visit the [MongoDB World Presentations](https://www.mongodb.com/mongodb-world/presentations) page.
MongoDB And Teradata Join Forces To Make Big Data Smart
As enterprises increasingly depend on MongoDB to build and run modern applications, they need high-quality analytics solutions to match MongoDB's powerful data model. With the partnership Teradata and MongoDB just announced, they just got one. And it's exceptionally cool.

With data analytics leader Teradata, we've built a bi-directional connector that gives organizations interactive data processing at extremely fast speeds. Teradata's bi-directional QueryGrid connector allows Teradata customers to integrate massive volumes of JSON with cross-organizational data in the data warehouse for high-performance analytics. Through the connector, MongoDB customers will have access to JSON that has been enriched by Teradata to support rapidly evolving applications for mobile, Internet of Things, eCommerce, social media and other uses. In other words, users will soon be able to easily connect MongoDB applications with analytics running on Teradata.

The Future Is JSON

For the past 40 years, enterprises have stored their data in the tidy-but-rigid tables and joins of relational databases. Given the explosion of unstructured data, however, enterprises need a more expressive, flexible way of describing and storing data. Enter JSON. MongoDB stores data in JSON documents, which we serialize to BSON. JSON provides a rich data model that seamlessly maps to native programming language types, and the dynamic schema makes it easier to evolve one's data model than with a system that enforces schemas, like a relational database (RDBMS). Marrying MongoDB's operational database with Teradata's analytics platform is a great way to bring together all of an enterprise's data.

A Virtuous Cycle

One way of thinking about the interaction between MongoDB and Teradata is to picture a crowd of people. MongoDB interacts with individuals within the crowd in real time, while Teradata looks for patterns within the crowd.
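The dynamic schema mentioned above can be sketched with plain Python dicts, the structures MongoDB serializes to BSON. This is a minimal, illustrative sketch: the customer fields are invented for the example, and no database connection is involved.

```python
# Two "customer" documents with different shapes can live in the same
# MongoDB collection; no ALTER TABLE is needed when the model evolves.
# Shown here as plain Python dicts, the structures MongoDB stores as BSON.
customer_v1 = {"name": "Ada", "email": "ada@example.com"}
customer_v2 = {
    "name": "Grace",
    "email": "grace@example.com",
    "devices": [  # a field added later, present only on newer records
        {"type": "phone", "os": "iOS"},
        {"type": "tablet", "os": "Android"},
    ],
}

def fields(doc):
    """Return the top-level field names of a document."""
    return set(doc)

# Older documents simply lack the new field; queries handle both shapes.
assert "devices" not in fields(customer_v1)
assert "devices" in fields(customer_v2)
```

In an RDBMS, the `devices` column would have to be added to every row (or split into a join table) before the first new-style record could be stored; in a document model the two shapes coexist.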
With this connector, organizations can push their MongoDB data (website clicks, purchases, etc.) into Teradata, which runs queries against the data, looking for patterns. This intelligence is then pushed back to MongoDB, enriching the interaction with individual eCommerce buyers, mobile users and others. It's a virtuous cycle, as Teradata describes on its blog. For an eCommerce application, bringing the two together means the vendor's interactions with its customers will continuously improve as its MongoDB-based application gets smarter and more tailored by Teradata analytics. Importantly, for enterprises that expect to use both relational databases and MongoDB, Teradata's JSON integration unifies relational and MongoDB data analysis.

And, Not Or

This last point is worth repeating. As much as enterprises might wish to shed their IT investments and start over, the reality is that they can't and won't, as a 2012 Gartner analysis found. By giving organizations an easy way to connect MongoDB's operational data with Teradata's enterprise data warehouse, the two companies ensure existing and new data sources can coexist. By working closely together, MongoDB and Teradata give enterprises the best of a modern operational database combined with a powerful analytics platform.
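The virtuous cycle described above can be sketched in a few lines of Python. This is purely conceptual: the function names are hypothetical stand-ins, not part of any Teradata or MongoDB API. Operational events flow out for batch analysis, and the resulting patterns flow back to enrich individual interactions.

```python
def analyze_crowd(events):
    """Stand-in for the warehouse-analytics step ("Teradata looks for
    patterns within the crowd"): count purchases per product."""
    counts = {}
    for e in events:
        if e["action"] == "purchase":
            counts[e["product"]] = counts.get(e["product"], 0) + 1
    return counts

def recommend(patterns):
    """Stand-in for the enriched intelligence pushed back to the
    operational side: pick the most-purchased product."""
    return max(patterns, key=patterns.get) if patterns else None

# Operational events, as an application might record them in MongoDB.
events = [
    {"user": 1, "action": "click", "product": "shoes"},
    {"user": 2, "action": "purchase", "product": "shoes"},
    {"user": 3, "action": "purchase", "product": "hat"},
    {"user": 4, "action": "purchase", "product": "shoes"},
]
patterns = analyze_crowd(events)   # the analytics half of the cycle
top_pick = recommend(patterns)     # pushed back to enrich interactions
assert top_pick == "shoes"
```

In the real integration, the QueryGrid connector moves the data between the two systems; the point of the sketch is only the shape of the loop.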
You Know What's Cool? 1 Trillion Is Cool
A million used to be cool. Then Facebook upped the ante to one billion. But in our world of Big Data, even a billion is no longer the upper end of scale, or of cool. As I learned last night, at least one MongoDB customer now stores over 1 trillion documents in MongoDB.

1 trillion. That's cool.

It's also far bigger than any other database deployment I've seen from any NoSQL or relational database, even from the simple key-value or columnar data stores that are designed to handle only simple workloads, but to scale them well. That's what makes MongoDB über cool: not only does it offer dramatic, superior scale, but it does so while also giving organizations the ability to build complex applications. MongoDB delivers an optimal balance between functionality and performance.

Many systems are focused on nothing more than storing your data and letting you access it quickly, but in one and only one way. This simply isn't enough. A truly modern database must support rich queries, indexing, analysis, aggregation, geospatial access and search across multi-structured, rapidly changing data sets in real time. The database must not trap your data and hinder its use. It must unleash your data. All 1 trillion documents of it.

Want to see how major Global 2000 organizations like Bosch, the U.S. Department of Veterans Affairs, Genentech, Facebook and many others scale with MongoDB? Easy. Just register to attend MongoDB World, June 24-25 in New York City. You can use my discount code to get 25% off: 25MattAsay.
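To make the "rich queries and aggregation" claim concrete, here is a sketch of a MongoDB-style aggregation pipeline, written as plain Python data so it runs without a server. The stage operators ($match, $group, $sum, $sort, $limit) are real MongoDB aggregation stages; the collection and field names are invented for illustration.

```python
# A pipeline that filters, aggregates, ranks and truncates in one query -
# the kind of in-database analysis a simple key-value store cannot express.
pipeline = [
    {"$match": {"status": "active", "country": "US"}},   # filter documents
    {"$group": {                                          # aggregate
        "_id": "$product",
        "total_sales": {"$sum": "$amount"},
    }},
    {"$sort": {"total_sales": -1}},                       # rank by revenue
    {"$limit": 10},                                       # keep the top ten
]

# With a live connection this would run as, roughly:
#   db.orders.aggregate(pipeline)
# Here we just check the pipeline's structure.
assert [list(stage)[0] for stage in pipeline] == [
    "$match", "$group", "$sort", "$limit"
]
```

The database does the filtering, grouping and ranking where the data lives, instead of shipping a trillion documents to the application to be processed one fetch at a time.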
Big Blue Understands Big Data Is Often Little But Moves Fast
IBM has invested a great deal of money to harness massive volumes of data. Yet it's telling that in a post about Big Data today, the company chooses to highlight the even greater importance of the velocity of data:

True innovators are finding value in even the smallest bytes of data that move very rapidly into and out of the organization. That's because most organizations will overlook these opportunities, wrongly thinking that because data moves too quickly and can't be stored, there is no way to analyze it. Analyzing data in motion and capitalizing in the moment is the secret to success in the era of big data.

This is where stream computing comes into play. Stream computing changes where, when and how much of your business data you can analyze. By extracting insight from data as it is in motion, you can react to events as they are happening to reshape business outcomes. Store less, analyze more, and make better decisions, faster. From increased customer retention to earlier fraud detection to more frequent cross-selling, the benefits of stream computing are many.

While MongoDB does a great job with data in copious quantities, arguably the better reason to use MongoDB is its ability to process streaming data. We're therefore glad to be working with IBM's InfoSphere team on ways to protect such sensitive, fast-moving corporate data.
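The "store less, analyze more" idea can be illustrated with a tiny sliding-window analysis in Python. This is a conceptual sketch only, not IBM InfoSphere code: the window size, threshold and fraud scenario are all invented for the example.

```python
from collections import deque

WINDOW = 5      # how many recent transactions to keep in memory
THRESHOLD = 3   # flag if this many high-value charges land in the window

def detect_bursts(transactions, high_value=100):
    """Analyze a stream of transaction amounts as they arrive, keeping
    only a small sliding window rather than storing everything."""
    recent = deque(maxlen=WINDOW)   # old data falls out automatically
    alerts = []
    for amount in transactions:
        recent.append(amount)
        if sum(1 for a in recent if a >= high_value) >= THRESHOLD:
            alerts.append(list(recent))
    return alerts

# Three large charges in quick succession trigger an alert, yet nothing
# is ever stored beyond the five most recent transactions.
stream = [20, 150, 15, 200, 180, 10]
assert len(detect_bursts(stream)) >= 1
```

The point is the trade the quoted passage describes: insight is extracted while the data is in motion, and only a bounded window is ever held, so velocity stops being a reason not to analyze.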
Big Data Is The New Normal
Now Big Data has even won a Gartner seal of approval, so to speak, with the publication of a new report that says Big Data is on the fast track to maturity and that by 2016 it will just be data. The huge idea here is that as information goes cross-platform and vaults up in volume, very shortly Big Data will become another norm of how business gets done. By no means will all data be Big Data - there of course is real value in certain data warehouses that are quite small and structured. But there also is the fact that, suddenly, business realizes it is awash in data, and it knows that if only it can harness the insights derived from the information already on hand or soon to be, great value will ensue.

Big data transforming bricks + mortar retailing

Think about the recent New York Times story about how pioneering retailers are tracking customer behavior by monitoring cellphone signals - WiFi in particular. The story triggered significant teeth-gnashing by privacy proponents, and these concerns are understandable. But the reality is that e-commerce players already have enormous tracking data on their customers, and it stands to reason that, finally, bricks-and-mortar retailers would want to level the playing field. Big data plus smartphones is an equation that works. Put the privacy debate aside. Think simply about the information flow, its volume and the insights it would give retailers. That is very big data indeed, and the goal would be mashing it up and then reinventing store layout and product placement - in effect, making it easier for consumers to find and buy what they came into the store for.

Really knowing the customer

Probably the gold standard for pursuing Big Data is Netflix, which is well known for gathering information not only about what its customers watch but also about how they watch it - where do they fast-forward, where do they rewind, when do they simply turn off a film and never return?
But then Netflix goes farther with its data, per reporting in SiliconANGLE: "[Netflix] actually [is] putting that data to use. Netflix has begun to produce its own original TV shows, and to do so its leveraging all of its data to do it. Netflix used its data to decide that the BBC's 'House of Cards' was the best fit for a remake, and its data also correlated fans of the original to fans of actor Kevin Spacey and director David Fincher, which in turn was what led to them being hired."

Think about the power there: Big Data is driving complex decisions and, apparently, it is helping get closer to what consumers really want.

The maturation of Big Data

A safe bet is that such stories - revolutionary as they sound in 2013 - will seem commonplace within a very few years because, right now, the ingredients are all coming together for a flowering of Big Data into an everyday business tool, and probably 2016, as Gartner predicts, is as good a guess as any. Certainly it will become more commonplace in many more businesses very soon - about now, in fact - as the first-generation pioneers, with their massive data stores and new analytical tools for rapidly making sense of them, start to enjoy differentially superior results. They are demonstrating that Big Data works, period. It's no longer a computer science project; it's becoming just plain business.

The irony is that when Big Data becomes humdrum - when it loses its buzzword status - that is when it genuinely will have solidified its role as a transformational information utility. By, say, 2020 we will look back and be puzzled at how organizations arrived at their marketing and design decisions without Big Data. It will seem every bit as puzzling as, say, how organizations maintained customer data before CRM (can you say 3" x 5" card?). From where we sit in 2013, that future seems distant indeed. But it also is closer than we think.
The Big Data Hoax That Wasn't
Welcome to the Age of Big Data. Or perhaps it's the Age of Big Data Agnosticism. In a Newtonian twist, what started as a wave of hype for data's transformational potential has turned into an equal and opposite backlash of big data naysaying. It is an understandable reaction to the great over-selling of big data as a kind of enterprise cure-all. Of course, in some companies, big data pilots have produced nothing but big piles of unfulfilled expectations. But the problem likely is not big data. Big data remains potentially the most powerful engine for business transformation to gain currency in the 21st century. The problem is that so much of what is sold as big data isn't. It's typically just lots of data.

"Big data, that's just data mining with a fancy new name." How often have you heard that? It's flatly false. The size or volume of the data is not what matters in genuine big data analytics. Instead, savvy organizations already understand that big data is really about working with a mix of data types - structured and unstructured, from inside the organization and outside. It is CRM forms, but it also is Tweets, Facebook posts, TripAdvisor rants, Gmail messages, Outlook entries, even voicemail. In most organizations this does not add up to petabytes of data, as I've written before. Terabytes is the usual quantity, even though that seems small by many measures. The complexity arises in the diversity of data.

And that raises a problem. Not many databases have the flexibility to handle that many forms of data. And fewer databases have the agility to permit modifications on the fly - "Shouldn't we add SMS data in here, too?" The right answer is: done. A database that cannot, with little fuss, add a new field is too rigid for use in true big data analysis, because the exciting - maybe maddening? - bit about big data today is that there is always new input that may enhance the overall result.
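The "add SMS data in here, too" scenario can be sketched in a few lines. This is an illustrative sketch with a document model in mind: the record layout and field names are invented, and no particular database is assumed.

```python
# A customer profile mixing structured and unstructured sources -
# CRM fields alongside social content - in a single document.
profile = {
    "customer_id": 42,
    "crm": {"segment": "retail", "lifetime_value": 1200},
    "tweets": ["love the new store layout"],
}

# A new data source appears. With a flexible schema, the answer to
# "shouldn't we add SMS data in here, too?" is: done. Just attach it.
profile["sms"] = [
    {"ts": "2013-07-01T10:00", "body": "Is my order ready?"},
]

assert "sms" in profile           # no migration, no downtime
assert len(profile["sms"]) == 1
```

In a rigid schema, the same change would mean a migration across every existing record before the first SMS message could be stored.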
Then there are the other questions: Why are you collecting big data in the first place? What do you want from your analysis of it? That second question is key because, without targeted analytics, big data is just hoarding. As an insightful story in The Guardian recently posited, "Companies need to focus on big answers not big data. Instead of focusing upon the concept of big data, organizations should concentrate on the intelligence data can offer." In other words, it's not about the data; it's about what intelligence can be drawn from it.

The Guardian author calls himself a "big data sceptic" but, really, he isn't. He just shares the frustration over the many mislabeled big data projects - which never were about big data - and over the data hoarding some companies do when they say they are committing to big data. Such projects rarely end well.

Real big data - unstructured, from multiple sources - coupled with real analytics is a game changer that gives forward-thinking organizations insight where before there was merely guesswork. One Texas city ran analyses to determine exactly what happened in parts of the city that experienced higher-than-anticipated growth and a resulting increase in value. This was true big data. In the mix were police reports, zoning violations, construction permits, parking tickets, you name it. If the data existed, it was fed into the analysis, and the city began to see what it did - and didn't do - to spur growth. Where could it get out of the way? Where could it proactively spur growth? It was real big data in action. And it's why big data remains a big deal, despite the hype.
Born for Big Data
I recently spoke on an interesting panel at Red Hat Summit entitled "Big Data and Traditional Databases," along with representatives from the MySQL, PostgreSQL and Sybase/SAP communities. At one point, the moderator asked, "What features are you adding to your roadmap to make your database ready for Big Data?" I was stumped. Unlike the relational databases, MongoDB was born as a Big Data database.

This isn't a secret. Whether measured by how IT professionals talk about MongoDB on Twitter, by Big Data-related job postings or by other means, MongoDB is consistently rated one of the industry's top two Big Data technologies.

Which isn't to suggest that 10gen and the MongoDB community are sitting still. We've been adding Big Data-relevant functionality like full-text indexing, and we continue to improve the already great performance, among other things. But the essential ability to handle a wide variety of data types and sources, in real time, at significant scale? That comes ready out of the box.

10gen has never intended for MongoDB to be solely a "Big Data database." That would be too narrow. MongoDB is a general-purpose database, designed for and comfortable managing a broad and growing array of applications. Some will argue that all applications will be Big Data applications going forward. Perhaps. Regardless, MongoDB has it covered.
Big Data Is A Matter Of Speed, Not Size
Finally the market is getting over its initial Big Data fixation. Unfortunately, in the process we may be inclined to throw away the Big Data signal in an attempt to rid ourselves of all the noise.

The Guardian's John Burn-Murdoch highlights this today, asserting that "'small data' - or data of the volumes most regular analysts, researchers and statisticians are used to dealing with - is actually both more relevant and more useful to the vast majority of organisations than its big cousin." He concludes, "[I]t is speed, not size that is increasingly driving desire for software and hardware improvements at data-processing organisations."

While we talk about Big Data, the reality is that there is a much more important trend going on in data generally, as Rufus Pollock, Founder and Co-Director of the Open Knowledge Foundation, captures: "[W]e risk overlooking the much more important story here, the real revolution, which is the mass democratisation of the means of access, storage and processing of data. This story isn't about large organisations running parallel software on tens of thousand of servers, but about more people than ever being able to collaborate effectively around a distributed ecosystem of information, an ecosystem of small data."

Now if only we could get everyone else to recognize this essential truth, so we could stop admiring how very big all our data is and instead focus on actually putting it to work in time for it to be useful to us.
Data Scientist Shortage? There's An App For That
Big Data is all the rage, but apparently it will come to a crashing halt due to a shortage of data scientists. As I've argued elsewhere, this is mostly a sham. Context is critical for making use of a company's data, and the people with context already work for the enterprise. So it becomes a matter of training the people one has, rather than going off on a scouting trip for the mythical data scientist.

Nor will the "science" of Big Data remain such for long, according to IBM's James Kobielus. As he notes, "core data scientist aptitudes -- curiosity, intellectual agility, statistical fluency, research stamina, scientific rigor, skeptical nature -- are widely distributed throughout workforces everywhere." He then points to a few key trends that will make data science less of a science:

As more data discovery, acquisition, preparation and modeling functions are automated through better tools, today's data scientists will have more time for the core of their jobs: statistical analysis, modeling and interaction exploration.

Data scientists are developing fewer models from scratch. That's because more and more big data projects run on application-embedded analytic models integrated into commercial solutions....

Open source communities and tools will greatly expand the pool of knowledgeable, empowered data scientists at your disposal, either as employees or partners.

This jibes with Cloudera CEO Mike Olson's contention that "There will be enormous Hadoop adoption, but you'll get it by virtue of the applications you run." But whether an organization interprets its data through applications or directly using open-source technologies, one thing remains true in all this: people are critical to making sense of Big Data. The data won't speak for itself. It's therefore critical to find people inside one's organization who can help make sense of the organization's data. The good news? They're already available and on the payroll.