Do Hadoop and the Cloud Mean the End of the Closed Data Warehouse?


The data warehouse business has thrived for at least a decade. Almost every sizable company has found itself awash in data it could not bring itself to part with, and has been willing to pay a lot to have that data stored, organized, and analyzed. But in an era of open tools such as Hadoop, plus a push to lower costs by using commodity hardware, has the clock started to tick down on closed data warehouses?

Data warehouses are fairly simple. They are built to accommodate a business’s data, often its structured data, although with a little tweaking unstructured data can be stored in them too.

Everything is usually kept inside a relational database and, as always with relational databases, the upside is speed of processing. The downsides are some inflexibility about the kinds of data that can be handled, along with heavy demands for costly hardware to run the SQL databases.

But as businesses look to cut costs, particularly with open source, and as commodity servers in the cloud become a new standard for running IT operations, this basic structure is beginning to be questioned.

Over the long term, private warehouses will continue to thrive, but they will likely open themselves up to manipulation by open tools such as Hadoop and NoSQL databases. It is increasingly said that the enterprise needs a new data fabric that accommodates the growing diversity of data types. Hybrid arrangements, where relational databases sit alongside NoSQL databases, will likely emerge as solutions, at least in the short term.

But the other reality is that ours is an era of humongous data, really, really big data, and inevitably that will break many, though perhaps not all, data warehouses as presently conceived. What is keenly needed is data that is both big and fast. Speed is not always easy to achieve with Hadoop, which was designed for batch processing, but there is a stampede of plug-ins and tweaks designed to coax Hadoop into fast, real-time processing.

MongoDB has already announced integration with Hadoop that allows MongoDB data sets to be manipulated by Hadoop as well. A plus is that data set size, for all practical purposes, ceases to be much of an issue. However big the data, bring it on.

The key is that Hadoop and MongoDB are fundamentally very different. MongoDB was written for data storage and retrieval, not processing, while Hadoop was written to process data, particularly massive data sets. But those differences are why they can work together.

Processing can be done in Hadoop, with the results then sent to MongoDB for storage, building on the strengths of each. And those are exactly the kinds of strengths needed in a world where the new normal is really big data.
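To make that division of labor concrete, here is a minimal sketch in plain Python. It is not Hadoop or MongoDB code: the map and reduce functions only imitate the shape of a MapReduce job, and the final step merely formats the output as the kind of JSON-like documents a document store holds. The sample records, field names, and function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical input records; a real job would read these from a cluster.
EVENTS = [
    {"customer": "acme", "amount": 120.0},
    {"customer": "globex", "amount": 75.5},
    {"customer": "acme", "amount": 30.0},
]


def map_phase(records):
    """Emit one (key, value) pair per record, as a mapper would."""
    for record in records:
        yield record["customer"], record["amount"]


def reduce_phase(pairs):
    """Group the pairs by key and sum the values, as a reducer would."""
    totals = defaultdict(float)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def to_documents(totals):
    """Shape the reduced output as JSON-like documents for a document store."""
    return [{"_id": key, "total": total} for key, total in sorted(totals.items())]


# Processing step first, storage-ready documents second.
documents = to_documents(reduce_phase(map_phase(EVENTS)))
print(documents)
```

The heavy lifting happens in the map and reduce steps, and the storage side only ever sees small, already-summarized documents. That is the sense in which each tool plays to its strength.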

Ask yourself this: do you envision your data warehousing needs diminishing, holding steady, or growing?

If the last, get to know Hadoop because it will become a part of your life. At least you will hope it does.