Hadoop-based data lakes are enabling enterprises and governments to efficiently capture and analyze unprecedented volumes of data generated by their digital transformation initiatives. But without a way to expose that data to operational applications, users struggle to maximize returns on their Hadoop investments. The longer it takes to surface insight to operational processes, the less valuable that insight is, and the less competitive the business becomes.
In this 3-part blog series, we’re going to cover:
- Part 1: The rise of the data lake, the role of Hadoop, and the challenges of integrating the data lake with operational applications
- Part 2: The critical capabilities you need to evaluate in an operational database for your data lake, and a recommended design pattern for integrating the database with the data lake
- Part 3: Real-world examples and best practices from industry leaders
If you want to get a head start and learn about all of these topics now, just go ahead and download the Operational Data Lake white paper.
The Rise of the Data Lake
The one thing no business lacks today is data – from streams of sensor readings, to social sentiment, to machine logs, mobile apps, and more. Analysts estimate data volumes are growing at 40% per annum, with 90% of it unstructured. Collecting and analyzing this data to uncover new insights carries the promise of competitive advantage and efficiency savings. However, the traditional Enterprise Data Warehouse (EDW) is straining under the load, overwhelmed by the sheer volume and variety of data pouring into the business and unable to store it cost-efficiently. As a result, many organizations have turned to Hadoop as a centralized repository for this new data, creating what many call a data lake.
With its ability to store data of any structure without a predefined schema and to scale out on commodity hardware, Hadoop provides levels of performance, efficiency, and low Total Cost of Ownership (TCO) unmatched by the EDW.
The Hadoop Distributed File System (HDFS) is designed for large-scale batch processing. It provides a write-once, read-many, append-only storage model for unindexed data, with files split into large blocks (128MB by default), and is optimized for long-running, sequential scans across TBs and PBs of data.
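To illustrate that storage model, here is a minimal sketch using a plain local file as a stand-in for HDFS (not the actual HDFS API): records can only be appended, and reading them back means a full sequential scan, because there is no index and no in-place update.

```python
import tempfile

# Stand-in for an HDFS file: write once, append new records only.
path = tempfile.mkstemp()[1]
with open(path, "a") as f:
    for i in range(1000):
        f.write(f"event-{i},ok\n")

# Finding one record means scanning every record sequentially --
# the access pattern HDFS is optimized for, and the reason it cannot
# serve millisecond point lookups.
matches = [line for line in open(path) if line.startswith("event-42,")]
print(matches[0].strip())  # event-42,ok
```

This sequential-scan cost is negligible for a batch job reading TBs of data end to end, but prohibitive for an online application fetching a single record.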
This makes Hadoop incredibly powerful at mining large swaths of multi-structured data to create analytics that companies can use to better inform their business. Example outputs can include:
- Customer segmentation models for marketing campaigns and eCommerce recommendations.
- Churn analysis for customer service representatives.
- Predictive analytics for fleet maintenance and optimization.
- Risk modeling for security and fraud detection.
These types of models are typically built from Hadoop queries executed across the data lake, with latencies ranging from minutes to hours. However, the data lake, which excels at generating new forms of insight from diverse data sets, is not designed to provide real-time access to operational applications. Users need to make Hadoop's analytic outputs available to their online, operational apps. These applications have specific access demands that HDFS cannot meet, including:
- Millisecond query latency.
- Random access to indexed subsets of data.
- Expressive ad-hoc queries and aggregations against the data, making online applications smarter and contextual.
- Real-time updates to fast-changing data as users interact with online applications, without rewriting the entire data set.
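These access patterns can be made concrete with a small sketch. SQLite is used here purely for illustration (the table, column names, and data are hypothetical); the point is the operational-database capabilities themselves: indexed point lookups, ad-hoc aggregation, and in-place updates.

```python
import sqlite3

# In-memory store standing in for an operational database serving
# analytic outputs (e.g., customer segments computed in the data lake).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE segments (customer_id TEXT, segment TEXT, score REAL)")
db.executemany(
    "INSERT INTO segments VALUES (?, ?, ?)",
    [(f"cust-{i}", f"segment-{i % 10}", i / 1000.0) for i in range(100_000)],
)

# Random access to an indexed subset: fetch one customer's segment by key.
db.execute("CREATE INDEX idx_customer ON segments (customer_id)")
row = db.execute(
    "SELECT segment FROM segments WHERE customer_id = ?", ("cust-42",)
).fetchone()
print(row[0])  # segment-2

# Expressive ad-hoc aggregation against the same data.
top = db.execute(
    "SELECT segment, COUNT(*) FROM segments GROUP BY segment "
    "ORDER BY 2 DESC LIMIT 1"
).fetchone()

# Real-time update to fast-changing data, without rewriting the data set.
db.execute("UPDATE segments SET score = 0.99 WHERE customer_id = 'cust-42'")
```

The index is what turns the point lookup from a full table scan into a millisecond operation, and the `UPDATE` is exactly the in-place mutation that an append-only file system cannot offer.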
In our data-driven world, milliseconds matter. In fact, IBM research found that 60% of data loses its value within milliseconds of being generated. For example, what is the value in identifying a fraudulent transaction minutes after the trade was processed? Furthermore, Gartner analysts predict that 70% of Hadoop deployments will fail to meet cost-savings and revenue-generation objectives due to skills and integration challenges.
Generating and serving analytics from the data lake to online applications and users in real time can address these challenges, but it demands the integration of a highly scalable, highly flexible operational database layer. Ultimately, the companies that win in the future will not be those with the largest data lakes. Rather, it will be those fastest at acting on the insights and intelligence that data creates. Operational databases are essential to executing on the data lake vision.
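At a high level, that integration pattern looks like the sketch below: a batch job in the lake writes its analytic output, and a loader refreshes an operational store that online applications query by key. The file path, field names, and in-memory dict are illustrative stand-ins for HDFS and a real operational database.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for a batch analytics job writing its output to the data lake.
lake = Path(tempfile.mkdtemp())
batch_output = [
    {"customer_id": "cust-1", "churn_risk": 0.82},
    {"customer_id": "cust-2", "churn_risk": 0.11},
]
(lake / "churn_scores.json").write_text(json.dumps(batch_output))

# Operational layer: load the latest batch output into a keyed store so
# online applications get fast lookups instead of scanning the lake.
def refresh_store(path: Path) -> dict:
    records = json.loads(path.read_text())
    return {r["customer_id"]: r for r in records}

store = refresh_store(lake / "churn_scores.json")

# Online application query: random access by key, in real time.
print(store["cust-1"]["churn_risk"])  # 0.82
```

In production the refresh step would typically be a scheduled or streaming export from Hadoop into the operational database, so that the serving layer always reflects the latest batch of insight.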
In part 2 of this blog series, we’ll look at the critical capabilities you need to consider when evaluating and selecting the operational database for your data lake.
Learn more by reading the Operational Data Lake white paper.
About the Author - Mat Keep
Mat is director of product and market analysis at MongoDB. He is responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.