Welcome back to our 3-part blog series on creating operational data lakes:
- In part 1, we discussed the rise of the data lake, the role of Hadoop, and the challenges in integrating the data lake with operational applications
- In part 2, we covered the critical capabilities you need to evaluate in an operational database for your data lake, and a recommended technology design pattern for integrating the database with the data lake
- In this final installment, we’ll wrap up by covering real-world examples and best practices from industry leaders
If you want to get a head-start and learn about all of these topics now, just go ahead and read the Operational Data Lake white paper.
Best Practice Case Studies
The following examples demonstrate how leading companies are using the design pattern discussed in part 2, and shown again below, to operationalize their data lakes.
Figure 1: Design pattern for operationalizing the data lake
UK’s Leading Price Comparison Site
Out-Innovating Competitors with MongoDB, Hadoop, Microservices, Docker, and the Cloud
The UK’s leading price comparison provider, and one of the country’s best-known household brands, has standardized on MongoDB Enterprise Advanced as the default operational database across its microservices architecture. The company’s online comparison systems need to collect customer details efficiently and then securely submit them to a number of different providers. Once the insurers' systems respond, the company can aggregate and display prices for consumers. At the same time, MongoDB generates real-time analytics to personalize the customer experience across the company’s web and mobile properties.
With the previous generation of systems, all application state was stored in the database, and then imported every 24 hours from backups into the company’s data warehouse. But that approach presented several critical issues:
- No real-time insight as the analytics processes were working against aged data.
- Application changes broke the ETL pipeline.
- The management overhead increased as more applications were added and data volumes grew.
As the company transitioned to microservices, the data warehousing and analytics stacks were also modernized. While each microservice uses its own MongoDB database, the company needs to maintain synchronization between services, so every application event is written to a Kafka queue. Event processing runs against the queue to identify relevant events that can then trigger specific actions – for example customizing customer questions, firing off emails, presenting new offers and more. Relevant events are written to MongoDB, enabling the user experience to be personalized in real time as customers interact with the service.
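The event flow described above can be sketched in a few lines of Python. This is only an illustration of the pattern, not the company's implementation: a plain list stands in for the Kafka queue, a dictionary stands in for a MongoDB collection, and the event types, the `RELEVANT_ACTIONS` mapping, and the function names are all hypothetical.

```python
from collections import defaultdict

# Hypothetical mapping from a relevant event type to the action it triggers.
RELEVANT_ACTIONS = {
    "quote_requested": "present_offer",
    "quote_abandoned": "send_reminder_email",
}


def new_profile_store():
    """Dictionary standing in for a MongoDB collection keyed by customer id."""
    return defaultdict(lambda: {"recent_events": []})


def process_events(events, profiles):
    """Scan the queue, keep relevant events, and collect the triggered actions.

    `events` stands in for the Kafka queue; `profiles` stands in for the
    operational database that drives personalization.
    """
    triggered = []
    for event in events:
        action = RELEVANT_ACTIONS.get(event["type"])
        if action is None:
            continue  # event is not relevant for personalization
        # Upsert-style update: the profile is created on first access.
        profile = profiles[event["customer_id"]]
        profile["recent_events"].append(event["type"])
        triggered.append((event["customer_id"], action))
    return triggered


if __name__ == "__main__":
    queue = [
        {"customer_id": "c1", "type": "quote_requested"},
        {"customer_id": "c1", "type": "page_viewed"},
        {"customer_id": "c2", "type": "quote_abandoned"},
    ]
    store = new_profile_store()
    print(process_events(queue, store))
```

In a real deployment the loop would be a Kafka consumer and the profile update a database upsert, but the shape is the same: filter the stream, act on what matters, and persist only the relevant events.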
All events are also written into Hadoop where they can be aggregated and processed with historical activity, in conjunction with additional customer data from the insurance providers. This enables the company to build enriched data views such as user profiles or policy offers. The models are then imported into the operational MongoDB databases to further enhance the user experience and maximize cross-sell and upsell opportunities.
As a result of its modernized architecture, the company has established a leading position in the highly competitive price comparison market. It achieved 2x faster time to market after migrating from its former SQL Server relational database to MongoDB, and enabled continuous delivery so that new features are pushed live every day.
Leading Global Airline
Revenue Optimization with MongoDB, Spark, and Hadoop
Through a series of mergers and acquisitions, the airline’s customer data was scattered across 100 different systems. As a result, the company had no way to gain a single, 360-degree view of the business in order to analyze customer behavior, identify gaps in product portfolios, or present a consistent and personalized passenger experience across airline brands.
With its data lake built on Hadoop, the airline initially evaluated Apache HBase to serve operational applications, but found the column-oriented data model to be restrictive. The need to pre-define column families meant that any functional change in the online applications would break HBase’s single view schema. The lack of secondary indexes prevented the database from efficiently handling the array of queries needed for customer care applications and real-time analytics.
After further technology evaluation, the company has been able to bring together customer profiles into a single view stored in MongoDB, distributed across multiple data centers to service the online web, mobile and call center applications. All customer interactions, ticket sales and account data are processed and stored in MongoDB, and then written to the company’s Hadoop cluster where Spark machine learning jobs are run to build customer classifications, optimize ticket pricing and identify churn risks. These are then retrieved by MongoDB to serve the online applications. Spark processes are also run against the live operational data in MongoDB to update customer classifications and personalize offers in real time, as the customer is live on the web or speaking with the call center.
With MongoDB, Hadoop, and Spark powering its modern data architecture, the airline is meeting its goals of delivering personalized experiences to the millions of passengers it carries every year, while optimizing ticket prices and enhancing service offerings that reduce competitive threat.
Stratio
Integrates Apache Spark and MongoDB to Unlock New Customer Insights for One of the World's Largest Banks
The Stratio Apache Spark-certified Big Data (BD) platform is used by an impressive client list including BBVA, Just Eat, Santander, SAP, Sony, and Telefonica. The company has implemented a unified real-time monitoring platform for a multinational banking group operating in 31 countries with 51 million clients all over the world. The bank wanted to ensure a high quality of service and personalized experience across its online channels, and needed to continuously monitor client activity to check service response times and identify potential issues. The application was built on a modern technology foundation including:
- Apache Flume to aggregate log data
- Apache Spark to process log events in real time
- MongoDB to persist log data, processed events and Key Performance Indicators (KPIs)
The aggregated KPIs stored in MongoDB enable the bank to analyze client and systems behavior in real time in order to improve the customer experience. Collecting raw log data allows the bank to immediately rebuild user sessions if a service fails, with analysis generated by MongoDB and Spark providing complete traceability to quickly identify the root cause of any issue.
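To make the idea of aggregated KPIs concrete, here is a small Python sketch that rolls raw log events up into per-service response-time figures. It is an assumption-laden illustration rather than Stratio's pipeline: the field names (`service`, `response_ms`), the KPI names, and the 500 ms slow-request threshold are all invented for the example, and plain Python stands in for the Spark job.

```python
from collections import defaultdict
from statistics import mean


def aggregate_kpis(log_events, slow_threshold_ms=500):
    """Group log events by service and compute simple response-time KPIs.

    The threshold and field names are hypothetical; a real job would read
    events from the log stream and persist the results to the database.
    """
    timings_by_service = defaultdict(list)
    for event in log_events:
        timings_by_service[event["service"]].append(event["response_ms"])

    kpis = {}
    for service, timings in timings_by_service.items():
        kpis[service] = {
            "requests": len(timings),
            "avg_response_ms": mean(timings),
            "max_response_ms": max(timings),
            "slow_requests": sum(1 for t in timings if t > slow_threshold_ms),
        }
    return kpis


if __name__ == "__main__":
    events = [
        {"service": "login", "response_ms": 100},
        {"service": "login", "response_ms": 300},
        {"service": "payments", "response_ms": 700},
    ]
    for service, kpi in aggregate_kpis(events).items():
        print(service, kpi)
```

Each KPI document produced this way is small and self-describing, which is the kind of record an operational database can serve to a monitoring dashboard with low latency.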
The project required a database that provided always-on availability, high performance, and linear scalability. In addition, a fully dynamic schema was needed to support high volumes of rapidly changing semi-structured and unstructured JSON data being ingested from a variety of logs, clickstreams, and social networks. After evaluating the project’s requirements, Stratio concluded MongoDB was the best fit. With MongoDB’s query projections and secondary indexes, analytic processes run by the Stratio BD platform avoid the need to scan the entire data set, which is not the case with simpler datastores.
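The benefit of secondary indexes and projections can be shown with a toy model. The sketch below is not how MongoDB implements indexes internally; it is a hypothetical in-memory collection (`TinyCollection`, with invented field names) that counts how many documents a query has to examine with and without an index, and returns only the projected fields.

```python
from collections import defaultdict


class TinyCollection:
    """In-memory stand-in for a document collection with one secondary index."""

    def __init__(self, documents, indexed_field):
        self.documents = list(documents)
        self.indexed_field = indexed_field
        # Secondary index: field value -> positions of matching documents.
        self.index = defaultdict(list)
        for position, doc in enumerate(self.documents):
            if indexed_field in doc:
                self.index[doc[indexed_field]].append(position)

    def find(self, field, value, projection):
        """Return (results, documents_examined) for an equality match."""
        if field == self.indexed_field:
            positions = self.index.get(value, [])  # index lookup
        else:
            positions = range(len(self.documents))  # full scan
        results = []
        for position in positions:
            doc = self.documents[position]
            if doc.get(field) == value:
                # Projection: return only the requested fields.
                results.append({k: doc[k] for k in projection if k in doc})
        return results, len(positions)


if __name__ == "__main__":
    logs = [
        {"user": "a", "channel": "web", "payload": "..."},
        {"user": "b", "channel": "mobile", "payload": "..."},
        {"user": "c", "channel": "web", "payload": "..."},
        {"user": "d", "channel": "branch", "payload": "..."},
    ]
    collection = TinyCollection(logs, indexed_field="channel")
    print(collection.find("channel", "mobile", ["user"]))  # uses the index
    print(collection.find("user", "a", ["channel"]))       # scans everything
```

The indexed query touches only the matching documents, while the unindexed one examines every document in the collection. On large log and clickstream volumes, that difference is what keeps analytic queries from scanning the entire data set.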
Working with some of the world’s largest enterprises, Stratio has seen data lakes growing in use. MongoDB’s distributed design and dynamic schema are a great fit, as it is impossible to predict what type of data structures will need to be managed at scale.
Learn more by reading the Stratio case study.
Hadoop-based data lakes are enabling organizations to efficiently capture and analyze unprecedented volumes of data generated from connected devices and users. But without being able to expose that data to operational applications, users are struggling to maximize returns on their Hadoop investments. The longer it takes to surface insight to operational processes, the less valuable that insight is. With its flexible data model, powerful in-database analytics, distributed, scale-out architecture, and low latency performance, MongoDB provides the best solution to operationalize the data lake.
Learn more by reading the Operational Data Lake white paper.
About the Author - Mat Keep
Mat is director of product and market analysis at MongoDB. He is responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp. with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development and analyst / programmer positions with both technology vendors and end-user companies.