MongoDB and Stratio: Building an Operational Data Lake for One of Spain’s Largest Insurance Companies

Mat Keep

#Customer Stories #Business

MongoDB, Apache Spark, Zeppelin, Hadoop and Kafka Improve Customer Experience and Optimize Marketing ROI

Data lakes are playing an increasingly critical role in modern enterprise data architectures. By aggregating data from multiple internal and external sources into a centralized repository, the data lake, often built on Apache Hadoop, provides a foundation for sophisticated analytics that can improve customer insight, raise operational efficiency, and reduce costs.

Beyond simply storing data, a key requirement of the data lake is the ability to serve analytics models to real-time, operational applications, and that requires more than just Hadoop. Powerful operational databases are needed to make those models accessible and actionable within operational applications. Exposing the models to online applications makes business processes smarter and more contextually aware: for example, presenting personalized recommendations to users, detecting and preventing fraudulent transactions while they are in flight, or predicting imminent failures in critical systems.
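As a minimal illustration of that serving pattern, the pymongo sketch below shows an operational application fetching a precomputed recommendation with a single indexed lookup; the connection string, collection, and field names are hypothetical, not Mutua’s actual schema:

```python
# Hypothetical sketch: an operational application reads a precomputed
# recommendation document that an analytics job has published to MongoDB.
from pymongo import MongoClient

client = MongoClient("mongodb://mongo-host:27017")
recs = client.serving.recommendations  # illustrative database/collection names

# A single indexed lookup keeps the request within the app's latency budget
doc = recs.find_one(
    {"customerId": "C-1042"},
    {"_id": 0, "products": 1, "modelVersion": 1},
)
if doc:
    print(doc["products"])  # ranked product IDs produced upstream by Spark
```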

Mutua Madrileña, one of the largest insurance companies in Spain, recognized the importance of creating a data lake in its drive to improve customer experience and optimize marketing spend. The company worked with the Pure Spark Platform from Stratio to implement an operational data lake, bringing together the following components (a minimal wiring sketch follows Figure 1):

  • Apache Kafka and Apache Flume for data ingestion
  • Apache Hadoop for storing raw data
  • Apache Spark for analytical processing and machine learning, orchestrated by Apache Mesos
  • MongoDB and Postgres for serving analytics models to operational applications and reporting tools
  • RStudio and Apache Zeppelin for business intelligence and analytics
*Figure 1: Stratio Pure Spark Platform: creating an operationalized data lake*
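As a rough illustration of how these pieces fit together, the PySpark sketch below reads raw events from the Hadoop layer, computes a simple aggregate in Spark, and publishes serving-ready documents to MongoDB. It assumes the MongoDB Spark Connector (10.x) is on the classpath; the paths, database, and collection names are illustrative:

```python
# Sketch: Spark reads raw data from HDFS and writes results to MongoDB
# via the MongoDB Spark Connector. All names and paths are assumptions.
from pyspark.sql import SparkSession, functions as F

spark = (SparkSession.builder
         .appName("datalake-serving")
         .config("spark.mongodb.write.connection.uri",
                 "mongodb://mongo-host:27017")
         .getOrCreate())

# Raw, semi-structured events landed in the Hadoop layer of the lake
events = spark.read.json("hdfs:///datalake/raw/web_events/")

# A simple per-customer aggregate standing in for a richer analytical model
profiles = (events.groupBy("customerId")
            .agg(F.count("*").alias("eventCount"),
                 F.max("timestamp").alias("lastSeen")))

# Publish the serving-ready documents to MongoDB
(profiles.write.format("mongodb")
 .mode("overwrite")
 .option("database", "serving")
 .option("collection", "customer_profiles")
 .save())
```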

Alvaro Santos, Big Data Architect at Stratio, presented the data lake built for Mutua Madrileña at this year’s MongoDB World conference. In his session, Alvaro discussed how the data lake is ingesting data from over 25 different sources to power a range of applications, including:

  • The creation of machine learning models to personalize the user experience across web and mobile channels, present product recommendations (see the sketch after this list), and classify insurance applicants by risk
  • Mapping of the customer journey through Mutua’s systems to understand user context and identify gaps in business processes
  • Collection and analysis of marketing campaign data to measure impact and improve performance
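For the recommendation use case, one common Spark approach, shown here purely as a sketch rather than as Mutua’s actual pipeline, is collaborative filtering with MLlib’s ALS; the input path and column layout are assumptions:

```python
# Sketch of a collaborative-filtering model with Spark MLlib's ALS.
# The curated (customerId, productId, score) dataset is an assumption.
from pyspark.sql import SparkSession
from pyspark.ml.recommendation import ALS

spark = SparkSession.builder.appName("recs-training").getOrCreate()

# ALS expects integer user and item IDs and a numeric rating column
ratings = spark.read.parquet("hdfs:///datalake/curated/product_interactions/")

als = ALS(userCol="customerId", itemCol="productId", ratingCol="score",
          coldStartStrategy="drop")
model = als.fit(ratings)

# Top-5 product recommendations per customer, ready to publish to MongoDB
top5 = model.recommendForAllUsers(5)
```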

Alvaro also discussed the selection of technology for the data lake. Apache Spark was chosen for its speed as a distributed processing engine, its rich machine learning libraries, and its developer-friendly APIs. MongoDB was chosen for:

  1. Its flexible data model, which allows rapidly changing, semi-structured data to be stored and processed with ease
  2. Its rich secondary indexes, which push query filtering down to the database so that operational applications can execute complex queries with low latency (see the sketch after this list)
  3. Its automatic sharding, to support a doubling of data volumes in the data lake, and native replication for always-on availability
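To make those three points concrete, here is a minimal pymongo sketch of the features named above; the database, collection, index, and shard-key names are illustrative:

```python
# Sketch of MongoDB secondary indexes and sharding; all names are assumptions.
from pymongo import MongoClient, ASCENDING

client = MongoClient("mongodb://mongos-host:27017")  # mongos of a sharded cluster
db = client.serving

# A compound secondary index lets operational queries filter inside MongoDB
db.customer_profiles.create_index([("region", ASCENDING),
                                   ("riskScore", ASCENDING)])

# Low-latency operational query served by the index above
high_risk = db.customer_profiles.find(
    {"region": "Madrid", "riskScore": {"$gte": 0.8}})

# Shard the collection on a hashed key so data volumes can scale out
client.admin.command("enableSharding", "serving")
client.admin.command("shardCollection", "serving.customer_profiles",
                     key={"customerId": "hashed"})
```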

View the slides from Alvaro’s session to learn more about Mutua’s data lake journey.

Learn more about unlocking operational intelligence from the data lake with our new white paper:

Download the white paper