MongoDB and Apache Spark at China Eastern Airlines: Delivering 100x Performance Improvements

Mat Keep


The new MongoDB Connector for Apache Spark enables China Eastern's new fare calculation engine, migrated off Oracle, to support 180 million fares and 1.6 billion queries per day.

As one of the world’s largest airlines, China Eastern constantly explores emerging technologies to identify new ways of improving customer experience and reducing cost.

Earlier this year, I interviewed its engineering leaders to learn more about the migration of China Eastern’s flight search platform from Oracle to MongoDB. As a result of the migration, China Eastern achieved orders-of-magnitude performance improvements, and enabled the delivery of application features that had never before been possible.

The next stage of China Eastern’s initiative to transform user experience has been to address airline fare calculations, and serve them reliably, at scale, to the migrated search application.

At this year’s MongoDB World user conference, Chong Huang, Lead Architect at China Eastern Airlines, presented details on their new fare calculation engine, built with Apache Spark and MongoDB to support more than 1.6 billion queries per day.

The Challenge

Based on averages used across the airline industry, 12,000 searches are needed to generate a single ticket sale. China Eastern currently sells 260,000 airline seats every day, and is targeting 50% of those sales to come from its online channels. Selling 130,000 seats online equates to 1.6 billion searches and fare requests per day, or just under 20,000 searches a second. However, the existing fare calculation system, built on the Oracle database, supports only 200 searches per second. China Eastern’s engineers realized they needed to radically re-architect their fare engine to meet the required 100x growth in search traffic.
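The capacity gap falls straight out of the arithmetic. Here is a minimal plain-Java sketch of the sums, using only the figures quoted above (nothing here is China Eastern code):

```java
public class SearchVolumeEstimate {
    public static void main(String[] args) {
        long searchesPerSale = 12_000;     // industry average: searches per ticket sold
        long seatsSoldOnline = 130_000;    // 50% of 260,000 daily seats via online channels

        long dailySearches = searchesPerSale * seatsSoldOnline;  // 1.56 billion
        long searchesPerSecond = dailySearches / 86_400;         // seconds in a day

        System.out.printf("Daily searches: %,d (~%,d per second)%n",
                dailySearches, searchesPerSecond);
        // Against the ~200 searches/second the Oracle system sustained,
        // that is roughly a 100x capacity gap.
    }
}
```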

The Solution

Rather than calculate fares for each search in real time, China Eastern decided to take a different approach. In the new system, they pre-compute fares every night, and then load those results into the database for access by the search application.

The flight inventory data is loaded into an Apache Spark cluster, which calculates fares by applying business rules stored in MongoDB. These rules compute fares across multiple permutations, including cabin class, route, direct or connecting flight, date, passenger status, and more. In total, 180 million prices are calculated every night! The results are then loaded into MongoDB where they are accessed by the search application.
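To make the nightly batch concrete, here is a minimal sketch of the permutation idea. The dimensions, values, and pricing rule below are hypothetical stand-ins; China Eastern's actual rules live in MongoDB and cover far more dimensions:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FarePrecompute {
    public static void main(String[] args) {
        // Hypothetical pricing dimensions; the real engine reads its business
        // rules from MongoDB and spans many more permutations.
        List<String> routes = Arrays.asList("PVG-PEK", "PVG-CAN");
        List<String> cabins = Arrays.asList("economy", "business", "first");
        List<String> dates  = Arrays.asList("2016-07-01", "2016-07-02");

        List<String> fares = new ArrayList<>();
        // Cross product of the dimensions, one pre-computed price per combination.
        for (String route : routes) {
            for (String cabin : cabins) {
                for (String date : dates) {
                    fares.add(route + "|" + cabin + "|" + date + "|" + price(route, cabin));
                }
            }
        }
        System.out.println(fares.size() + " fares pre-computed");
    }

    // Stand-in for rule evaluation; the real rules are applied by Spark.
    private static double price(String route, String cabin) {
        double base = 500.0;
        if ("business".equals(cabin)) return base * 2.5;
        if ("first".equals(cabin))    return base * 4.0;
        return base;
    }
}
```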

Why Apache Spark?

The fare engine demands complex computations against large volumes of flight inventory data. Apache Spark provides an in-memory data processing engine whose work can be distributed and parallelized across multiple nodes to accelerate processing. Testing confirmed linear scalability as CPU cores were added to the cluster. And because China Eastern is a Java shop, Apache Spark’s Java API keeps development simple for the engineering team.
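As a rough illustration of that parallelism, the sketch below prices a toy inventory with Spark's Java API. The inventory keys and the priceFor rule are hypothetical; a real job would load inventory from the nightly feed rather than parallelize a hard-coded list:

```java
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

import java.util.Arrays;

public class SparkFareJob {
    public static void main(String[] args) {
        // local[*] for a self-contained run; in production the master is
        // supplied by the cluster at submit time.
        SparkConf conf = new SparkConf().setAppName("fare-engine").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical inventory keys standing in for the flight inventory load.
        JavaRDD<String> inventory = sc.parallelize(
                Arrays.asList("PVG-PEK|economy", "PVG-PEK|business", "PVG-CAN|economy"));

        // Each partition is priced independently on a worker, which is what
        // gives the near-linear scaling observed as CPU cores are added.
        JavaRDD<String> fares = inventory.map(key -> key + "|" + priceFor(key));

        System.out.println(fares.count() + " fares computed");
        sc.stop();
    }

    private static double priceFor(String key) { return 500.0; }  // placeholder rule
}
```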

Why MongoDB?

China Eastern had already implemented a successful project with MongoDB. They knew the database would scale to meet the throughput and latency needs of the fare engine, and they could rely on expert support from MongoDB engineers. The database’s flexible data model would allow multiple fares to be stored for each route in a single document, accessed in one round trip to the database, rather than having to JOIN many tables to serve fares from their Oracle database.
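A minimal sketch of that single-round-trip lookup with the MongoDB Java driver is shown below. The database, collection, and field names (fareEngine, fares, route) are illustrative assumptions, not China Eastern's actual schema:

```java
import com.mongodb.MongoClient;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

import static com.mongodb.client.model.Filters.eq;

public class FareLookup {
    public static void main(String[] args) {
        // Hypothetical document shape -- one document per route, with all of
        // that route's fares embedded as an array:
        //   { "route": "PVG-PEK",
        //     "fares": [ { "cabin": "economy",  "date": "2016-07-01", "price": 500.0  },
        //                { "cabin": "business", "date": "2016-07-01", "price": 1250.0 } ] }
        MongoClient client = new MongoClient("localhost", 27017);
        MongoCollection<Document> fares =
                client.getDatabase("fareEngine").getCollection("fares");

        // One query, one round trip: every fare for the route comes back in a
        // single document, with no JOINs to assemble.
        Document routeFares = fares.find(eq("route", "PVG-PEK")).first();
        System.out.println(routeFares);

        client.close();
    }
}
```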

Figure 1: Fare Engine Architecture

Why the MongoDB Connector for Apache Spark?

Just as the project was starting, MongoDB announced the new Databricks-certified MongoDB Connector for Apache Spark.

Using the Connector, Spark can directly read data from the MongoDB collection and turn it into a Spark Resilient Distributed Dataset (RDD), against which transformations and actions can be performed. Internally the Connector uses the splitVector command to create chunk splits, and each chunk can then be assigned to one Spark worker for processing. Data locality awareness in the Connector ensures RDDs are co-located with the associated MongoDB shard, thereby minimizing data movement across the network and reducing latency. Once the transformations and actions have been completed on the RDD, the results can be written back to MongoDB.
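The sketch below shows that read-transform-write cycle with the Connector's Java API. The URIs, collection names, and the placeholder fare calculation are assumptions for illustration; the Connector picks up its source and sink collections from the Spark configuration:

```java
import com.mongodb.spark.MongoSpark;
import com.mongodb.spark.rdd.api.java.JavaMongoRDD;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.Document;

public class ConnectorExample {
    public static void main(String[] args) {
        // Hypothetical input and output URIs for the rules and fares collections.
        SparkConf conf = new SparkConf()
                .setAppName("fare-engine")
                .setMaster("local[*]")  // supplied by the cluster in production
                .set("spark.mongodb.input.uri",  "mongodb://host:27017/fareEngine.rules")
                .set("spark.mongodb.output.uri", "mongodb://host:27017/fareEngine.fares");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Load the collection as an RDD; the Connector partitions it so each
        // chunk can be processed by one worker, co-located with its shard.
        JavaMongoRDD<Document> rules = MongoSpark.load(sc);

        // Placeholder transformation standing in for the fare calculation.
        JavaRDD<Document> fares = rules.map(rule ->
                new Document("route", rule.getString("route")).append("price", 500.0));

        // Write the computed fares back to MongoDB for the search application.
        MongoSpark.save(fares);
        sc.stop();
    }
}
```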

The Implementation

Based on performance benchmarking, fewer than 20 Red Hat Linux servers (comprising the app servers, and the Spark and MongoDB clusters) are required to meet the demands of 180 million fares and 1.6 billion daily searches. China Eastern is using Apache Spark 1.6 against the latest MongoDB 3.2 release, provisioned with Ops Manager for operational automation. Testing has shown each node in the cluster scaling linearly, delivering 15x higher performance at 10x lower latency than the previous Oracle-based system.

Next Steps

In his MongoDB World presentation, Mr. Huang provides more insight into the platform built with Apache Spark and MongoDB. He shares detailed steps and code samples showing how to download and set up the Spark cluster, how to configure the MongoDB Connector for Apache Spark, the process for submitting a job, and lessons learned along the way for optimizing performance.

View the slides from MongoDB World

Download our new whitepaper for examples and guidance on turning analytics into real-time action with Apache Spark and MongoDB.

About the Author - Mat Keep

Mat is a director within the MongoDB product marketing team, responsible for building the vision, positioning and content for MongoDB’s products and services, including the analysis of market trends and customer requirements. Prior to MongoDB, Mat was director of product management at Oracle Corp., with responsibility for the MySQL database in web, telecoms, cloud and big data workloads. This followed a series of sales, business development, and analyst/programmer positions with both technology vendors and end-user companies.