MongoDB Connector for Apache Spark

Build new classes of sophisticated, real-time analytics by combining Apache Spark, the industry’s leading data processing engine, with MongoDB, the industry’s fastest-growing database. The MongoDB Connector for Apache Spark is generally available, certified, and supported for production use today. Sign up for the free MongoDB University course to get on the fast track to your next data science project.

Access Insights Now

We live in a world of “big data”. But it isn’t just the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch extract, transform, load (ETL) processes to update the enterprise data warehouse (EDW) is no longer sufficient.


“Users are already combining Apache Spark and MongoDB to build sophisticated analytics applications. The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Apache Spark functionality than any MongoDB connector available today.”

Reynold Xin, Co-Founder and Chief Architect of Databricks

Unlock the Power of Apache Spark

The MongoDB Connector for Apache Spark exposes all of Spark’s libraries, with APIs for Scala, Java, Python, and R. MongoDB data is materialized as DataFrames and Datasets for analysis with machine learning, graph, streaming, and SQL APIs.
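For example, loading a collection into a DataFrame takes only a few lines of Scala. The following is a minimal sketch, not a production recipe: the connection URI and the test.customers collection are placeholders, and it assumes the connector’s MongoSpark helper is on the classpath.

    import org.apache.spark.sql.SparkSession
    import com.mongodb.spark.MongoSpark

    // Placeholder URI pointing the connector at a test.customers collection.
    val spark = SparkSession.builder()
      .appName("mongodb-spark-example")
      .config("spark.mongodb.input.uri", "mongodb://localhost/test.customers")
      .getOrCreate()

    // Materialize the collection as a DataFrame; the schema is inferred
    // by sampling documents from the collection.
    val customers = MongoSpark.load(spark)
    customers.printSchema()

    // The DataFrame is now available to Spark SQL, MLlib, and the rest
    // of the Spark APIs.
    customers.createOrReplaceTempView("customers")
    spark.sql("SELECT city, COUNT(*) FROM customers GROUP BY city").show()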


Leverage the Power of MongoDB

The MongoDB Connector for Apache Spark can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simple NoSQL datastores that do not offer secondary indexes or in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for data scientists and engineers.
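As an illustration, the connector’s RDD API accepts an aggregation pipeline that runs inside MongoDB before any data reaches Spark. The sketch below assumes a customers collection with an indexed country field, and an existing SparkContext (sc) configured with spark.mongodb.input.uri.

    import com.mongodb.spark.MongoSpark
    import org.bson.Document

    // Load an RDD backed by the configured MongoDB collection.
    val rdd = MongoSpark.load(sc)

    // Push a $match stage into MongoDB's aggregation pipeline: only
    // documents for the target geography are extracted, and a secondary
    // index on "country" lets MongoDB avoid a full collection scan.
    val mexicoCustomers = rdd.withPipeline(Seq(
      Document.parse("""{ $match: { country: "Mexico" } }""")
    ))

    println(mexicoCustomers.count())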

To maximize performance across large, distributed data sets, the MongoDB Connector for Apache Spark can co-locate Resilient Distributed Datasets (RDDs) with the source MongoDB node, thereby minimizing data movement across the cluster and reducing latency.
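Partition placement is controlled through the connector’s partitioner configuration. The snippet below is a sketch using the connector’s documented configuration keys; the host names and namespace are placeholders.

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("mongodb-colocated-analytics")
      // Placeholder shard hosts and namespace.
      .set("spark.mongodb.input.uri", "mongodb://shard0,shard1,shard2/analytics.events")
      // On a sharded cluster, MongoShardedPartitioner aligns Spark
      // partitions with MongoDB chunks, so Spark workers deployed
      // alongside the mongod nodes read mostly local data.
      .set("spark.mongodb.input.partitioner", "MongoShardedPartitioner")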

MongoDB and Apache Spark: Working for Data Science Teams Today

While MongoDB natively offers rich real-time analytics capabilities, there are use cases where integrating the Apache Spark engine can extend the processing of operational data managed by MongoDB. This allows users to operationalize results generated from Spark within real-time business processes supported by MongoDB.

  • China Eastern Airlines uses the MongoDB Connector for Apache Spark in its new fare calculation engine, serving 1.6 billion queries per day.

  • Black Swan, a big data visualization platform, uses MongoDB and Spark to analyze data collected from social media and customer systems, building machine learning models that blend consumer insight with marketing programs and demand forecasting to better predict business results and inform investment priorities.

  • Artificial intelligence personal assistant company x.ai uses MongoDB and Spark for distributed machine learning problems.

  • Stratio implemented its Pure Spark big data platform, combining Apache Spark and MongoDB, for a real-time customer experience application at a multinational banking group operating in 31 countries with 51 million clients.

  • A global airline has consolidated customer data scattered across more than 100 systems into a single view stored in MongoDB. Spark processes are run against the live operational data in MongoDB to update customer classifications and personalize offers in real time, as the customer is live on the web or speaking with the call center.
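The pattern common to these deployments is a round trip: Spark reads operational data from MongoDB, computes a result, and writes it back where live applications can act on it. A minimal sketch follows; the URIs, collection names, and the toy aggregation are placeholders for illustration.

    import org.apache.spark.sql.SparkSession
    import com.mongodb.spark.MongoSpark

    val spark = SparkSession.builder()
      .config("spark.mongodb.input.uri", "mongodb://localhost/crm.customers")
      .config("spark.mongodb.output.uri", "mongodb://localhost/crm.segmentCounts")
      .getOrCreate()

    // Read live operational data and derive a result in Spark...
    val segmentCounts = MongoSpark.load(spark)
      .groupBy("segment")
      .count()

    // ...then persist it back to MongoDB, where web and call-center
    // applications can query it in real time.
    MongoSpark.save(segmentCounts)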


“Building an artificial intelligence (AI) application requires huge amounts of data to be processed at once, both reliably and efficiently. To store all that data, we use MongoDB for its flexible data model and its scaling capabilities. And to process all of that data to build machine learning models, we build robust pipelines in Scala using the distributed data processing capabilities of Spark. Now, with the new native MongoDB Connector for Apache Spark, we have an even better way of connecting up these two key pieces of our infrastructure.”

Next Steps