MongoDB Connector for Apache Spark: Announcing Early Access Program & New Spark Course
**Update: August 4th 2016**
Since this original post, the connector has been declared generally available for production usage. Click through for a tutorial on using the new MongoDB Connector for Apache Spark.
We live in a world of “big data”. But it isn’t only the data itself that is valuable – it’s the insight it can generate. How quickly an organization can unlock and act on that insight has become a major source of competitive advantage. Collecting data in operational systems and then relying on nightly batch ETL (Extract Transform Load) processes to update the Enterprise Data Warehouse (EDW) is no longer sufficient. Speed-to-insight is critical, and so analytics against live operational data to drive real-time action is fast becoming a necessity, enabled by a new generation of technologies like MongoDB and Apache Spark.
The new native MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than any connector available today.
The new MongoDB University course for Apache Spark provides a fast-track introduction for developers and data scientists building new generations of operational applications incorporating sophisticated real-time analytics.
The Rise of Apache Spark
Apache Spark is one of the fastest-growing big data projects in the history of the Apache Software Foundation. With its memory-oriented architecture, flexible processing libraries, and ease of use, Spark has emerged as a leading distributed computing framework for real-time analytics.
As a general-purpose framework, Spark is used for many types of data processing – it comes packaged with support for machine learning, interactive queries (SQL), statistical queries with R, graph processing, ETL, and streaming. Spark allows programmers to develop complex, multi-step data pipelines using a directed acyclic graph (DAG) pattern. It supports in-memory data sharing across DAGs, so that different jobs can work with the same data. Additionally, Spark supports a variety of popular programming languages including Scala, Java, and Python.
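To make the DAG idea concrete, here is a minimal Scala sketch of a multi-step pipeline that reuses an in-memory intermediate result across two jobs; the input path and application name are placeholders.

```scala
// A minimal sketch of a multi-step Spark pipeline (Scala). The input path is a placeholder.
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("pipeline-example"))

// Each transformation adds a node to the DAG; nothing executes until an action runs.
val events = sc.textFile("hdfs:///data/events.log")
val errors = events.filter(_.contains("ERROR")).cache() // keep in memory for reuse

// Two separate jobs share the same cached intermediate data.
println(errors.count())
println(errors.map(_.length).sum())
```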
Sign up for the new Spark course at MongoDB University.
For loading and storing data, Spark integrates with a number of storage and messaging platforms including Amazon S3, Kafka, HDFS, machine logs, relational databases, NoSQL datastores, MongoDB, and more.
MongoDB and Spark Today
While MongoDB natively offers rich real-time analytics capabilities, there are use cases where integrating the Spark engine can extend the processing of operational data managed by MongoDB. This allows users to operationalize results generated from Spark within real-time business processes supported by MongoDB.
Examples of users already using MongoDB and Spark to build modern, data-driven applications include:
A multinational banking group operating in 31 countries with 51 million clients has implemented a unified real-time monitoring application with Apache Spark and MongoDB. The platform enables the bank to improve customer experience by continuously monitoring client activity across its online channels to check service response times and identify potential issues.
A global manufacturing company estimates warranty returns by analyzing material samples from production lines. The collected data enables them to build predictive failure models using Spark machine learning and MongoDB.
A video sharing website is using Spark with MongoDB to place relevant advertisements in front of users as they browse, view, and share videos.
A global airline has consolidated customer data scattered across more than 100 systems into a single view stored in MongoDB. Spark processes are run against the live operational data in MongoDB to update customer classifications and personalize offers in real time, as the customer is live on the web or speaking with the call center.
Artificial intelligence personal assistant company x.ai uses MongoDB and Spark for distributed machine learning problems.
There are a number of ways users integrate MongoDB with Spark. For example, the MongoDB Connector for Hadoop provides a plug-in for Spark. There are also multiple third-party connectors available.
Today we are announcing early access to a new native Spark connector for MongoDB.
Introducing the MongoDB Connector for Apache Spark
The new MongoDB Connector for Apache Spark provides higher performance, greater ease of use, and access to more advanced Spark functionality than the MongoDB Connector for Hadoop. The following table compares the capabilities of both connectors.
| Capability | MongoDB Connector for Spark | MongoDB Connector for Hadoop with Spark Plug-In |
| --- | --- | --- |
| Written in Scala, Spark’s native language | Yes | No, Java |
| Support for Scala, Java, Python & R APIs | Yes | Yes |
| Support for the Spark interactive shell | Yes | Yes |
| Support for native Spark RDDs | Yes | No; Java RDDs, which are more verbose and complex to work with |
| Support for Spark DataFrames and Datasets | Yes | DataFrames only; schema must be manually inferred |
| Automated MongoDB schema inference | Yes | No |
| Support for Spark core | Yes | Yes |
| Support for Spark SQL | Yes | Yes |
| Support for Spark Streaming | Yes | Yes |
| Support for Spark Machine Learning | Yes | Yes |
| Support for Spark GraphX | Yes | No |
| Data locality awareness | Yes; the Spark connector is aware of which MongoDB partitions are storing data | No |
| Support for MongoDB secondary indexes to filter input data | Yes | Yes |
| Support for MongoDB aggregation pipeline to filter input data | Yes | No |
| Compatibility with MongoDB replica sets and sharded clusters | Yes | Yes |
| Support for MongoDB 2.6 and higher | Yes | Yes |
| Support for Spark 1.6 and above | Yes | Yes |
| Supported for production usage | Not currently; available for early access evaluation | Yes |
Written in Spark’s native language, the new connector provides a more natural development experience for Spark users as they are quickly able to apply their Scala expertise. The connector provides access to the Spark interactive shell for data exploration and rapid prototyping. The connector exposes all of Spark’s libraries, enabling MongoDB data to be materialized as DataFrames and Datasets for analysis with SQL (benefiting from automatic schema inference), streaming, machine learning, and graph APIs.
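As an illustration, here is a minimal Scala sketch of loading a MongoDB collection as a DataFrame with automatic schema inference and querying it with Spark SQL; the connection URI, database, collection, and field names are placeholders, and the exact API calls should be checked against the connector documentation.

```scala
// A minimal sketch: reading a MongoDB collection into a Spark DataFrame (Scala, Spark 1.6 era).
// The URI, database, and collection are placeholders.
import com.mongodb.spark.MongoSpark
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

val conf = new SparkConf()
  .setAppName("mongo-dataframe-example")
  .set("spark.mongodb.input.uri", "mongodb://localhost/test.customers")

val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)

// The connector samples the collection and infers the DataFrame schema automatically.
val customers = MongoSpark.load(sqlContext)
customers.printSchema()

// Ordinary Spark SQL then works against the MongoDB-backed DataFrame.
customers.registerTempTable("customers")
sqlContext.sql("SELECT name, city FROM customers WHERE age > 30").show()
```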
The Spark connector can take advantage of MongoDB’s aggregation pipeline and rich secondary indexes to extract, filter, and process only the range of data it needs – for example, analyzing all customers located in a specific geography. This is very different from simpler NoSQL datastores that do not offer either secondary indexes or in-database aggregations. In these cases, Spark would need to extract all data based on a simple primary key, even if only a subset of that data is required for the Spark process. This means more processing overhead, more hardware, and longer time-to-insight for the analyst.
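As a hedged sketch of this pushdown, the following Scala snippet filters the input with a $match stage that runs inside MongoDB, so only the relevant documents ever reach Spark; the collection and field names are illustrative.

```scala
// A minimal sketch: filtering input data with MongoDB's aggregation pipeline.
// Assumes sc is a SparkContext already configured with spark.mongodb.input.uri.
import com.mongodb.spark.MongoSpark
import org.bson.Document

val customersRdd = MongoSpark.load(sc)

// The $match stage executes inside MongoDB and can use a secondary index on "country",
// so Spark never receives documents outside the requested geography.
val ukCustomers = customersRdd.withPipeline(
  Seq(Document.parse("""{ "$match": { "country": "UK" } }""")))

println(ukCustomers.count())
```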
To maximize performance across large, distributed data sets, the Spark connector is aware of data locality in a MongoDB cluster. RDDs are automatically co-located with the associated MongoDB shard to minimize data movement across the cluster. The nearest read preference can be used to route Spark queries to the closest physical node in a MongoDB replica set, thus reducing latency.
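As a hedged sketch, the read preference can be set through the connector's read configuration; the option key below follows the connector documentation, while the URI, database, and collection are inherited from the existing SparkContext.

```scala
// A minimal sketch: routing reads to the nearest replica set member.
import com.mongodb.spark.MongoSpark
import com.mongodb.spark.config.ReadConfig

// Override only the read preference; take everything else (URI, database,
// collection) from the existing SparkContext configuration.
val nearestReadConfig = ReadConfig(
  Map("readPreference.name" -> "nearest"),
  Some(ReadConfig(sc)))

val rdd = MongoSpark.load(sc, nearestReadConfig)
```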
Review the MongoDB Connector for Spark documentation to learn how to get started with the connector, and view code snippets for different APIs and libraries.
Fast Track to Apache Spark: New MongoDB University Course
To get the most out of any technology, you need more than documentation and code. Over 350,000 students have registered for developer and operations courses from MongoDB University. Now developers and budding data scientists can get a quick-start introduction to Apache Spark and the MongoDB connector with early access to our new online course.
Getting Started with Spark and MongoDB provides an introduction to Spark and teaches students how to use the new connector to build data analytics applications. In this course, we provide an overview of the Spark Scala and Java APIs with plenty of sample code and demonstrations. Upon completing this course, students will be able to:
Outline the roles of major components in the Spark framework
Connect Spark to MongoDB
Source data from MongoDB for processing in Spark
Write data from Spark into MongoDB
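To give a flavor of that last objective, here is a minimal sketch of writing documents from Spark back into MongoDB; the output URI, collection, and field name are placeholders.

```scala
// A minimal sketch: saving an RDD of BSON documents to MongoDB.
// Assumes sc is a SparkContext configured with spark.mongodb.output.uri,
// e.g. mongodb://localhost/test.sparkResults (placeholder values).
import com.mongodb.spark.MongoSpark
import org.bson.Document

val results = sc.parallelize(1 to 10).map(i => new Document("value", i))
MongoSpark.save(results)
```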
The course does not assume prior knowledge of Spark, but does require an intermediate level of expertise with MongoDB.
The course is free. Sign up at MongoDB University.
Next Steps
To wrap up, we are very excited about the possibilities Spark and MongoDB present together, and we hope that with the new connector and course, you will be well on your way to building modern, data-driven applications. We would love to hear from you as you explore this new connector and put it through its paces; you can provide feedback and file bugs under the MongoDB Spark Jira project.
Here’s a summary of how to get started:
Read the MongoDB Connector for Spark documentation and download the connector
If you have any questions, please send them to the MongoDB user mailing list
Sign up for the new Spark course at MongoDB University
May 18, 2016