Spark and MongoDB

Combine the leading analytics processing engine with the fastest-growing database for real-time analytics.

With its rich query language, aggregation pipeline, and powerful indexing, MongoDB lets developers and data scientists generate many classes of analytics. Integrating Apache Spark unlocks additional classes of analytics, directly within operational applications, to drive real-time insight and action.
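As a sketch of the kind of analytics MongoDB can serve on its own, the following aggregation pipeline (written with PyMongo) groups revenue by customer. It assumes a MongoDB instance running locally and a hypothetical "orders" collection; the database, collection, and field names are illustrative only.

```python
from pymongo import MongoClient

# Assumes a local MongoDB instance and a hypothetical "orders" collection
# with fields: customer_id, status, amount.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Aggregation pipeline: keep completed orders, sum revenue per customer,
# then sort customers by revenue, highest first.
pipeline = [
    {"$match": {"status": "complete"}},
    {"$group": {"_id": "$customer_id", "revenue": {"$sum": "$amount"}}},
    {"$sort": {"revenue": -1}},
]

for doc in orders.aggregate(pipeline):
    print(doc["_id"], doc["revenue"])
```

Because the pipeline runs inside the database, it can exploit MongoDB's indexes and avoids shipping raw documents to the application.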

What is Spark?

Originally developed in the UC Berkeley AMPLab in 2009 as a general-purpose framework used for data processing, Apache Spark has quickly become one of the fastest-growing big data projects in the history of the Apache Software Foundation. With its memory-oriented architecture and ease-of-use, Spark has emerged as a leading distributed computing framework for real-time analytics.

Spark supports a variety of popular development languages including Scala, Java, R, and Python.

Organizations typically use Spark for:

  • Speed. By exploiting in-memory optimizations, Spark has shown up to 100x higher performance than MapReduce running on Hadoop.

  • A Unified Framework. Spark comes packaged with higher-level libraries, including support for SQL queries, machine learning, stream and graph processing. These standard libraries increase developer productivity and can be combined to create complex workflows.

  • Simplicity. Spark includes easy-to-use APIs for operating on large datasets. This includes a collection of sophisticated operators for transforming and manipulating semi-structured data.
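A minimal illustration of that simplicity, using PySpark's DataFrame API. This is a self-contained sketch: the event data and column names are invented for the example, and it assumes PySpark is installed and a local Spark session can be created.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("example").getOrCreate()

# A small in-memory dataset standing in for semi-structured event data.
events = spark.createDataFrame(
    [("page_view", 3), ("click", 1), ("page_view", 7)],
    ["event_type", "duration"],
)

# A few of Spark's high-level operators: filter, group, aggregate.
summary = (events.filter(F.col("duration") > 1)
                 .groupBy("event_type")
                 .agg(F.avg("duration").alias("avg_duration")))

summary.show()
spark.stop()
```

The same operators scale unchanged from this toy dataset to a cluster processing terabytes.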

What is MongoDB?

MongoDB is the most popular non-relational database, counting more than one third of the Fortune 100 as customers. Its flexible JSON-based document data model, dynamic schema and automatic scaling on commodity hardware make MongoDB an ideal fit for modern, always-on applications that must manage high volumes of rapidly changing, multi-structured data. Internet of Things (IoT), mobile apps, social engagement, customer data and content management systems are prime examples of MongoDB use cases.

Why Integrate Spark and MongoDB?

When the two are used together, Spark jobs can be executed directly on operational data sitting in MongoDB without the time and expense of ETL processes. MongoDB can then efficiently index and serve analytics results back into live, operational processes. This approach offers many benefits to teams tasked with delivering modern, data-driven applications:

  • Developers can build more functional applications faster, using a single database technology.

  • Operations teams no longer need to shuffle data between separate operational and analytics infrastructure, each with its own unique configuration, maintenance and management requirements.

  • CIOs deliver faster time-to-insight for the business, with lower cost and risk.
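As a sketch of what this integration typically looks like in code, the example below uses the MongoDB Connector for Spark to read operational data, run an analytics job, and write the results back. The URI, database, collection, and field names are hypothetical, and the connector package (and its version) shown in the comment is illustrative; it must be on the Spark classpath.

```python
from pyspark.sql import SparkSession

# Assumes the MongoDB Connector for Spark is available, e.g. launched with:
#   spark-submit --packages org.mongodb.spark:mongo-spark-connector_2.12:10.2.1 ...
spark = (SparkSession.builder
         .appName("mongo-analytics")
         .config("spark.mongodb.read.connection.uri", "mongodb://localhost:27017")
         .config("spark.mongodb.write.connection.uri", "mongodb://localhost:27017")
         .getOrCreate())

# Read operational data directly from MongoDB -- no ETL step.
orders = (spark.read.format("mongodb")
          .option("database", "shop")
          .option("collection", "orders")
          .load())

# Run an analytics job on the live data...
revenue = orders.groupBy("customer_id").sum("amount")

# ...and write the results back for the operational application to serve.
(revenue.write.format("mongodb")
 .option("database", "shop")
 .option("collection", "customer_revenue")
 .mode("overwrite")
 .save())

spark.stop()
```

The results collection can then be indexed and queried by the application like any other MongoDB collection.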

Use Cases

  • A global manufacturing company has built a pilot project to estimate warranty returns by analyzing material samples from production lines. The collected data enables them to build predictive failure models using Spark Machine Learning and MongoDB.

  • A video sharing website is using Spark with MongoDB to place relevant advertisements in front of users as they browse, view and share videos.

  • A multinational banking group operating in 31 countries with 51 million clients implemented a unified real-time monitoring application with the Stratio Big Data (BD) platform, running Apache Spark and MongoDB. The bank wanted to ensure a high quality of service across its online channels, and so needed to continuously monitor client activity to check service response times and identify potential issues.

Ready to get started?

Download our white paper to learn more about how Spark and MongoDB work together and how you can get started.