When To Use Apache Spark With MongoDB

Apache Spark is a processing engine designed for speed, ease of use, and sophisticated analytics, and it excels where fast performance is required. MongoDB is a popular NoSQL database that enterprises rely on for real-time analytics on their operational data. As powerful as MongoDB is on its own, integrating Apache Spark extends its analytics capabilities further, adding machine learning and more advanced real-time analysis.

With Spark and MongoDB, developers can build more functional applications faster using a single database technology. Integrating these two Big Data technologies also saves operations teams the hassle of shuttling data between separate operational and analytics infrastructure. For CIOs, the combination enables faster time-to-insight for their businesses, with lower cost and risk.

Here are a few scenarios in which it makes sense to use Apache Spark with MongoDB.

Rich Operators & Algorithms. Spark supports over 100 different operators and algorithms for processing data. Developers can use these to perform advanced computations that would otherwise take more programming effort, because they would have to combine the MongoDB aggregation framework with custom application code.

For example, a web analytics platform built on MongoDB would provide insight into the performance of your content by geography and by audience. Adding Spark’s machine learning algorithms would allow you to go even further by taking those insights and then serving up targeted content recommendations for your readers.

Processing Paradigm. MongoDB drivers are available for many programming languages. An application can use one to execute queries against the database and then run additional analytics on the returned results using standard machine learning and statistics libraries.

In this scenario, a developer could use the MongoDB Python or R drivers to query the database. But this approach becomes increasingly complex once the application has to be distributed across multiple threads and nodes. Apache Spark makes this kind of distributed processing easier and faster to develop, because Spark jobs can run directly against data in MongoDB. As a result, the integration makes fast, real-time analysis possible.
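
As a rough illustration of a Spark job running directly against MongoDB data, the sketch below assumes the MongoDB Spark Connector (version 10.x, which registers the `mongodb` data source) is available to Spark. The connection URI, the `analytics` database, the `pageviews` collection, and the `country` field are hypothetical placeholders, not names taken from this article.

```python
def mongo_read_options(uri, database, collection):
    """Build reader options for the MongoDB Spark Connector (10.x assumed)."""
    return {
        "connection.uri": uri,
        "database": database,
        "collection": collection,
    }


def load_collection(spark, options):
    """Load a MongoDB collection as a Spark DataFrame.

    `spark` is an existing SparkSession. The "mongodb" format name
    assumes connector 10.x is on the Spark classpath.
    """
    return spark.read.format("mongodb").options(**options).load()


def views_by_country(spark):
    """Count page views per country; Spark distributes the work."""
    options = mongo_read_options(
        "mongodb://localhost:27017",  # placeholder URI
        "analytics",                  # hypothetical database
        "pageviews",                  # hypothetical collection
    )
    # "country" is a hypothetical field in each page-view document.
    return load_collection(spark, options).groupBy("country").count()
```

Given a running SparkSession named `spark`, calling `views_by_country(spark).show()` would print the per-country counts. The grouping is planned and executed by Spark across its workers, so the application code does not manage threads or nodes itself.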

Skills Re-Use. With libraries for SQL, machine learning and others – combined with programming in Java, Scala and Python – developers can leverage existing skills and best practices to build sophisticated analytics workflows on top of MongoDB.
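
To make the skills re-use point concrete, here is a minimal sketch of querying MongoDB-backed data with plain SQL through Spark. It assumes a Spark DataFrame has already been loaded from a MongoDB collection; the `pageviews` view name and the `section` field are hypothetical.

```python
def top_sections_sql(limit=10):
    """Return a SQL query ranking site sections by page views."""
    return (
        "SELECT section, COUNT(*) AS views "
        "FROM pageviews "
        "GROUP BY section "
        "ORDER BY views DESC "
        f"LIMIT {int(limit)}"
    )


def top_sections(spark, pageviews_df, limit=10):
    """Run the SQL above against a DataFrame loaded from MongoDB.

    `spark` is an existing SparkSession and `pageviews_df` is a Spark
    DataFrame whose rows have a (hypothetical) `section` column.
    """
    # Register the DataFrame under a name that SQL queries can reference.
    pageviews_df.createOrReplaceTempView("pageviews")
    return spark.sql(top_sections_sql(limit))
```

An analyst who knows SQL but not the MongoDB aggregation framework could write and maintain the query string, while the surrounding Python stays a thin wrapper.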

Together, MongoDB and Apache Spark turn analytics into real-time action. Learn more about how this integration can benefit your organization by downloading our white paper.