Using MongoDB with Hadoop & Spark: Part 1 - Introduction & Setup

Matt Kalan


**Update: August 4th 2016** Since this original post, MongoDB has released a new certified connector for Spark. Click through for a tutorial on using the new MongoDB Connector for Apache Spark.

Hadoop is a software technology designed for storing and processing large volumes of data distributed across a cluster of commodity servers and commodity storage. Hadoop was initially inspired by papers published by Google outlining its approach to handling large volumes of data as it indexed the Web. Many organizations are now harnessing the power of Hadoop and MongoDB together to create complete big data applications: MongoDB powers the online, real-time operational application, while Hadoop consumes data from MongoDB and blends it with data from other operational systems to fuel sophisticated analytics and machine learning.

This is part one of a three-part series on MongoDB and Hadoop:

  1. Introduction & Setup of Hadoop and MongoDB
  2. Hive Example
  3. Spark Example & Key Takeaways

Introduction & Setup of Hadoop and MongoDB

There are many, many data management technologies available today, and that makes it hard to discern hype from reality. Working at MongoDB Inc., I know the immense value of MongoDB as a great real-time operational database for applications; however, for analytics and batch operations, I wanted a clearer understanding of the options available and when to use some of the other great options like Spark.

I started with a simple example: taking 1-minute time series intervals of stock prices with the opening (first) price, high (max), low (min), and closing (last) price of each time interval (called OHLC bars) and turning them into 5-minute intervals. The 1-minute data is stored in MongoDB and is then processed in Hive or Spark via the MongoDB Hadoop Connector, which allows MongoDB to be an input or output to/from Hadoop and Spark.

A more typical scenario is that you record this market data in MongoDB for real-time purposes but then run offline analytical models in another environment. Of course, real models would be far more complicated; this is just a Hello World-level example. I chose OHLC bars simply because that was the data I found easily.


Use case: aggregating 1 minute intervals of stock prices into 5 minute intervals
Input: 1 minute stock prices intervals in a MongoDB database
Simple Analysis: performed in:

  • Hive
  • Spark

Output: 5 minute stock prices intervals in Hadoop
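Before reaching for Hive or Spark, it helps to see the roll-up logic itself. Here is a minimal pure-Python sketch of the 1-minute-to-5-minute aggregation (no Hadoop involved); the field names follow the sample MongoDB document shown later in this post, and the function names are mine, not part of any connector API:

```python
from datetime import datetime
from itertools import groupby

def bucket_5min(ts):
    """Floor a 'YYYY-MM-DD HH:MM' timestamp string to its 5-minute boundary."""
    dt = datetime.strptime(ts, "%Y-%m-%d %H:%M")
    return dt.replace(minute=dt.minute - dt.minute % 5)

def rollup(minute_bars):
    """Aggregate 1-minute OHLC bars (sorted by Timestamp) into 5-minute bars.

    Within each 5-minute bucket: open = first bar's open, high = max of highs,
    low = min of lows, close = last bar's close, volume = sum of volumes.
    """
    out = []
    for bucket, group in groupby(minute_bars,
                                 key=lambda b: bucket_5min(b["Timestamp"])):
        bars = list(group)
        out.append({
            "Timestamp": bucket.strftime("%Y-%m-%d %H:%M"),
            "Open": bars[0]["Open"],
            "High": max(b["High"] for b in bars),
            "Low": min(b["Low"] for b in bars),
            "Close": bars[-1]["Close"],
            "Volume": sum(b["Volume"] for b in bars),
        })
    return out
```

The Hive and Spark examples in parts 2 and 3 express this same grouping, just distributed across the cluster instead of in a single process.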

Steps to Set Up the Environment

  • Set up Hadoop environment – Hadoop is fairly involved to set up but fortunately Cloudera makes VMs available with their distribution already installed, including both Hive and Spark. I downloaded Virtualbox (open source VM manager) onto my Mac laptop to run the Cloudera VMs from here.
  • Go through tutorials – I went through the tutorials included in the VM, which were pretty helpful, though at times I felt I was just cutting and pasting without really understanding what was happening, especially with Spark. One thing that is obvious from the tutorials is that the learning curve for "Hadoop" really means learning many products in the ecosystem (Sqoop, Avro, Hive, Flume, Spark, etc.). If I were only doing something this simple, there is no way I would use Hadoop given that learning curve, but of course some problems justify the effort.
  • Download sample data – I Googled for some sample pricing data and found these 1 minute bars from this site
  • Install MongoDB on the VM – it is really easy with yum on CentOS from this page
  • Start MongoDB – a default configuration file is installed by yum, so you can just run the following to start it on localhost and the default port 27017:
    mongod -f /etc/mongod.conf
  • Load sample data – mongoimport allows you to load CSV files directly as flat documents in MongoDB. The command is simply this:
    mongoimport equities-msft-minute-bars-2009.csv --type csv --headerline -d marketdata -c minbars
  • Install MongoDB Hadoop Connector – I ran through the steps at the link below to build for Cloudera 5 (CDH5). One note: by default my "git clone" put the mongo-hadoop files in /etc/profile.d, but your repository might be set up differently. Also, one addition to the install steps: set the path to mongoimport in build.gradle based on where you installed MongoDB. I used yum, and the path to the mongo tools was /usr/bin/. Install steps are here
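To see what the mongoimport step actually produces, here is a small Python sketch of the mapping it performs with `--type csv --headerline`: the first row supplies the field names, each subsequent row becomes one flat document, and numeric-looking values are stored as numbers. This is an illustration of the behavior, not mongoimport's actual implementation, and `csv_to_docs` is a name I made up:

```python
import csv
import io

def csv_to_docs(csv_text):
    """Approximate mongoimport --type csv --headerline: header row names
    the fields; each data row becomes one flat document."""
    docs = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        doc = {}
        for field, value in row.items():
            # Rough numeric-type inference: int, then float, else string.
            try:
                doc[field] = float(value) if "." in value else int(value)
            except ValueError:
                doc[field] = value
        docs.append(doc)
    return docs
```

Running this over one line of the sample file yields a document with the same shape as the one shown in the Mongo shell below (minus the `_id`, which MongoDB generates on insert).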

For the following examples, here is what a document looks like in the MongoDB collection (via the Mongo shell). You start the Mongo shell simply with the command “mongo” from the /bin directory of the MongoDB installation.

> use marketdata
> db.minbars.findOne()
{
	"_id" : ObjectId("54c00d1816526bc59d84b97c"),
	"Symbol" : "MSFT",
	"Timestamp" : "2009-08-24 09:30",
	"Day" : 24,
	"Open" : 24.41,
	"High" : 24.42,
	"Low" : 24.31,
	"Close" : 24.31,
	"Volume" : 683713
}

Posts #2 and #3 in this blog series show examples of Hive and Spark using this setup.


Want to learn more about Apache Spark and MongoDB? Read our white paper on turning analytics into real-time action.

Read Part 2 >>

About Matt Kalan

Matt Kalan is a Sr. Solution Architect at MongoDB, with extensive experience helping more than 300 customers in financial services and other industries solve business problems with technology. Before MongoDB, Matt grew Progress Software’s Apama Algorithmic Trading and Complex Event Processing (CEP) Platform business in North America and later sold broader operational intelligence solutions to FS firms. He previously worked for Caplin Systems selling solutions to stream real-time market data over the web to FX and FI portals, and for Sapient providing consulting services to global 2000 clients.