by Mike O’Brien, MongoDB Kernel Tools Lead and maintainer of Mongo-Hadoop, the Hadoop Adapter for MongoDB
Hadoop is a powerful, JVM-based platform for running Map/Reduce jobs on clusters of many machines, and it excels at doing analytics and processing tasks on very large data sets.
Since MongoDB excels at storing large operational data sets for applications, it makes sense to explore using these together - MongoDB for storage and querying, and Hadoop for batch processing.
The MongoDB Connector for Hadoop
We recently released the 1.1 release of the MongoDB Connector for Hadoop. The MongoDB Connector for Hadoop makes it easy to use Mongo databases, or MongoDB backup files in .bson format, as the input source or output destination for Hadoop Map/Reduce jobs. By inspecting the data and computing input splits, Hadoop can process the data in parallel so that very large datasets can be processed quickly.
The MongoDB Connector for Hadoop also includes support for Pig and Hive, which allow very sophisticated MapReduce workflows to be executed just by writing very simple scripts.
- Pig is a high-level scripting language for data analysis and building map/reduce workflows
- Hive is a SQL-like language for ad-hoc queries and analysis of data sets on Hadoop-compatible file systems.
Hadoop streaming is also supported, so map/reduce functions can be written in any language besides Java. Right now the MongoDB Connector for Hadoop supports streaming in Ruby, Node.js and Python.
How it Works
How the Hadoop connector works
- The adapter examines the MongoDB Collection and calculates a set of splits from the data
- Each of the splits gets assigned to a node in Hadoop cluster
- In parallel, Hadoop nodes pull data for their splits from MongoDB (or BSON) and process them locally
- Hadoop merges results and streams output back to MongoDB or BSON
I’ll be giving an hour-long webinar on What’s New with the Mongo-Hadoop integration. The webinar will cover
The webinar will be offered twice on August 8:
Register for the Webinar on August 8
Update: Watch the webinar recording