MongoDB Hadoop Connector Announced
10gen is pleased to announce the availability of our first GA release of the MongoDB Hadoop Connector, version 1.0. This release was a long-term goal, and represents the culmination of over a year of work to bring our users a solid integration layer between their MongoDB deployments and Hadoop clusters for data processing. Available immediately, the connector supports many of the major Hadoop versions and distributions from 0.20.x onwards.
The core feature of the Connector is the ability to read MongoDB data into Hadoop MapReduce jobs and to write the results of MapReduce jobs back out to MongoDB. Users may choose to use MongoDB reads and writes together or separately, as best fits each use case. Our goal is to continue to build support for the components of the Hadoop ecosystem that our users find useful, based on feedback and requests.
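To make that read/write flow concrete, here is a minimal sketch of a MapReduce job wired to the connector. The MongoInputFormat and MongoOutputFormat classes and the mongo.input.uri / mongo.output.uri configuration keys are the connector's, as we understand the 1.0 release; everything else (the MongoWordCount class, the TokenMapper and SumReducer names, the "message" field, and the demo database and collection names) is an illustrative assumption rather than code shipped with the release, and exact key/value type handling on the output side may differ in your version.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;

public class MongoWordCount {

    // Each input record is one MongoDB document: the key is its _id, the value a BSONObject.
    public static class TokenMapper extends Mapper<Object, BSONObject, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, BSONObject value, Context context)
                throws IOException, InterruptedException {
            // "message" is a hypothetical field name used only for this sketch.
            Object message = value.get("message");
            if (message == null) return;
            StringTokenizer tokens = new StringTokenizer(message.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // Each reduce output record becomes a document written back to MongoDB.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) sum += v.get();
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Read input documents from this collection...
        conf.set("mongo.input.uri", "mongodb://localhost:27017/demo.messages");
        // ...and write the reduce output to this one. Either side can be used on its own,
        // with HDFS handling the other end of the job.
        conf.set("mongo.output.uri", "mongodb://localhost:27017/demo.word_counts");

        Job job = new Job(conf, "mongo word count");
        job.setJarByClass(MongoWordCount.class);
        job.setMapperClass(TokenMapper.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // The connector's formats stand in for HDFS paths on the input and output side.
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```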
For this initial release, we have also provided support for:

- writing to MongoDB from Pig (thanks to Russell Jurney for all of his patches and improvements to this feature)
- writing to MongoDB from the Flume distributed logging system
- writing MapReduce jobs in Python that read from and write to MongoDB, via Hadoop Streaming
Hadoop Streaming was one of the toughest features for the 10gen team to build. With that in mind, look for a more technical post on the MongoDB blog in the next week or two detailing the issues we encountered and how to utilize this feature effectively.
This release involved hard work from both the 10gen team and our community. Testing, pull requests, ideas sent by email, and support tickets have all contributed to moving this product forward. One of the most important contributions came from a team of students participating in a New York University class, Information Technology Projects, which is designed to have students apply their skills to real-world projects. Under the guidance of Professor Evan Korth, four students worked closely with 10gen to test and improve the functionality of the Hadoop Connector. Joseph Shraibman, Sumin Xia, Priya Manda, and Rushin Shah all worked to improve support for splitting up MongoDB input data, and added a number of testing improvements and consistency checks.
Thanks to the work done by the NYU team, as well as improvements to the MongoDB server, the MongoDB Hadoop Connector can efficiently split input data in a variety of situations - both sharded and unsharded setups - so that Hadoop input is parallelized for maximum performance.
In the next few months we will be adding features and improvements to the Hadoop Connector, including Ruby support for Streaming, Pig input support, and support for reading and writing MongoDB backup files for offline batch processing. As with all of our MongoDB projects, you can always monitor the roadmap, request features, and report bugs via the MongoDB Jira, and let us know on the MongoDB User Forum if you have any questions.
April 10, 2012