August 18, 2013 by MongoDB
The MongoDB engineering team has recently made a series of significant updates to the MongoDB Connector for Hadoop. This makes it easier for Hadoop users to integrate real-time data from MongoDB – the most popular database for big data systems – with Hadoop for deep, offline analytics. The Connector exposes the analytical power of Hadoop's MapReduce to live application data from MongoDB, driving value from big data faster and more efficiently.
The Connector presents MongoDB as a Hadoop-compatible file system, allowing a MapReduce job to read data from MongoDB directly without first copying that data to HDFS. This eliminates the need to move terabytes of data across the network. MapReduce jobs can pass queries as filters, which avoids scanning entire collections, and can also take advantage of MongoDB’s rich indexing capabilities, including geospatial, text-search, array, compound and sparse indexes.
Hadoop jobs can also write their results back to MongoDB, where they support real-time operational processes and ad-hoc querying.
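To make the read and write paths above concrete, here is a minimal Python sketch that assembles the `-D` settings for a job that reads a filtered collection and writes back to MongoDB. The property names (`mongo.input.uri`, `mongo.input.query`, `mongo.output.uri`, `mongo.job.input.format`, `mongo.job.output.format`) are the ones we understand the Connector to use, so check them against the documentation for your version. The URIs, the jar name `my-job.jar` and the class `com.example.DailyTotals` are hypothetical placeholders.

```python
import json
import shlex


def build_job_args(input_uri, output_uri, query):
    """Return Hadoop -D arguments for a MongoDB-to-MongoDB job (sketch)."""
    props = {
        # Assumed Connector property names; verify for your version.
        "mongo.job.input.format": "com.mongodb.hadoop.MongoInputFormat",
        "mongo.job.output.format": "com.mongodb.hadoop.MongoOutputFormat",
        "mongo.input.uri": input_uri,
        "mongo.output.uri": output_uri,
        # The filter is passed as a JSON query document.
        "mongo.input.query": json.dumps(query, sort_keys=True),
    }
    args = []
    for key, value in props.items():
        args += ["-D", f"{key}={value}"]
    return args


# Hypothetical databases and collections, for illustration only.
args = build_job_args(
    "mongodb://localhost:27017/shop.orders",
    "mongodb://localhost:27017/shop.daily_totals",
    {"status": "shipped"},
)
# Hypothetical jar and driver class.
command = ["hadoop", "jar", "my-job.jar", "com.example.DailyTotals"] + args
print(" ".join(shlex.quote(part) for part in command))
```

Because the query travels with the job configuration, only matching documents need to be read, and an index on the filtered field (here `status`) can serve the read.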
Version 1.1 of the Connector adds support for MongoDB’s native BSON (Binary JSON) backup files. These files can be stored in HDFS, co-located with TaskTrackers so that Hadoop can process them directly, or kept on local or cloud-based file systems such as Amazon S3.
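A BSON backup file is a sequence of BSON documents written back to back, each starting with a little-endian 32-bit length that counts the whole document. That makes the files straightforward to cut into records. The Python sketch below illustrates the idea using only the standard library. It is not the Connector's implementation, and the two sample documents are built by hand purely for illustration.

```python
import io
import struct


def iter_bson_documents(stream):
    """Yield each raw BSON document (as bytes) from a backup-style stream."""
    while True:
        header = stream.read(4)
        if not header:
            return  # clean end of file
        if len(header) < 4:
            raise ValueError("truncated length prefix")
        # The length prefix includes its own 4 bytes and the trailing NUL.
        (length,) = struct.unpack("<i", header)
        if length < 5:
            raise ValueError(f"invalid document length {length}")
        body = stream.read(length - 4)
        if len(body) != length - 4:
            raise ValueError("truncated document")
        yield header + body


# Hand-built BSON for {"a": 1} and {"b": 2}: length, int32 element, terminator.
doc_a = b"\x0c\x00\x00\x00" + b"\x10a\x00" + b"\x01\x00\x00\x00" + b"\x00"
doc_b = b"\x0c\x00\x00\x00" + b"\x10b\x00" + b"\x02\x00\x00\x00" + b"\x00"

docs = list(iter_bson_documents(io.BytesIO(doc_a + doc_b)))
print(len(docs))
```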
In addition to existing MapReduce, Pig, Hadoop Streaming (with node.js, Python or Ruby) and Flume support, the new MongoDB Hadoop connector enables SQL-like queries from Apache Hive to be run across MongoDB data sets. The latest version of the Connector enables Hive to access BSON files, with full support for MongoDB collections scheduled for the next release of the Connector later this year.
MongoUpdateWritable is another new feature of the Connector. It allows Hadoop to modify an existing output collection in MongoDB, rather than only writing to new collections. As a result, users can run incremental MapReduce jobs, for example to aggregate trends or perform pattern matching on a daily basis, and the results can then be queried efficiently in a single MongoDB collection.
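The effect of an incremental job can be illustrated with a small in-memory simulation in Python. It does not call the Connector or MongoDB. It only mimics an upsert that uses the `$inc` and `$set` update operators, which is the kind of update a job could emit for each key. The helper `apply_update`, the `trends` collection and the sample counts are all hypothetical.

```python
def apply_update(collection, doc_id, modifiers, upsert=True):
    """Apply $inc/$set modifiers to one document in a dict-based collection."""
    doc = collection.get(doc_id)
    if doc is None:
        if not upsert:
            return
        # Upsert: create the document the first time the key is seen.
        doc = {"_id": doc_id}
        collection[doc_id] = doc
    for field, amount in modifiers.get("$inc", {}).items():
        doc[field] = doc.get(field, 0) + amount
    for field, value in modifiers.get("$set", {}).items():
        doc[field] = value


# Hypothetical per-day outputs of a daily MapReduce run.
daily_runs = [
    ("2013-08-17", [("mongodb", 3), ("hadoop", 2)]),
    ("2013-08-18", [("mongodb", 4)]),
]

trends = {}  # stands in for the existing output collection
for day, counts in daily_runs:
    for term, mentions in counts:
        apply_update(
            trends,
            term,
            {"$inc": {"mentions": mentions}, "$set": {"last_seen": day}},
        )

print(trends["mongodb"]["mentions"])
```

Each run adds to the totals already stored instead of writing a separate collection per day, so the running totals for every term sit in one place.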
The MongoDB Connector for Hadoop works by:
Mike O’Brien, MongoDB software engineer and maintainer of the MongoDB Connector for Hadoop, demonstrated its new features in a recent webinar, which is now available for viewing on demand.
Following on from Mike’s webinar, we will also host a new session on Wednesday 21st August exploring the big data use cases of MongoDB and Hadoop, and the value of integrating them to create a big data pipeline.
In summary, the MongoDB Connector for Hadoop adds to the broadest set of query and data analysis capabilities of any NoSQL database including:
Review the documentation, including details on how to get started and sample code.
If you have any questions, email the mongodb-user mailing list.
We’d also love to hear how you use the Connector to bring together MongoDB and Hadoop – feel free to comment below.