Hadoop is a software technology designed for storing and processing large volumes of data distributed across a cluster of commodity servers and commodity storage. Hadoop was initially inspired by papers published by Google outlining its approach to handling large volumes of data as it indexed the Web. With growing adoption across industry and government, Hadoop has rapidly evolved to become an adjunct to – and in some cases a replacement for – the traditional Enterprise Data Warehouse.
Many organizations are harnessing the power of Hadoop and MongoDB together to create complete big data applications.
Before exploring how users create this type of big data application, let's first dig into the architecture of Hadoop.
Hadoop is an open-source Apache project started in 2005 by engineers at Yahoo, based on Google's earlier research papers. At that time, Hadoop consisted of a distributed file system, called HDFS, and a data processing and execution model called MapReduce.
The base Apache Hadoop framework consists of the following modules:
In addition to these base modules, the term "Hadoop" has evolved to also include a host of tools that can be installed on top of or alongside Hadoop to simplify access to, and processing of, data stored in the Hadoop cluster:
Building on the Apache Hadoop project, a number of companies have built commercial Hadoop distributions. Leading providers include MongoDB partners Cloudera, Hortonworks and MapR. All have certified the MongoDB Connector for Hadoop with their respective distributions.
Applications submit work to Hadoop as jobs. Jobs are submitted to a Master Node in the Hadoop cluster, to a centralized process called the JobTracker. One notable aspect of Hadoop’s design is that processing is moved to the data rather than data being moved to the processing. Accordingly, the JobTracker compiles jobs into parallel tasks that are distributed across the copies of data stored in HDFS. The JobTracker maintains the state of tasks and coordinates the result of the job from across the nodes in the cluster.
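The map-and-reduce execution model described above can be illustrated with a short sketch. This is plain Python, not Hadoop's actual Java API; it shows the three conceptual phases the JobTracker orchestrates – map tasks run in parallel against data splits, intermediate results are shuffled (grouped by key), and reduce tasks aggregate each group – using a word-count job as the canonical example.

```python
from collections import defaultdict

def map_phase(document):
    # A map task emits (key, value) pairs for its input split;
    # for word count, each word maps to a count of 1.
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(pairs):
    # Between map and reduce, the framework groups intermediate
    # values by key so each reduce task sees one key's values.
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped

def reduce_phase(grouped):
    # A reduce task aggregates all values emitted for each key.
    return {key: sum(values) for key, values in grouped.items()}

# Each document stands in for an HDFS data split processed by one map task.
documents = ["the quick brown fox", "the lazy dog", "the fox"]
pairs = [pair for doc in documents for pair in map_phase(doc)]
counts = reduce_phase(shuffle(pairs))
```

In a real cluster the map calls run concurrently on the nodes holding the data splits, and the shuffle moves intermediate pairs across the network; the sequential version above preserves only the logical dataflow.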
Hadoop determines how best to distribute work across resources in the cluster, and how to deal with potential failures in system components should they arise. A natural property of the system is that work tends to be uniformly distributed – Hadoop maintains multiple copies of the data on different nodes, and each node holding a copy of the data requests work to perform based on its own availability. Nodes with more spare capacity tend to request more work.
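This pull-based assignment can be sketched with a toy simulation. The node names and slot counts below are purely illustrative assumptions, not Hadoop configuration: each node requests one task per free slot per round, so nodes with more capacity naturally end up with proportionally more of the work.

```python
from collections import Counter

def run_jobs(num_tasks, node_slots):
    # node_slots: node name -> number of concurrent task slots.
    # Simulates nodes pulling tasks from a shared queue: each round,
    # every node requests as many tasks as it has free slots.
    assigned = Counter()
    pending = num_tasks
    while pending > 0:
        for node, slots in node_slots.items():
            take = min(slots, pending)
            assigned[node] += take
            pending -= take
    return assigned

# Three hypothetical nodes with 1, 2, and 3 task slots respectively.
assignment = run_jobs(90, {"node-a": 1, "node-b": 2, "node-c": 3})
```

Because node-c has three times the capacity of node-a, it ends up processing three times as many tasks, mirroring how Hadoop's scheduling lets available capacity drive work distribution.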
There are several architectural properties of Hadoop that help to determine the types of applications suitable for the system:
Rather than supporting real-time applications, Hadoop lends itself to almost any sort of computation that is very data-intensive, benefits from parallel processing, and is batch-oriented or interactive (i.e., 30-second to 10-minute response times). Organizations typically use Hadoop for sophisticated, read-only analytics or high-volume data storage applications such as:
The following table provides examples of customers using MongoDB together with Hadoop to power big data applications.
| Organization | MongoDB Use | Hadoop Use |
|---|---|---|
| eBay | User data and metadata management for product catalog | User analysis for personalized search & recommendations |
| Orbitz | Management of hotel data and pricing | Hotel segmentation to support building search facets |
| Pearson | Student identity and access control; content management of course materials | Student analytics to create adaptive learning programs |
| Foursquare | User data, check-ins, reviews, venue content management | User analysis, segmentation and personalization |
| Tier 1 Investment Bank | Tick data, quants analysis, reference data distribution | Risk modeling, security and fraud detection |
| Industrial Machinery Manufacturer | Storage and real-time analytics of sensor data collected from connected vehicles | Preventive maintenance programs for fleet optimization; in-field monitoring of vehicle components for design enhancements |
| SFR | Customer service applications accessed via online portals and call centers | Analysis of customer usage, devices & pricing to optimize plans |
Whether improving customer service, supporting cross-sell and upsell, enhancing business efficiency or reducing risk, MongoDB and Hadoop provide the foundation to operationalize big data.