Apache Hadoop

Apache Hadoop is an open-source framework for distributed computing. Its own project page describes it as software “for reliable, scalable, distributed computing” that enables “the distributed processing of large data sets across clusters of computers using simple programming models.” Rather than relying on a few expensive high-end servers, Hadoop is designed to scale “from single servers to thousands of machines,” each offering local computation and storage.

A defining design choice is how Hadoop handles failure. The project page states that the library is “designed to detect and handle failures at the application layer,” so that high availability is delivered in software on top of clusters of ordinary, failure-prone machines rather than depending on expensive fault-tolerant hardware. This made it practical to build very large clusters cheaply.

Hadoop is organized into modules. The project lists Hadoop Common (shared utilities), the Hadoop Distributed File System (HDFS) for high-throughput storage, Hadoop YARN for job scheduling and cluster resource management, and Hadoop MapReduce, a YARN-based system for parallel processing of large data sets. Together these provide both a place to store huge files and a way to compute over them.

Hadoop grew out of work by Doug Cutting and Mike Cafarella, who reimplemented ideas from Google’s MapReduce and Google File System papers as open source. By making distributed storage and batch processing freely available, Hadoop became foundational infrastructure for the big-data era and the base layer for an entire ecosystem of tools, including Hive and HBase.

Sources

Related