Apache HBase

Apache HBase is an open-source database for very large tables that runs on top of Hadoop. Its project page describes HBase as “a distributed, scalable, big data store for random, realtime read/write access” and as an “open-source, distributed, and scalable big data store modeled after Google Bigtable.” Where Hadoop’s MapReduce is built for batch jobs that scan whole data sets, HBase is built for reading and writing individual rows quickly.

The project page makes the Bigtable lineage explicit, saying HBase offers “Bigtable-like capabilities on top of Hadoop and HDFS with automatic failover and sharding,” and supports “random, realtime read/write access with strictly consistent operations.” It is a wide-column store: data is organized into tables, rows, and column families rather than the rows-and-fixed-columns model of a traditional relational database.
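The wide-column model can be made concrete with a minimal in-memory sketch in Python. This is not the HBase client API: the class ToyTable and its put and get methods are hypothetical names invented for illustration. The sketch assumes that a table declares its column families up front and that each row may hold a different set of named columns (qualifiers) within a family.

```python
class ToyTable:
    """Toy in-memory model of a wide-column table (illustration only, not HBase's API)."""

    def __init__(self, name, families):
        self.name = name
        # Assumption: column families are fixed when the table is created.
        self.families = frozenset(families)
        # row key -> {family -> {qualifier -> value}}
        self.rows = {}

    def put(self, row_key, family, qualifier, value):
        """Store one cell; the qualifier can be any name, the family cannot."""
        if family not in self.families:
            raise KeyError(f"unknown column family: {family}")
        row = self.rows.setdefault(row_key, {})
        row.setdefault(family, {})[qualifier] = value

    def get(self, row_key, family=None):
        """Return a copy of one row, or of one family within that row."""
        row = self.rows.get(row_key, {})
        if family is None:
            return {fam: dict(cols) for fam, cols in row.items()}
        return dict(row.get(family, {}))


users = ToyTable("users", families=["info", "visits"])
users.put("alice", "info", "email", "alice@example.org")
users.put("alice", "visits", "2024-01-05", "3")
users.put("bob", "info", "email", "bob@example.org")
users.put("bob", "info", "city", "Lisbon")  # bob has a column alice lacks

print(users.get("alice"))
print(users.get("bob", "info"))
```

In this sketch the two rows carry different columns inside the same family, which is the flexibility a fixed relational schema does not offer. A real HBase table also versions its cells and distributes rows across servers; neither is modeled here.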

HBase stores its data in HDFS, Hadoop’s distributed file system, so it inherits HDFS’s ability to spread huge data sets across a cluster and survive machine failures. The HBase reference guide confirms this layering: it describes an “hbase.rootdir” setting that points at an HDFS instance, and it ties each HBase release to specific Hadoop versions. On top of that storage, HBase adds the indexing and serving machinery that turns bulk distributed storage into a database with fast point lookups.
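The gap between bulk storage and fast point lookups can be illustrated with a small Python sketch. It is a simplified model, not HBase's actual storage code: it assumes only that records are kept sorted by row key, so one row can be found by binary search instead of by examining every record. The function names scan_lookup and indexed_lookup are hypothetical.

```python
from bisect import bisect_left

# Toy data set: (row_key, value) pairs kept sorted by row key.
records = sorted((f"row{i:05d}", f"value-{i}") for i in range(10_000))
# The "index" in this sketch is simply the sorted list of row keys.
keys = [key for key, _ in records]


def scan_lookup(row_key):
    """Batch-style access: examine records one by one until the key is found."""
    examined = 0
    for key, value in records:
        examined += 1
        if key == row_key:
            return value, examined
    return None, examined


def indexed_lookup(row_key):
    """Point lookup: binary search over the sorted row keys."""
    pos = bisect_left(keys, row_key)
    if pos < len(keys) and keys[pos] == row_key:
        return records[pos][1]
    return None


value, examined = scan_lookup("row09999")
print(value, examined)
print(indexed_lookup("row09999"))
print(indexed_lookup("missing"))
```

Looking up the last key by scanning examines all 10,000 records, while the binary search needs only about a dozen comparisons. The sketch shows why an ordered index matters for single-row access; it says nothing about how HBase lays out or caches its files.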

HBase is an open-source implementation of the ideas described in Google’s Bigtable paper, just as Hadoop’s MapReduce and HDFS are open-source counterparts of Google’s MapReduce and the Google File System. Together these projects gave organizations a freely available stack for both batch analytics and real-time access over data sets far too large for a single machine.
