Apache Hive

Apache Hive is a data warehouse system for querying large data sets stored in Hadoop. Its project page describes Hive as “a distributed, fault-tolerant data warehouse system that enables analytics at a massive scale,” built around a “SQL-First Approach” with a “familiar SQL interface.” The goal is to let people work with big data using standard SQL rather than a specialized programming model.

Hive sits on top of Hadoop. The project describes Hive as “built on top of Apache Hadoop,” and its documentation states that Hive “facilitates reading, writing, and managing large datasets residing in distributed storage and queried using SQL syntax.” In practice that means Hive reads files from HDFS and other distributed storage and presents them to users as tables that can be queried.
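The idea of presenting stored files as a table can be sketched in a few lines of Python. This is only a conceptual illustration, not Hive's implementation: a local temporary directory stands in for distributed storage, and the schema, the file names, and the `read_table` and `demo` helpers are all hypothetical names chosen for the example.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical schema: column names and types are applied when the files
# are read, not when they are written.
SCHEMA = [("user_id", int), ("page", str)]


def read_table(directory, schema=SCHEMA):
    """Yield typed rows from every delimited file found in a directory."""
    for path in sorted(Path(directory).iterdir()):
        if not path.is_file():
            continue
        with path.open(newline="") as handle:
            for fields in csv.reader(handle):
                # Pair each raw field with its column name and cast it.
                yield {name: cast(value)
                       for (name, cast), value in zip(schema, fields)}


def demo():
    """Write two small files, then read them back as one logical table."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "part-00000").write_text("1,/home\n2,/docs\n")
        Path(tmp, "part-00001").write_text("3,/home\n")
        return list(read_table(tmp))


if __name__ == "__main__":
    for row in demo():
        print(row)
```

The files themselves carry no schema. The table exists only as a description laid over the directory, which is roughly the relationship between a table definition and the underlying files that the paragraph above describes.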

Crucially, Hive compiles SQL queries into lower-level distributed jobs, so analysts do not have to write those jobs by hand. The documentation notes that Hive originally compiled queries into MapReduce jobs and now also supports engines such as Apache Tez, while providing “standard SQL functionality, including many of the later SQL:2003, SQL:2011, and SQL:2016 features for analytics.” This SQL layer made Hadoop accessible to the large population of people who already knew SQL.
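To make the translation concrete, the sketch below shows how a simple aggregate such as SELECT page, COUNT(*) FROM page_views GROUP BY page could be expressed as map, shuffle, and reduce phases. It is a single-process illustration of the MapReduce pattern, not Hive's actual query compiler. The table name, the sample rows, and all function names are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical sample rows standing in for a table named page_views.
ROWS = [
    {"user": "a", "page": "/home"},
    {"user": "b", "page": "/docs"},
    {"user": "a", "page": "/home"},
    {"user": "c", "page": "/home"},
]


def map_phase(rows):
    """Emit one (key, 1) pair per row, keyed by the GROUP BY column."""
    for row in rows:
        yield row["page"], 1


def shuffle_phase(pairs):
    """Collect emitted values by key, as happens between map and reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Sum the values for each key to produce the COUNT(*) result."""
    return {key: sum(values) for key, values in grouped.items()}


def count_by_page(rows):
    """Run the three phases in sequence, in one process."""
    return reduce_phase(shuffle_phase(map_phase(rows)))


if __name__ == "__main__":
    print(count_by_page(ROWS))
```

In a real cluster the map and reduce phases run in parallel across many machines and the shuffle moves data over the network. The point of the sketch is only that one line of SQL stands in for all three hand-written steps.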

Hive was originally created at Facebook to support reporting and ad-hoc analysis over very large warehouses, then released as open source through the Apache Software Foundation. By offering a SQL layer over Hadoop, it became one of the most widely used tools for analytics on big data and a model for many later SQL-on-Hadoop systems.
