Data Warehouse

A data warehouse is a database built to hold integrated, historical data drawn together from many operational source systems so that it can be analyzed and reported on. Unlike a transactional database, which is tuned for processing many small reads and writes as business happens, a warehouse is tuned for large aggregate queries that scan history to answer questions about the business. It is the home of business intelligence and reporting.

The modern discipline of data warehousing was shaped in the 1990s by two influential practitioners. Ralph Kimball and the Kimball Group promoted dimensional modeling and an incremental delivery method; the Kimball Group describes the goal as to “iteratively develop the DW/BI environment in manageable lifecycle increments rather than attempting a galactic Big Bang approach.” Bill Inmon, separately, advocated a more centralized, normalized enterprise warehouse. The two schools of thought defined much of the field’s vocabulary.

One reason warehouses exist as a separate system is that analytical and transactional workloads conflict. Microsoft’s architecture guidance notes that running aggregate analytics directly against a transactional system is “very resource-intensive” and can “cause a slow-down by blocking other transactions,” so the common solution is to offload historical data “to other systems, such as a data mart or data warehouse.” Separating the two lets each be designed for its own job.

A warehouse is typically loaded on a schedule from source systems through an extract-transform-load pipeline, then queried by analysts and reporting tools. Cloud warehouses such as Snowflake, Amazon Redshift, and Google BigQuery have since carried the same idea onto elastic, pay-as-you-go infrastructure, often using columnar storage to make large scans fast.

Sources

Related