Columnar Storage

Columnar storage means laying out a table’s values on disk grouped by column rather than by row. A traditional row store keeps all the fields of one record next to each other, which is ideal when you read or write whole rows. A column store instead keeps all the values of one column together, which is ideal for analytical queries that touch only a few columns but scan many rows.
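As a minimal sketch of the two layouts, the Python below holds the same small table both ways in memory. The table, its column names (`id`, `country`, `amount`), and the helper functions are hypothetical and chosen only for illustration; a real engine lays out bytes on disk rather than Python lists, but the access pattern is the same.

```python
# Hypothetical three-row table; column names and values are made up.
rows = [
    {"id": 1, "country": "DE", "amount": 10.0},
    {"id": 2, "country": "FR", "amount": 25.5},
    {"id": 3, "country": "DE", "amount": 7.25},
]

# Row-oriented layout: all fields of one record sit next to each other.
row_store = [(r["id"], r["country"], r["amount"]) for r in rows]

# Column-oriented layout: all values of one column sit next to each other.
column_store = {name: [r[name] for r in rows] for name in ("id", "country", "amount")}


def total_amount_row_store(store):
    # Walks every full record even though only the third field is needed.
    return sum(record[2] for record in store)


def total_amount_column_store(store):
    # Touches only the "amount" column; "id" and "country" are never read.
    return sum(store["amount"])


print(total_amount_row_store(row_store))
print(total_amount_column_store(column_store))
```

Both functions return the same total. The difference is how much data each has to pass over: the row version visits every field of every record, while the column version reads a single contiguous list.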

The approach was crystallized for databases by the C-Store research project. The paper “C-Store: A Column-oriented DBMS” (Stonebraker et al., presented at VLDB 2005) argued that for read-mostly analytical workloads, organizing storage around columns rather than rows yields large performance gains, because a query reads only the columns it references and because values within a single column are similar and compress well.

Because each column holds values of a single type, often with many repeats, column stores typically compress much better than row stores, which reduces the amount of data that must be read from disk and moved through memory. Together with reading only the columns a query references, this is a large part of why columnar formats became the backbone of modern analytics. Apache Parquet, an open-source file format, describes itself as “an open source, column-oriented data file format designed for efficient data storage and retrieval.”
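To illustrate why repetition within a column helps, here is a small sketch of two simple encodings, run-length encoding and dictionary encoding, applied to a single column. The function names and the sample `country` column are assumptions made for this example; production formats use more elaborate variants and often combine several encodings.

```python
from itertools import groupby


def run_length_encode(values):
    # Collapse each run of equal adjacent values into a (value, count) pair.
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(runs):
    # Expand (value, count) pairs back into the original sequence.
    return [value for value, count in runs for _ in range(count)]


def dictionary_encode(values):
    # Replace each distinct value with a small integer code.
    dictionary = []  # code -> value, in order of first appearance
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


# Hypothetical column with few distinct values and long runs.
country = ["DE", "DE", "DE", "FR", "FR", "DE"]

runs = run_length_encode(country)
dictionary, codes = dictionary_encode(country)

print(runs)
print(dictionary, codes)
```

In a row layout the `country` values would be interleaved with the other fields of each record, so runs like these would rarely be adjacent and neither encoding would pay off as well.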

Columnar layouts now underpin a wide range of analytical systems, from on-disk file formats like Parquet to cloud data warehouses such as Amazon Redshift, Snowflake, and Google BigQuery. The trade-off is that column stores are poorly suited to transactional workloads that insert and update individual rows, because writing one row means touching the separate storage of every column. For this reason, columnar and row-oriented designs tend to serve different workloads.
