Data Lake

The term “data lake” was coined by James Dixon, then Chief Technology Officer of Pentaho, in a blog post dated October 14, 2010, written alongside a Pentaho Hadoop release announcement. Dixon offered an analogy that has stuck: “If you think of a datamart as a store of bottled water - cleansed and packaged and structured for easy consumption - the data lake is a large body of water in a more natural state.”

The point of the analogy is that a data mart, or a traditional data warehouse, packages data into a fixed structure decided in advance, suited to predetermined questions. A data lake instead keeps data in its raw, natural form so that many different users can later explore it, sample it, or structure it for questions that were not anticipated when the data was collected.

This is the difference between schema on write and schema on read. A warehouse imposes a schema when data is loaded; a data lake defers that step, letting structure be applied at query time according to each user’s needs. In practice, raw data could be written cheaply to distributed file systems such as HDFS, and later to cloud object stores, and then processed by engines such as Hadoop MapReduce and Apache Spark.
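
The contrast can be made concrete with a small sketch in plain Python. It illustrates the idea only and does not show how any particular warehouse or lake engine works; the sample events, the field names, and the helper functions `load_with_schema` and `read_with_schema` are all hypothetical.

```python
import json

# Hypothetical raw events, as they might arrive from a source system.
RAW_EVENTS = [
    '{"user": "ana", "amount": "12.50", "country": "PT"}',
    '{"user": "bo", "amount": "3", "device": "mobile"}',
    '{"user": "cy", "note": "no amount recorded"}',
]

# Schema on write: one schema, fixed in advance, enforced at load time.
WAREHOUSE_SCHEMA = {"user": str, "amount": float}


def load_with_schema(lines, schema):
    """Keep only records that fit the schema; fields outside it are dropped."""
    table, rejected = [], []
    for line in lines:
        record = json.loads(line)
        try:
            row = {name: cast(record[name]) for name, cast in schema.items()}
        except (KeyError, ValueError):
            # A missing or malformed field means the record never gets stored.
            rejected.append(line)
            continue
        table.append(row)
    return table, rejected


# Schema on read: lines are stored untouched; each reader brings its own schema.
def read_with_schema(raw_lines, schema, default=None):
    """Project raw records onto the reader's schema at query time."""
    for line in raw_lines:
        record = json.loads(line)
        row = {}
        for name, cast in schema.items():
            value = record.get(name)
            row[name] = cast(value) if value is not None else default
        yield row


warehouse_table, rejected = load_with_schema(RAW_EVENTS, WAREHOUSE_SCHEMA)

lake = list(RAW_EVENTS)  # the "lake" here is just an unmodified copy of the input

# A question nobody planned for at load time: which device did each user have?
device_view = list(read_with_schema(lake, {"user": str, "device": str}))
```

In this sketch the warehouse path discards the `device` field and rejects the third event outright, so the later question about devices cannot be answered from `warehouse_table`. The lake path can still answer it, at the cost of parsing and handling missing values on every read.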

The trade-off is that flexibility on intake shifts the work of cleaning and structuring to the moment of use, which is why data lakes are typically paired with strong cataloging and processing pipelines to avoid becoming unusable “data swamps.”