Apache Kafka

Apache Kafka, on its own project site, describes itself as “an event streaming platform” that combines three capabilities: publishing and subscribing to streams of events, storing those streams durably, and processing them as they arrive or retrospectively. The unit of organization is the topic, which the documentation compares to a folder in a filesystem, with the individual events being the files inside it. A defining property is that events are not deleted after consumption. Each topic keeps its events for as long as its retention settings specify, so the same data can be read and replayed many times within that window.
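The retain-and-replay behavior can be illustrated with a small in-memory sketch. This is not Kafka's API: the `Topic` and `Reader` classes and their methods are hypothetical names invented for the illustration, and retention limits are left out entirely. It shows only that reading leaves the stored events untouched, so a reader can rewind and see the same data again.

```python
from dataclasses import dataclass, field


@dataclass
class Topic:
    """Toy stand-in for a topic: an append-only list of events (hypothetical)."""
    name: str
    events: list = field(default_factory=list)

    def append(self, event):
        """Store the event and return its offset (its position in the log)."""
        self.events.append(event)
        return len(self.events) - 1

    def read(self, offset, max_events=10):
        """Return events starting at offset; reading never removes anything."""
        return self.events[offset:offset + max_events]


@dataclass
class Reader:
    """Hypothetical consumer that tracks only its own position in the topic."""
    topic: Topic
    offset: int = 0

    def poll(self, max_events=10):
        """Fetch the next batch and advance this reader's offset past it."""
        batch = self.topic.read(self.offset, max_events)
        self.offset += len(batch)
        return batch

    def rewind(self, offset=0):
        """Move back to an earlier offset so the same events can be replayed."""
        self.offset = offset


topic = Topic("payments")
for amount in (10, 25, 40):
    topic.append({"amount": amount})

reader = Reader(topic)
first_pass = reader.poll()
reader.rewind()
second_pass = reader.poll()  # same events again; nothing was consumed away
```

Because each reader holds its own offset, a second `Reader` on the same `Topic` would start from the beginning without affecting the first.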

The original design appeared in the paper “Kafka: a Distributed Messaging System for Log Processing” by Jay Kreps, Neha Narkhede, and Jun Rao, presented at the NetDB workshop in 2011. The paper, hosted by Microsoft Research and linked from Kafka’s own community page, frames Kafka as a system built to collect and deliver high volumes of log data with low latency, drawing on ideas from both traditional messaging systems and log aggregators, and supporting both offline and online consumption of the same streams.

A central architectural idea is decoupling. As the intro page puts it, “producers and consumers are fully decoupled and agnostic of each other,” which the project calls a key design element behind Kafka’s scalability. Producers write events to topics; consumers subscribe and read them independently, at their own pace. Topics are partitioned across brokers, and events sharing the same key land in the same partition. Kafka guarantees ordering only within a partition, so events for a given key are read in the order they were written, while separate partitions can be spread across brokers for horizontal scale.
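The key-to-partition rule can be sketched in a few lines. The code below is an illustration under stated assumptions, not Kafka's implementation: it uses CRC32 as a stand-in hash (Kafka's default partitioner for keyed records uses a murmur2 hash of the key bytes), and `partition_for`, `produce`, and the partition count of 3 are hypothetical names and values chosen for the example.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this sketch


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition with a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def produce(partitions, key, value):
    """Append (key, value) to the partition selected by the key's hash."""
    index = partition_for(key, len(partitions))
    partitions[index].append((key, value))
    return index


# Each partition is modeled as an append-only list.
partitions = [[] for _ in range(NUM_PARTITIONS)]
events = [
    ("alice", "login"),
    ("bob", "login"),
    ("alice", "purchase"),
    ("bob", "logout"),
    ("alice", "logout"),
]
for key, value in events:
    produce(partitions, key, value)

# All of alice's events sit in one partition, in the order they were produced.
alice_partition = partitions[partition_for("alice")]
alice_events = [value for key, value in alice_partition if key == "alice"]
```

Events for different keys may share a partition or land in different ones. Only the relative order of events within a single partition is preserved, which is why the key choice determines which events are ordered with respect to each other.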

Kafka began at LinkedIn and became a top-level Apache Software Foundation project. Over time it grew from a high-throughput message bus into a general-purpose substrate for real-time data integration, commonly deployed alongside processing engines such as Apache Spark and Apache Flink in modern data architectures.