Apache Avro

Apache Avro is a data serialization system that pairs a compact binary encoding with an explicit, JSON-defined schema. Where many formats embed field names or type tags inside every record, Avro keeps the schema separate and uses it to read and write data, so the serialized bytes carry almost no structural overhead. The result is a small, fast wire and file format well suited to high-volume data pipelines, message queues, and long-term storage of structured records.

Avro arose from the Apache Hadoop ecosystem in 2009, created in large part to give Hadoop a language-neutral serialization format that did not depend on Java’s native serialization and that could evolve over time. It defines primitive types such as int, long, string, and bytes, and complex types including records, enums, arrays, maps, unions, and fixed-length data. Schemas are written as JSON, which makes them easy to inspect, store, and transmit alongside the data they describe.

The Avro specification is published in the project documentation and states plainly that it “is intended to be the authoritative specification.” It describes how schemas are declared, how values are encoded in both a compact binary form and a JSON form, and how data is grouped into object container files for persistent storage. Binary integers, for example, use variable-length zig-zag coding so that small numbers occupy few bytes, and records are encoded simply by concatenating their field values in schema order.

A defining strength of Avro is schema evolution. Because both the schema used to write the data and the schema used to read it are available, Avro can resolve differences between them by matching on field names rather than positional identifiers. This lets producers and consumers change independently: fields can be added with defaults, removed, or reordered, and old data remains readable by new code and vice versa, without regenerating or recompiling. This handshake of writer and reader schemas occurs in files and during remote procedure calls alike.

Avro became a standard building block of the big data world, especially as a payload format for Apache Kafka, where its compactness and schema registry integration made it a natural choice for streaming event data. Alongside Protocol Buffers and other schema-driven formats, Avro helped establish the pattern of treating the schema as a first-class, evolvable contract between systems, separating the definition of data from its efficient on-the-wire representation.

Sources

Related