Byzantine Fault Tolerance

Byzantine fault tolerance is the property of a system that continues to operate correctly even when some of its components fail in the worst possible way. A crash fault simply stops a node; a Byzantine fault lets a node do anything at all, including sending different, contradictory messages to different peers, whether through a bug, hardware corruption, or deliberate attack. The term comes from Lamport, Shostak, and Pease’s “The Byzantine Generals Problem” (1982), which framed the challenge as generals who must agree on a plan while some among them are traitors.

The central quantitative result is that tolerating Byzantine faults requires a strict supermajority of honest participants. To withstand up to f arbitrary failures, the system needs more than three times f total nodes, so that the honest nodes always outnumber the combined effect of the faulty ones and any followers they might confuse. This two-thirds honest threshold is far more demanding than crash tolerance, which only needs a simple majority.

For decades Byzantine agreement was considered too expensive for everyday use. That changed with Castro and Liskov’s “Practical Byzantine Fault Tolerance” (OSDI 1999), which showed an algorithm efficient enough to run real replicated services tolerating arbitrary faults, reviving the field for practical systems.

Byzantine fault tolerance is now the conceptual foundation of blockchains and other open, trustless systems, where participants cannot be assumed honest, as well as of high-assurance environments such as avionics and spacecraft, where a single corrupted component must never be able to derail the whole.

Sources

Related