Failure Detector

A failure detector is the part of a distributed system responsible for deciding which other nodes are still alive and which have crashed. The usual mechanism is heartbeats: each node periodically signals that it is still running, and if a node stops hearing from a peer within some timeout, it begins to suspect that peer has failed. Higher-level protocols, such as leader election and cluster membership, then act on those suspicions.

The deep difficulty is that in an asynchronous network there is no reliable way to tell a crashed node from a merely slow one. A node that has truly died and a node whose messages are delayed look identical from the outside, so any failure detector that relies on timeouts must guess. This is closely tied to the FLP impossibility result, which shows that consensus cannot be guaranteed in a purely asynchronous system where even one process may fail.

The theory of failure detectors was formalized by Tushar Chandra and Sam Toueg in their paper “Unreliable Failure Detectors for Reliable Distributed Systems.” Rather than demanding a perfect detector, they introduce failure detectors that can make mistakes, characterized by properties of completeness, meaning crashed nodes are eventually suspected, and accuracy, meaning correct nodes are not wrongly suspected forever. They show that even a weak, unreliable detector is enough to solve consensus, isolating exactly how much failure information a system needs.

Because perfection is unattainable, a real failure detector is a tuning problem: a short timeout detects genuine crashes quickly but raises more false suspicions of healthy-but-slow nodes, while a long timeout is more accurate but slower to react. Systems that gossip heartbeats around a cluster, such as Apache Cassandra, layer a failure detector on top of that information to make the up-or-down decision, trading detection speed against accuracy.

Sources

Related