State Machine Replication

State machine replication is a technique for building a fault-tolerant service out of unreliable machines. The service is modeled as a deterministic state machine: given the same starting state and the same sequence of commands, it always produces the same outputs and ends in the same state. If several replicas all begin identically and apply the identical sequence of commands, they remain copies of one another, so the failure of any one replica does not stop the service.

Fred Schneider laid out the approach in his ACM Computing Surveys tutorial “Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial” in 1990. He showed that the central requirement is agreement on the order of commands: every working replica must process the same requests in the same sequence. Schneider distinguished the ordering problem from the question of how many replicas are needed to tolerate a given number of crash or Byzantine failures.

Ordering the commands is exactly where consensus enters. Leslie Lamport’s Paxos, and later Raft, exist in large part to give a set of replicas an agreed-upon, gap-free sequence of commands, typically recorded as a replicated log that each replica applies in order.

This pattern underlies much of modern fault-tolerant infrastructure, from coordination services to replicated databases. It cleanly separates two concerns: the application logic, expressed as a deterministic state machine, and the consensus layer, which only has to agree on what order to feed it commands.