Distributed System

A distributed system is one whose components live on separate networked computers and coordinate their actions only by sending messages to one another. There is no shared memory and no single clock; each machine sees its own local state and learns about the rest of the system only through messages that take time to arrive and may be lost. The goal is to make this collection of machines behave, from a user’s point of view, like one coherent system.

The defining property is also the defining difficulty: partial failure. In a program on a single machine, the whole thing either runs or crashes. In a distributed system, one machine can fail while the others keep running, and the survivors often cannot tell whether a silent peer has crashed or is merely slow. Leslie Lamport captured this in a quip he sent to a Digital Equipment Corporation research distribution list on 28 May 1987: “A distributed system is one in which the failure of a computer you didn’t even know existed can render your own computer unusable.” Lamport’s own account of the line is preserved on his publications page.

That single sentence names the three hard problems at the heart of the field. There is no global clock, so events on different machines cannot be trivially ordered. Operations happen concurrently, so the system must reason about many things changing at once. And failures are partial and ambiguous, so the system must keep working, or fail safely, when some pieces stop responding.

The payoff for accepting this difficulty is large. Spreading work across many machines lets a system scale beyond what any one computer can do, survive the loss of individual machines, and place computation near the users who need it. Nearly every concept in this part of the library, from consensus to replication to the CAP theorem, exists to manage the gap between that promise and the messy reality Lamport described.

Sources

Related