Out-of-Order Execution

Out-of-order execution lets a processor run instructions in whatever order their input operands become ready, instead of strictly following the order in which the programmer wrote them. The motivation is that a single slow instruction, such as a load that misses the cache, should not block every instruction behind it. If later instructions do not depend on the stalled one, they can execute while it waits. To preserve correct program behavior, the results are still committed, or retired, in the original program order, so from the outside the machine appears to obey the sequential semantics of the instruction set.

The foundational work is R.M. Tomasulo’s 1967 paper “An Efficient Algorithm for Exploiting Multiple Arithmetic Units,” published in the IBM Journal of Research and Development, describing the floating-point unit of the IBM System/360 Model 91. Tomasulo’s algorithm introduced reservation stations that hold instructions waiting for operands, and a common data bus that broadcasts each result to every station that needs it. The scheme uses register renaming, tracking the producer of each value by a tag rather than by an architectural register name, which removes false dependencies that would otherwise force instructions to wait unnecessarily.

Register renaming is the key insight. Two instructions that merely reuse the same register name, without a real flow of data between them, create what are called name dependencies. By tagging each result with the identity of the instruction that produces it, Tomasulo’s hardware lets such instructions proceed in parallel and resolves only the true data dependencies. This is why the algorithm, designed for a 1960s mainframe, maps so directly onto the renaming logic in modern processors.

Modern out-of-order processors add a reorder buffer, a structure that holds completed-but-not-yet-retired results so that instructions can finish out of order but retire in order. The reorder buffer also provides a clean place to handle exceptions and to discard work that turns out to be unneeded, which is what makes speculative execution past predicted branches safe: speculative results sit in the reorder buffer and are simply thrown away if the speculation was wrong.

Out-of-order execution, combined with superscalar issue and branch prediction, is the standard architecture of high-performance CPUs. The same mechanisms that make it fast, executing work before it is known to be needed and buffering results, are also what attackers exploited in the Spectre and Meltdown class of vulnerabilities, where the side effects of discarded speculative work leaked through processor caches.

Sources

Related