Cosmic Rays and Soft Errors

A soft error is a transient fault in a computer: a stored bit flips from one value to the other, corrupting data without permanently damaging the hardware. Unlike a hard error, which reflects a broken cell that will keep failing, a soft error is a one-time event. Rewrite the affected location and it behaves correctly again. The trouble is that the flip can happen silently, turning a correct number or instruction into a wrong one with no warning, which is why soft errors became a central concern in the reliability of memory chips as densities grew.

The two dominant physical causes are alpha particles and cosmic rays. Alpha particles emitted by trace radioactive impurities in chip packaging materials deposit charge as they pass through silicon; this source was identified in the late 1970s. The second source is cosmic radiation. High-energy particles from space strike the upper atmosphere and produce showers of secondary particles. Those that reach ground level, mainly neutrons, can disrupt a memory cell when they strike it. As individual cells shrank and held less charge, the amount of deposited charge needed to flip a bit fell, making chips more sensitive to both effects.

The foundational understanding of the cosmic-ray contribution came from a long program of research at IBM led by James F. Ziegler. His paper “Terrestrial Cosmic Rays,” published in the IBM Journal of Research and Development in 1996, reviews the physics of the cosmic-ray particles that reach the surface of the Earth and explains which of them, principally neutrons, protons, and pions, are energetic and numerous enough to cause soft failures in electronics. That issue of the journal collected IBM’s broader body of work measuring soft-error rates and modeling how they scale with technology, establishing the field on a quantitative footing.

This research made the case for designing memory systems that assume errors will happen rather than hoping they will not. The standard answer is error-correcting code (ECC) memory, which stores extra check bits alongside each data word so that a single-bit flip can be detected and corrected on the fly. With the common single-error-correcting, double-error-detecting (SECDED) codes, a double-bit error cannot be corrected but can at least be detected. Servers, aerospace systems, and other reliability-critical machines routinely use ECC for exactly this reason, and the soft-error rate of a memory design is a measured engineering parameter, not an afterthought.

Soft errors reframe an intuition many programmers hold, that memory faithfully returns what was written to it. At scale and over time, that assumption is only approximately true: a sufficiently large fleet of machines will see bits flipped by particles from space. The lineage runs forward to later memory-integrity concerns, including disturbance effects like Rowhammer, where the threat is no longer a stray particle but a deliberate access pattern. The underlying lesson is the same: the bit is a physical thing, and physics can change it.
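
The effect of scale can be made concrete with a little arithmetic. The sketch below assumes soft-error rates are quoted in FIT (failures per billion device-hours) per megabit and that upsets arrive independently, so a Poisson model applies. The rate and fleet size in the usage lines are placeholders chosen for illustration, not measured values, and the function names are hypothetical.

```python
import math


def expected_upsets(fit_per_mbit: float, mbit_per_machine: float,
                    machines: int, hours: float) -> float:
    """Expected number of bit upsets; 1 FIT = 1 failure per 1e9 device-hours."""
    return fit_per_mbit * mbit_per_machine * machines * hours / 1e9


def prob_at_least_one(expected: float) -> float:
    """Probability of one or more upsets under a Poisson model."""
    return 1.0 - math.exp(-expected)


if __name__ == "__main__":
    RATE = 1.0                  # placeholder FIT per Mbit, not a measurement
    MBIT = 64 * 8 * 1024        # a hypothetical machine with 64 GiB of memory

    single = expected_upsets(RATE, MBIT, machines=1, hours=24)
    fleet = expected_upsets(RATE, MBIT, machines=10_000, hours=24)

    print(f"one machine, one day: P(upset) = {prob_at_least_one(single):.4f}")
    print(f"fleet, one day: expected upsets = {fleet:.1f}")
```

Under any plausible per-bit rate the same pattern holds: an event that is rare on one machine becomes routine across a fleet, because the expected count grows linearly with memory size, machine count, and time.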