In the early 1990s, Gerald Tesauro of IBM built TD-Gammon, a neural network that learned to play backgammon almost entirely by playing against itself. Rather than being told good moves by human experts, the program adjusted its own evaluation of board positions using temporal-difference (TD) learning, a reinforcement learning method that updates predictions based on the difference between successive estimates. Tesauro described the full system in “Temporal Difference Learning and TD-Gammon,” published in Communications of the ACM in March 1995 (Vol. 38, No. 3), with the technical groundwork laid in his earlier 1992 paper “Practical Issues in Temporal Difference Learning” in the journal Machine Learning.
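To make "the difference between successive estimates" concrete, here is a minimal sketch of the simplest tabular form of the idea, TD(0), on a toy five-state random walk. It is not Tesauro's implementation: TD-Gammon used TD(λ) to train a neural network on backgammon positions, whereas this example keeps a plain lookup table. The function name `td0_random_walk`, the step size `alpha`, the episode count, and the toy environment are all assumptions chosen for illustration.

```python
import random


def td0_random_walk(episodes=20000, alpha=0.02, n_states=5, seed=0):
    """Estimate, with tabular TD(0), the chance of exiting a random walk on the right.

    Hypothetical toy setup: states 1..n_states sit between two terminal states.
    Exiting left counts as outcome 0, exiting right as outcome 1.
    """
    rng = random.Random(seed)
    # values[0] and values[-1] are the terminal outcomes (fixed at 0 and 1);
    # the interior predictions all start at an uninformed 0.5.
    values = [0.0] + [0.5] * n_states + [1.0]

    for _ in range(episodes):
        state = (n_states + 1) // 2  # start each episode in the middle
        while 0 < state < n_states + 1:
            next_state = state + rng.choice((-1, 1))
            # TD error: the gap between the next prediction and the current one.
            td_error = values[next_state] - values[state]
            # Nudge the current prediction a small step toward the next one.
            values[state] += alpha * td_error
            state = next_state

    return values[1:-1]


if __name__ == "__main__":
    for k, v in enumerate(td0_random_walk(), start=1):
        print(f"state {k}: estimated {v:.3f}, exact {k / 6:.3f}")
```

No move is ever labeled good or bad here; each prediction is only pulled toward the prediction that follows it, and the final outcome propagates backward over many episodes. For this symmetric walk the exact answers are 1/6 through 5/6, so the learned estimates should settle close to those values.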
The results were striking for the time. Through self-play, the network discovered sophisticated positional judgment that rivaled the best human players. Tesauro reported that an early version lost only about a quarter of a point per game against world-class opponents, and that later versions reached near-parity with former World Champion Bill Robertie. Robertie assessed that TD-Gammon “plays at a strong master level,” while elite player Kit Woolsey said “its positional judgment is far better than mine.”
TD-Gammon mattered because it offered concrete proof that a system could reach expert play in a real game by learning from its own experience rather than from hand-coded rules or human game records. That idea, a neural network improving through self-play and reinforcement learning, is a direct ancestor of later DeepMind systems such as the Deep Q-Network and AlphaGo, which build on the line of reinforcement-learning research that TD-Gammon helped pioneer.