Thompson Sampling

Thompson sampling is a strategy for the explore-exploit problem: when you can try several options and only learn how good each is by trying it, how do you balance using the option that currently looks best against testing others that might be better? The idea, first described by William R. Thompson in 1933, is simple. Keep a probability distribution representing your current belief about each option’s reward. To choose, draw one random sample from each option’s distribution and pick the option whose sample is highest. Then observe the reward and update the chosen option’s distribution with the result.
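The sample, pick, update loop can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation. It assumes rewards are simple successes or failures (a click or no click) and that each option's belief is a Beta distribution starting from a uniform Beta(1, 1) prior. The text above does not fix a particular distribution, and the names BetaBernoulliThompson and simulate are hypothetical.

```python
import random


class BetaBernoulliThompson:
    """Thompson sampling sketch for options with success/failure rewards.

    Assumption: each option's belief is a Beta(successes, failures)
    distribution, starting from the uniform prior Beta(1, 1).
    """

    def __init__(self, n_options, seed=None):
        self.rng = random.Random(seed)
        self.successes = [1] * n_options  # Beta alpha parameters
        self.failures = [1] * n_options   # Beta beta parameters

    def select(self):
        # Draw one sample from each option's belief and pick the largest.
        samples = [
            self.rng.betavariate(a, b)
            for a, b in zip(self.successes, self.failures)
        ]
        return max(range(len(samples)), key=samples.__getitem__)

    def update(self, option, reward):
        # Only the chosen option's belief changes.
        if reward:
            self.successes[option] += 1
        else:
            self.failures[option] += 1


def simulate(true_rates, rounds, seed=0):
    """Run the loop against made-up success rates; return pulls per option."""
    env = random.Random(seed + 1)  # separate generator for the simulated world
    policy = BetaBernoulliThompson(len(true_rates), seed=seed)
    pulls = [0] * len(true_rates)
    for _ in range(rounds):
        option = policy.select()
        reward = env.random() < true_rates[option]
        policy.update(option, reward)
        pulls[option] += 1
    return pulls


if __name__ == "__main__":
    # Hypothetical success rates, chosen only for illustration.
    print(simulate([0.1, 0.5, 0.9], rounds=2000))
```

With success rates this far apart, most of the 2000 pulls should end up on the last option, although the exact counts depend on the random seed.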

This naturally balances exploration and exploitation. Options the system is unsure about have wide distributions, so they occasionally produce high samples and get tried; options that have proven mediocre have narrow distributions centered low and rarely win. As evidence accumulates, the choices concentrate on the genuinely best options. For decades Thompson sampling was treated as a heuristic curiosity. Then, in a widely cited paper at NeurIPS 2011 (the conference was then called NIPS), Olivier Chapelle and Lihong Li gave an empirical evaluation on simulated and real data, including display advertising, and argued that despite being one of the oldest such heuristics, Thompson sampling is competitive with more elaborate methods and should be a standard baseline.
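To make "wide" and "narrow" concrete, the sketch below compares two hypothetical options with similar observed success rates but very different amounts of evidence. It assumes Beta beliefs built from a uniform Beta(1, 1) prior, which is one common choice rather than something the method requires. The function names and the counts are made up for illustration.

```python
import math
import random


def beta_mean_std(alpha, beta):
    """Mean and standard deviation of a Beta(alpha, beta) distribution."""
    total = alpha + beta
    mean = alpha / total
    variance = alpha * beta / (total ** 2 * (total + 1))
    return mean, math.sqrt(variance)


def fraction_above(alpha, beta, threshold, draws=10_000, seed=0):
    """Monte Carlo estimate of how often a sample exceeds the threshold."""
    rng = random.Random(seed)
    hits = sum(rng.betavariate(alpha, beta) > threshold for _ in range(draws))
    return hits / draws


if __name__ == "__main__":
    # Hypothetical evidence, added to the Beta(1, 1) prior:
    #   barely tested option: 1 success, 3 failures     -> Beta(2, 4)
    #   well tested option:   100 successes, 300 failures -> Beta(101, 301)
    for label, (a, b) in {"barely tested": (2, 4), "well tested": (101, 301)}.items():
        mean, std = beta_mean_std(a, b)
        share = fraction_above(a, b, threshold=0.5)
        print(f"{label}: mean={mean:.3f} std={std:.3f} share above 0.5={share:.3f}")
```

The barely tested option's samples exceed 0.5 roughly one time in five, so it still gets chances to win a round. The well tested option's samples essentially never do, because its distribution is several times narrower.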

Thompson sampling is now common in online systems that must learn continuously, such as choosing which article, ad, or product to show, and in A/B-test-style optimization where you want to send more traffic to better variants as you learn.
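For the A/B-test case, the shift in traffic can be watched directly. The sketch below simulates two variants and reports what share of visitors went to variant B in each block of traffic. It assumes conversion is a yes/no outcome and uses Beta beliefs with a uniform prior. The function name traffic_share_by_block and the conversion rates are hypothetical, and a real system would observe conversions rather than simulate them.

```python
import random


def traffic_share_by_block(rate_a, rate_b, blocks=5, block_size=1000, seed=0):
    """Share of visitors sent to variant B in each consecutive block."""
    rng = random.Random(seed)
    rates = [rate_a, rate_b]  # simulated conversion rates (unknown in practice)
    wins = [1, 1]             # Beta alpha per variant, uniform prior
    losses = [1, 1]           # Beta beta per variant, uniform prior
    shares = []
    for _ in range(blocks):
        sent_to_b = 0
        for _ in range(block_size):
            # One sample per variant; the higher sample gets this visitor.
            draws = [rng.betavariate(wins[i], losses[i]) for i in range(2)]
            variant = 0 if draws[0] >= draws[1] else 1
            if rng.random() < rates[variant]:
                wins[variant] += 1
            else:
                losses[variant] += 1
            sent_to_b += variant
        shares.append(sent_to_b / block_size)
    return shares


if __name__ == "__main__":
    # Hypothetical conversion rates: A converts 5% of visitors, B converts 20%.
    print(traffic_share_by_block(0.05, 0.20))
```

Unlike a fixed 50/50 split, the share going to the better variant should rise as evidence accumulates, while the weaker variant keeps receiving a small and shrinking amount of traffic.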

For a business reader, Thompson sampling is a practical answer to a constant dilemma: it lets a recommendation or ad system keep earning from what already works while still probing for something better, automatically and with very little code.
