Multi-Armed Bandit

A multi-armed bandit is a decision problem named after a row of slot machines, the old slang for which was one-armed bandits. Imagine you face several machines, each paying out at an unknown and different average rate, and you have a fixed number of pulls. Every pull is a tradeoff: you can exploit the machine that has paid best so far, or explore a machine you have tried less in case it is secretly better. The bandit captures the essence of learning from sequential, rewarded choices stripped of all other complexity.

The formulation traces to the statistician Herbert Robbins, whose 1952 paper “Some Aspects of the Sequential Design of Experiments” in the Bulletin of the American Mathematical Society (Vol. 58, pages 527-535) posed the problem of which of two populations to sample at each stage to maximize total reward. Earlier roots go back to William Thompson’s 1933 work on choosing between medical treatments. Robbins’s framing turned it into a clean mathematical object with provable strategies.

The bandit is the simplest member of the reinforcement learning family: it has actions and rewards but no states, so the action you take does not change the situation you face next. That makes it the natural place to study the exploration-exploitation tradeoff in isolation. Why business readers should care: bandit algorithms now power practical systems such as website A/B testing, ad selection, news and content recommendation, and clinical trial design, anywhere you must keep earning while you are still learning what works.