Markov Decision Process

A Markov decision process, or MDP, is the formal way of describing a problem in which an agent makes a sequence of decisions over time, each one affecting what happens next. It has four pieces: the states the agent can be in, the actions it can take, the rewards it earns, and the rules, usually stated as probabilities, for how actions move it from one state to another. The “Markov” property is the simplifying assumption that the future depends only on the current state and the action taken, not on the entire history of how the agent got there. That assumption is what makes the math tractable.
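To make the four pieces concrete, here is a minimal sketch in Python of a made-up machine-maintenance problem. The state names, action names, probabilities, and reward figures are all illustrative assumptions, not drawn from any real system.

```python
import random

# Hypothetical example: every name and number below is invented for illustration.
STATES = ["working", "broken"]   # piece 1: states
ACTIONS = ["run", "repair"]      # piece 2: actions

# Piece 3: rewards, as REWARDS[state][action] -> immediate payoff.
REWARDS = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}

# Piece 4: transition rules, as TRANSITIONS[state][action] -> {next_state: probability}.
TRANSITIONS = {
    "working": {
        "run": {"working": 0.9, "broken": 0.1},
        "repair": {"working": 1.0},
    },
    "broken": {
        "run": {"broken": 1.0},
        "repair": {"working": 0.8, "broken": 0.2},
    },
}


def step(state, action, rng):
    """Sample one transition.

    The outcome is drawn using only the current state and action,
    never the earlier history. That is the Markov property.
    """
    outcomes = TRANSITIONS[state][action]
    next_state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, REWARDS[state][action]
```

Calling `step("working", "run", random.Random(0))` returns a sampled next state together with the reward of 10.0 for that state and action.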

The framework traces to Richard Bellman, whose 1957 paper “A Markovian Decision Process,” published in what is now the Indiana University Mathematics Journal, laid out the problem and the idea of solving it by working backward from future value. Bellman’s broader work on dynamic programming gave the field its central tool: a recursive definition in which the value of being in a state is expressed in terms of the values of the states it can lead to. That recursion, now known as the Bellman equation, still anchors the subject today.
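The recursive idea can be sketched in a few lines of Python using value iteration, one standard dynamic-programming method. It repeatedly replaces each state's value with the best immediate reward plus the discounted value of wherever the action leads. The two-state maintenance problem, its numbers, and the discount factor of 0.9 are assumptions made for illustration; the discount factor in particular is a common modeling choice that the description above does not mention.

```python
# Hypothetical two-state maintenance problem; all numbers are invented.
# P[state][action] -> {next_state: probability}
P = {
    "working": {"run": {"working": 0.9, "broken": 0.1}, "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0}, "repair": {"working": 0.8, "broken": 0.2}},
}
# R[state][action] -> immediate reward
R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}
GAMMA = 0.9  # assumed discount factor: how much future reward counts today


def action_value(V, state, action, gamma=GAMMA):
    """Immediate reward plus discounted expected value of the next state."""
    return R[state][action] + gamma * sum(
        prob * V[nxt] for nxt, prob in P[state][action].items()
    )


def value_iteration(gamma=GAMMA, tol=1e-8, max_iters=10_000):
    """Apply the Bellman backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    for _ in range(max_iters):
        new_V = {s: max(action_value(V, s, a, gamma) for a in P[s]) for s in P}
        delta = max(abs(new_V[s] - V[s]) for s in P)
        V = new_V
        if delta < tol:
            break
    return V


V = value_iteration()
# The policy picks, in each state, the action with the highest value.
policy = {s: max(P[s], key=lambda a, s=s: action_value(V, s, a)) for s in P}
```

With these made-up numbers the resulting policy is to run the machine while it works and repair it once it breaks. Changing the repair cost or the breakdown probability can change that answer, which is the point of solving the model rather than guessing.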

Why this matters to a business reader: an MDP is the precise statement of problems that involve choosing actions now to maximize a payoff later, with uncertainty along the way. Inventory restocking, equipment maintenance scheduling, dynamic pricing, and treatment planning all fit the shape. Crucially, the MDP is the formal frame that sits underneath reinforcement learning. When people say a system “learns by trial and error to maximize reward,” the problem it is solving is almost always an MDP; reinforcement learning is the set of methods for finding good policies when the rules of the process are unknown or too large to solve exactly.
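As a rough illustration of learning by trial and error, the sketch below uses tabular Q-learning, one well-known reinforcement learning method, on a made-up two-state maintenance problem. The learner never reads the transition rules; it only tries actions, observes the reward and the next state, and nudges its estimates. The problem, its numbers, and the settings for the learning rate, discount factor, and exploration rate are all illustrative assumptions.

```python
import random

# Hidden "true" rules of a hypothetical process; all numbers are invented.
# The learner below never looks at these tables, it only calls step().
_P = {
    "working": {"run": {"working": 0.9, "broken": 0.1}, "repair": {"working": 1.0}},
    "broken": {"run": {"broken": 1.0}, "repair": {"working": 0.8, "broken": 0.2}},
}
_R = {
    "working": {"run": 10.0, "repair": -2.0},
    "broken": {"run": 0.0, "repair": -5.0},
}


def step(state, action, rng):
    """The environment: returns a sampled next state and the reward earned."""
    outcomes = _P[state][action]
    next_state = rng.choices(list(outcomes), weights=list(outcomes.values()))[0]
    return next_state, _R[state][action]


def q_learning(states, actions, n_steps=200_000, alpha=0.05, gamma=0.9,
               epsilon=0.1, seed=0):
    """Tabular Q-learning with epsilon-greedy exploration (assumed settings)."""
    rng = random.Random(seed)
    Q = {s: {a: 0.0 for a in actions} for s in states}
    state = states[0]
    for _ in range(n_steps):
        # Explore occasionally; otherwise take the action that looks best so far.
        if rng.random() < epsilon:
            action = rng.choice(actions)
        else:
            action = max(Q[state], key=Q[state].get)
        next_state, reward = step(state, action, rng)
        # Move the estimate toward reward plus discounted best future estimate.
        target = reward + gamma * max(Q[next_state].values())
        Q[state][action] += alpha * (target - Q[state][action])
        state = next_state
    return Q


Q = q_learning(["working", "broken"], ["run", "repair"])
learned_policy = {s: max(Q[s], key=Q[s].get) for s in Q}
```

The learned policy is read off the table by picking the highest-valued action in each state. Because the estimates come from random experience, they are approximate and vary with the seed and the settings.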

The honest limits start with the Markov assumption itself. Many real situations have memory or hidden information that a clean MDP ignores, and modeling them faithfully requires more elaborate variants, such as partially observable MDPs (POMDPs). Solving an MDP exactly also becomes computationally infeasible when the number of states explodes, which is why approximate and learning-based methods, rather than Bellman’s original exact recursion, dominate modern practice.
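A quick back-of-the-envelope calculation shows how fast the state count grows. The sketch assumes a hypothetical retailer whose state is the stock level of every product, with each product holding anywhere from 0 to 20 units; both the scenario and the figures are invented for illustration.

```python
def num_states(levels_per_item, num_items):
    """Count joint states when each item independently takes one of several levels."""
    return levels_per_item ** num_items


LEVELS = 21  # assumed: stock of 0 through 20 units per product

# State counts for a growing (hypothetical) product catalog.
state_counts = {n: num_states(LEVELS, n) for n in (1, 2, 5, 10, 50)}

for n, count in state_counts.items():
    print(f"{n:>2} products -> {count:.3e} states")
```

Two products give 441 states, but ten already give more than 10^13, and an exact method has to compute a value for every one of them.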

Sources

Last verified June 6, 2026