Reinforcement Learning from Human Feedback (RLHF)

Reinforcement Learning from Human Feedback (RLHF) is a method for shaping a model’s behavior using human judgments rather than fixed correct answers. People compare model outputs and say which they prefer; those preferences train a “reward model,” which is then used to push the main model toward responses people actually want.
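To make the reward-model idea concrete, here is a minimal sketch of learning from pairwise preferences. It is an illustration under stated assumptions, not the method of any particular paper or product: each response is reduced to two made-up numeric features (helpfulness and rudeness), the reward model is a simple weighted sum, and the loss is the standard pairwise logistic (Bradley-Terry style) form. All function names and data values are hypothetical.

```python
import math

def reward(weights, features):
    """Score a response as a weighted sum of its (hypothetical) features."""
    return sum(w * x for w, x in zip(weights, features))

def preference_loss(weights, preferred, rejected):
    """Pairwise logistic loss: small when the preferred response scores higher."""
    margin = reward(weights, preferred) - reward(weights, rejected)
    return math.log1p(math.exp(-margin))  # equals -log(sigmoid(margin))

def train_reward_model(comparisons, n_features, lr=0.1, epochs=200):
    """Fit weights by gradient descent on the pairwise loss."""
    weights = [0.0] * n_features
    for _ in range(epochs):
        for preferred, rejected in comparisons:
            margin = reward(weights, preferred) - reward(weights, rejected)
            # Derivative of the loss with respect to the margin.
            grad_scale = -1.0 / (1.0 + math.exp(margin))
            for i in range(n_features):
                weights[i] -= lr * grad_scale * (preferred[i] - rejected[i])
    return weights

# Made-up human comparisons: (preferred, rejected), features = [helpfulness, rudeness].
comparisons = [
    ([1.0, 0.0], [0.0, 1.0]),
    ([0.8, 0.1], [0.3, 0.6]),
    ([0.9, 0.2], [0.9, 0.9]),
]

weights = train_reward_model(comparisons, n_features=2)
```

After training, the learned weights reward helpfulness and penalize rudeness, so the model ranks each preferred response above its rejected counterpart. In a real RLHF pipeline the reward model is typically a neural network that scores full text, and its scores then guide a separate reinforcement learning step on the main model.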

The core technique was introduced by Christiano et al. in the 2017 paper “Deep reinforcement learning from human preferences,” which framed the problem as needing to “communicate complex goals” to RL systems when those goals are hard to specify with a simple score. The approach became central to language models through OpenAI’s 2022 InstructGPT paper (Ouyang et al.), which applied RLHF, after an initial round of supervised fine-tuning on human-written examples, to make GPT-style models follow instructions and behave more helpfully and safely. RLHF is a key step in turning raw text predictors into the assistant-style models people use today.

Why business readers should care: RLHF is a major reason chatbots feel helpful and polite rather than merely statistically plausible. It also explains why “alignment” work depends on human labor and judgment, which carries cost, quality, and bias implications.
