InstructGPT brings RLHF to GPT-3

In January 2022, OpenAI published a blog post, “Aligning language models to follow instructions,” introducing InstructGPT. The accompanying paper, “Training language models to follow instructions with human feedback” (Ouyang and colleagues), appeared on arXiv in March 2022. The work applied reinforcement learning from human feedback, or RLHF, to GPT-3 so that the model would do what a user asked rather than merely continue the text in a plausible way.

The recipe had three steps. First, human labelers wrote demonstrations of good responses, which were used to fine-tune the base model with supervised learning. Second, labelers ranked several model outputs for the same prompt, and those rankings trained a reward model that scored responses. Third, the language model was further optimized with reinforcement learning, using proximal policy optimization (PPO), to maximize that reward. The paper reports a striking result: human evaluators preferred outputs from the 1.3-billion-parameter InstructGPT over outputs from the 175-billion-parameter GPT-3, even though InstructGPT had more than 100 times fewer parameters. The InstructGPT models were also more truthful and less likely to produce toxic output.
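The second and third steps can be made concrete with a small sketch. It is a toy illustration in plain Python, not the paper's implementation: the function names, the example scores, and the `beta` value are assumptions chosen for illustration. The first function computes a pairwise ranking loss of the kind used to train a reward model from labeler rankings. The second shows a reward reduced by a penalty for drifting away from the supervised model, a common ingredient of RLHF objectives.

```python
import math
from itertools import combinations


def pairwise_ranking_loss(scores_best_to_worst):
    """Mean of -log(sigmoid(r_preferred - r_rejected)) over all pairs.

    `scores_best_to_worst` holds hypothetical reward-model scores for the
    responses to one prompt, ordered as the labeler ranked them (best first).
    """
    # combinations() keeps list order, so the first item of each pair
    # is always the response the labeler ranked higher.
    pairs = list(combinations(scores_best_to_worst, 2))
    total = 0.0
    for preferred, rejected in pairs:
        diff = preferred - rejected
        # Numerically stable form of -log(sigmoid(diff)) = log(1 + exp(-diff)).
        total += max(-diff, 0.0) + math.log1p(math.exp(-abs(diff)))
    return total / len(pairs)


def penalized_reward(reward, logprob_policy, logprob_reference, beta=0.02):
    """Reward minus beta * (log pi_policy - log pi_reference).

    `beta` is an illustrative penalty weight; the log-probabilities are
    those of the sampled response under the tuned and reference models.
    """
    return reward - beta * (logprob_policy - logprob_reference)


if __name__ == "__main__":
    # Scores that agree with the labeler's ranking give a lower loss
    # than scores that contradict it.
    print(pairwise_ranking_loss([2.0, 1.0, 0.0]))
    print(pairwise_ranking_loss([0.0, 1.0, 2.0]))
    # A response the tuned model finds far more likely than the
    # reference model does has its reward reduced.
    print(penalized_reward(1.0, logprob_policy=-5.0, logprob_reference=-9.0))
```

In a real system the scores would come from a neural reward model and the loss would be minimized by gradient descent. Here the numbers are fixed by hand only to show how a ranking turns into a training signal.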

InstructGPT was the direct technical precursor of ChatGPT, which launched in November 2022 and which OpenAI described as a sibling model trained with the same RLHF methods. InstructGPT helped establish alignment by human feedback as a standard way to turn a raw language model into a usable assistant.
