Pre-training

Pre-training is the first and most expensive phase of building a modern language model: the model is trained on a massive, general body of text to learn the structure of language before it is ever pointed at a specific task. Because the text is unlabeled, the text itself supplies the training signal: the model learns by predicting words that have been hidden or that come next, and it picks up grammar, facts, and reasoning patterns as a by-product.
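For readers who want to see the mechanics, here is a minimal sketch in Python of how unlabeled text can be turned into training examples without any human labeling. The function names (next_word_pairs, masked_example) and the "[MASK]" placeholder string are illustrative assumptions, not taken from any particular library, and real systems split text into subword tokens rather than whitespace-separated words.

```python
def next_word_pairs(text):
    """Build (context, target) pairs for next-word prediction.

    Hypothetical helper: each pair asks the model to guess the word
    that follows the words seen so far.
    """
    words = text.split()  # simplification: real models use subword tokens
    return [(words[:i], words[i]) for i in range(1, len(words))]


def masked_example(text, position, mask_token="[MASK]"):
    """Hide one word and return (masked_words, hidden_word).

    Hypothetical helper: the model sees words on both sides of the gap
    and must recover the hidden word.
    """
    words = text.split()
    hidden = words[position]
    masked = words[:position] + [mask_token] + words[position + 1:]
    return masked, hidden


sentence = "the cat sat on the mat"

# Next-word style: every position in the sentence becomes a training example.
pairs = next_word_pairs(sentence)
for context, target in pairs:
    print(" ".join(context), "->", target)

# Masked style: one word is hidden and the rest of the sentence is the clue.
masked_words, hidden_word = masked_example(sentence, 2)
print(" ".join(masked_words), "->", hidden_word)
```

The point of the sketch is that the "answers" come for free from the text itself, which is why pre-training can use far more data than any hand-labeled dataset could provide.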

The modern form of the approach was set out in OpenAI’s 2018 report “Improving Language Understanding by Generative Pre-Training” (Radford, Narasimhan, Salimans, and Sutskever), the GPT-1 paper, which showed that a model pre-trained on unlabeled text and then fine-tuned with only small task-specific changes outperformed purpose-built models, improving on the state of the art in 9 of the 12 tasks studied. The same year, Google’s BERT paper (Devlin et al., 2018) used a bidirectional pre-training objective and stated it “introduce[s] a new language representation model called BERT.” Together, the two papers established the now-standard two-stage recipe: pre-train broadly, then adapt.

This separation of general training from task-specific adaptation is what makes today’s “foundation models” economical: the costly general training is done once, and many applications reuse that single base.

Why business readers should care: Pre-training is where most of an AI model’s cost and capability live. It explains why only a handful of well-funded organizations build base models, while everyone else adapts those models for specific uses.