Tokenization

Tokenization is the step that turns raw text into tokens, the units a language model actually processes. Each token is represented by an integer ID drawn from a fixed vocabulary. Modern systems use subword tokenization: common words become single tokens, while rare or novel words are split into smaller reusable pieces. This lets a model handle any input, including misspellings and new terms, with a fixed vocabulary.
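
To make the splitting behavior concrete, here is a minimal sketch in Python. It assumes a tiny hand-written vocabulary (PIECES is hypothetical, not taken from any real model) and uses greedy longest-match lookup, which is a simplification chosen for readability rather than the procedure any particular tokenizer uses.

```python
import string

# Hypothetical toy vocabulary: a few multi-character pieces plus every
# lowercase letter, so any lowercase word can be spelled out if needed.
PIECES = ["<unk>", "the", "token", "ization", "un", "able"] + list(string.ascii_lowercase)
VOCAB = {piece: idx for idx, piece in enumerate(PIECES)}


def tokenize(word, vocab=VOCAB):
    """Split a word into the longest known pieces, left to right."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        # Shrink the candidate until it is a known piece or a single character.
        while end - start > 1 and word[start:end] not in vocab:
            end -= 1
        piece = word[start:end]
        # A single character outside the vocabulary becomes "<unk>".
        pieces.append(piece if piece in vocab else "<unk>")
        start = end
    return pieces


def encode(word, vocab=VOCAB):
    """Map a word to the integer IDs of its pieces."""
    return [vocab[piece] for piece in tokenize(word, vocab)]


print(tokenize("the"))            # ['the']
print(tokenize("tokenization"))   # ['token', 'ization']
print(tokenize("untokenizable"))  # ['un', 'token', 'i', 'z', 'able']
print(encode("tokenization"))     # [2, 3]
```

The common word maps to one token, while the invented word "untokenizable" still gets encoded, just with more tokens. Real vocabularies are learned from data and contain tens of thousands of pieces or more.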

The dominant approach, byte-pair encoding (BPE), was adapted for neural machine translation by Sennrich, Haddow, and Birch in the paper “Neural Machine Translation of Rare Words with Subword Units,” first posted in 2015 and published at ACL 2016. The abstract frames the problem plainly: translation is “an open-vocabulary problem,” yet models operate with a fixed vocabulary. Subword units bridge that gap. BPE builds its vocabulary by starting from individual characters and repeatedly merging the most frequent adjacent pair of symbols into a new unit. The same idea now underlies the tokenizers of many of today’s LLMs.
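
The merge-learning idea behind BPE can be sketched in a few lines of Python. This is an illustrative simplification, not the authors' reference code: the function names are made up for this example, the word counts are a toy corpus, and ties between equally frequent pairs are broken by whichever pair was seen first, which is an assumption of this sketch.

```python
from collections import Counter


def count_pairs(vocab):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in vocab.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(pair, vocab):
    """Replace every occurrence of `pair` with one merged symbol."""
    merged = {}
    for symbols, freq in vocab.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged


def learn_bpe(word_freqs, num_merges):
    """Learn up to `num_merges` merges from a word-frequency table."""
    # Start from single characters plus an end-of-word marker.
    vocab = {tuple(word) + ("</w>",): freq for word, freq in word_freqs.items()}
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(vocab)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # first-seen pair wins ties
        vocab = merge_pair(best, vocab)
        merges.append(best)
    return merges, vocab


# Toy corpus: word -> number of occurrences.
toy_counts = {"low": 5, "lower": 2, "newest": 6, "widest": 3}
merges, vocab = learn_bpe(toy_counts, num_merges=5)
print(merges)
# [('e', 's'), ('es', 't'), ('est', '</w>'), ('l', 'o'), ('lo', 'w')]
```

After five merges the frequent ending "est" and the frequent stem "low" have each become a single symbol, while rarer sequences remain split into characters. Running more merges grows the vocabulary; the number of merges is the main knob that sets vocabulary size.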

Token counts matter practically: every model has a maximum number of tokens it can handle in a single request, and providers typically bill per token processed.
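
A short Python sketch shows the arithmetic involved. Every number below is a made-up placeholder, not a real provider's price or limit, and the function names are hypothetical. The sketch also assumes that input and output tokens are priced separately per million tokens and that both count against the same limit; check a provider's documentation for its actual rules.

```python
def estimate_cost(input_tokens, output_tokens,
                  input_price_per_million, output_price_per_million):
    """Estimated charge for one request, in the same currency as the prices."""
    total = (input_tokens * input_price_per_million
             + output_tokens * output_price_per_million)
    return total / 1_000_000


def fits_context(input_tokens, max_output_tokens, context_window):
    """True if the prompt plus the reserved output fits within the limit."""
    return input_tokens + max_output_tokens <= context_window


# Placeholder figures for illustration only.
cost = estimate_cost(
    input_tokens=12_000,
    output_tokens=800,
    input_price_per_million=2.0,
    output_price_per_million=8.0,
)
print(f"{cost:.4f}")                    # 0.0304
print(fits_context(12_000, 800, 8192))  # False: 12,800 tokens exceed 8,192
print(fits_context(6_000, 800, 8192))   # True
```

The same document can therefore cost more or less, or fit or not fit, depending purely on how many tokens the tokenizer produces for it.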

Why business readers should care: Tokens are the unit of both cost and capacity for LLMs. Pricing, context-window limits, and even why some languages cost more to process all trace back to how text is tokenized.
