Context Window

A context window is the maximum amount of text a model can consider at once, counting both the input you provide and the output it generates. It is measured in tokens (chunks of text roughly the size of a word fragment), not characters or words. Everything the model “sees” for a given request (your instructions, any documents you paste in, the conversation so far, and its own reply) has to fit inside this window.
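As a rough illustration of that budget, the sketch below checks whether an input plus a reserved reply length fits in a window. The function names and the 4-characters-per-token ratio are assumptions made for this example only; real tokenizers produce different counts, so use the model vendor's own token counter when precision matters.

```python
import math


def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate. The 4-chars-per-token ratio is an assumed
    heuristic for English prose, not an exact tokenizer count."""
    return math.ceil(len(text) / chars_per_token)


def fits_in_window(input_tokens: int, reserved_output_tokens: int,
                   window_tokens: int) -> bool:
    """Input and output share one window, so both are counted together."""
    return input_tokens + reserved_output_tokens <= window_tokens


if __name__ == "__main__":
    document = "word " * 1_000  # hypothetical 5,000-character input
    needed = estimate_tokens(document)
    print(needed, fits_in_window(needed, 8_000, 200_000))
```

The point to notice is that space for the reply has to be reserved up front: an input that fills the window exactly leaves no room for the model to answer.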

The concept traces to the architecture that powers modern language models. The 2017 paper “Attention Is All You Need” introduced the Transformer, a design “based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.” Attention lets every position in a sequence relate directly to every other position, which is what makes a fixed block of context the natural unit a Transformer operates on. In standard self-attention, the compute and memory needed for the attention scores grow roughly with the square of the sequence length, which is why early models had small windows and why extending them has been a major engineering focus.
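To make that growth concrete: in standard (dense) self-attention, each position is scored against every position, so a sequence of n tokens produces n × n scores per attention head per layer. The sketch below only counts those scores; it is back-of-the-envelope arithmetic with a hypothetical function name, not a model of any particular system, and optimized or sparse attention variants change the practical cost.

```python
def attention_score_count(seq_len: int) -> int:
    """Scores in one dense self-attention matrix (one head, one layer)."""
    return seq_len * seq_len


if __name__ == "__main__":
    short, long = 2_000, 200_000
    ratio = attention_score_count(long) // attention_score_count(short)
    # A 100x longer sequence needs 10,000x as many scores.
    print(f"{long // short}x longer -> {ratio}x more scores")
```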

Windows have grown dramatically. Where early systems handled a few thousand tokens, vendors now publish far larger limits in their own documentation: Anthropic’s model overview, for example, lists context windows of 200,000 tokens for some Claude models and up to 1 million tokens for others, with the larger window equating to hundreds of thousands of words of input. Each model’s official documentation is the authoritative source for its current limit.

Why business readers should care: the context window sets a hard ceiling on how much material (a contract, a codebase, a quarter of support tickets) you can feed a model in one shot. When your content exceeds the window, you must summarize it, split it into chunks, or use retrieval to pull in only the relevant pieces. Usage is also typically priced per token, so filling a large window is powerful but not free.
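The chunking option can be sketched as follows. This is a minimal example that splits on whitespace and uses word count as a stand-in for token count, which is an assumption; a production pipeline would measure chunks with the model's actual tokenizer. The function name and the optional overlap parameter are illustrative.

```python
def chunk_words(text: str, max_words: int, overlap: int = 0) -> list[str]:
    """Split text into chunks of at most max_words words.

    overlap repeats that many trailing words at the start of the next
    chunk so that context is not lost at chunk boundaries.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must be in [0, max_words)")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


if __name__ == "__main__":
    for chunk in chunk_words("a b c d e f g", max_words=3, overlap=1):
        print(chunk)
```

Each chunk can then be sent as its own request, or summarized first and the summaries combined, depending on the task.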
