Prompt Caching

Prompt caching is an API feature that lets a developer mark a portion of a prompt, typically a large, unchanging block such as a long document, a codebase, or a detailed set of instructions and examples, so the model provider stores its processed form and reuses it on later requests instead of reprocessing it every time. Because reading from the cache is far cheaper than reprocessing the same tokens, applications that repeatedly send the same context become both cheaper and faster.

Anthropic introduced prompt caching for Claude in public beta in August 2024 and made it generally available on its API by December 2024. The mechanics involve a tradeoff: writing to the cache costs more than a normal input token (Anthropic priced cache writes at 25% above the base input rate), but reading from it costs only a fraction (about 10% of the standard input price). For prompts that are reused enough times, the cheap reads more than pay back the more expensive write. Anthropic’s own examples showed a “chat with a book” case with a 100,000-token cached prompt achieving roughly a 90% cost reduction and a 79% latency reduction, and many-shot prompting cutting cost by about 86%. Cached content has a limited lifetime, refreshed each time it is used. Other providers offer comparable caching, sometimes applied automatically.

Why business readers should care: for any product that sends the same long context on every call, a customer-support bot grounded in the same policy manual, or a coding assistant that re-reads the same files, prompt caching can change the unit economics dramatically. It is one of the practical levers, alongside model choice and batching, that determines whether an AI feature is affordable at scale.

Sources

Related