Tokenization Cost Disparity Across Languages

Tokenization is the step where a language model splits text into the small units, called tokens, that it actually processes. Tokens are also the unit by which commercial models are billed and by which their context length is capped. A subtle unfairness hides in this step: tokenizers are trained on English-heavy data, so they encode English efficiently but split many other languages into far more tokens to express the same meaning. The 2023 paper “Do All Languages Cost the Same? Tokenization in the Era of Commercial Language Models” (arXiv 2305.13707) by Orevaoghene Ahia, Sachin Kumar, Hila Gonen, Noah A. Smith, Yulia Tsvetkov, and colleagues measured this across 22 typologically diverse languages.
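One part of the mechanism can be illustrated without a real tokenizer. The sketch below assumes a byte-level tokenizer, meaning one that starts from UTF-8 bytes before merging them into larger tokens; that assumption, the function name `utf8_bytes_per_char`, and the sample words are all illustrative and not taken from the paper. It shows only the starting point of segmentation: scripts that need more bytes per character begin with more pieces to merge. Actual token counts depend on the merges a specific tokenizer has learned, which this sketch does not model.

```python
def utf8_bytes_per_char(text: str) -> float:
    """Average number of UTF-8 bytes per Unicode code point in `text`."""
    if not text:
        raise ValueError("text must be non-empty")
    return len(text.encode("utf-8")) / len(text)


# Illustrative greetings chosen for this sketch, not data from the paper.
SAMPLES = {
    "English": "hello",
    "Russian": "привет",
    "Hindi": "नमस्ते",
}

if __name__ == "__main__":
    for language, word in SAMPLES.items():
        ratio = utf8_bytes_per_char(word)
        print(f"{language}: {ratio:.1f} UTF-8 bytes per character")
```

Byte count is only a rough proxy. To measure the real disparity, the same translated sentences would have to be run through the tokenizer of the model in question.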

Their finding was a compounding inequality. Because commercial APIs charge per token and cap context by tokens, speakers of many languages are effectively overcharged for the same information, and they often get poorer model performance on top of it. The paper notes this falls hardest on speakers who tend to come from regions where the APIs are already less affordable, turning a technical detail into a regressive cost structure.
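The cost arithmetic can be sketched in a few lines. Everything numeric here is hypothetical: the token counts, the language labels, and the price are placeholders chosen for illustration, not figures from the paper or from any provider's price list. The function names are likewise invented for this sketch.

```python
from typing import Dict


def token_premium(token_counts: Dict[str, int], baseline: str = "English") -> Dict[str, float]:
    """Ratio of each language's token count to the baseline's, for the same content."""
    base = token_counts[baseline]
    if base <= 0:
        raise ValueError("baseline token count must be positive")
    return {language: count / base for language, count in token_counts.items()}


def request_cost(tokens: int, price_per_1k_tokens: float) -> float:
    """Cost of a request under simple per-token pricing."""
    return tokens / 1000 * price_per_1k_tokens


def effective_context(context_limit_tokens: int, premium: float) -> float:
    """Context window expressed in baseline-equivalent tokens."""
    return context_limit_tokens / premium


# Hypothetical token counts for one passage translated into three languages.
counts = {"English": 100, "Language A": 250, "Language B": 400}
premiums = token_premium(counts)

if __name__ == "__main__":
    for language, premium in premiums.items():
        cost = request_cost(counts[language], price_per_1k_tokens=0.5)  # hypothetical price
        window = effective_context(8000, premium)  # hypothetical 8000-token limit
        print(f"{language}: {premium:.1f}x tokens, cost {cost:.3f}, "
              f"window worth {window:.0f} English-equivalent tokens")
```

Under these made-up numbers, a language that needs four times the tokens pays four times as much for the same passage and fits only a quarter as much content into the same context window, which is the compounding effect described above.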

This is a clear example of how a low-level design choice can quietly encode bias. It is not malice in the model’s reasoning; it is the arithmetic of how text gets segmented before the model ever runs.

For businesses operating globally, the practical takeaway is twofold. The same product can cost noticeably more to run, and work less well, for non-English customers. And the choice of tokenizer is a fairness and budgeting decision, not just an engineering one.
