In 1990 a team of eight researchers at IBM’s Thomas J. Watson Research Center (Peter F. Brown, John Cocke, Stephen and Vincent Della Pietra, Frederick Jelinek, John Lafferty, Robert Mercer, and Paul Roossin) published “A Statistical Approach to Machine Translation” in the journal Computational Linguistics. The paper is openly available through the ACL Anthology.
For decades, machine translation had been pursued mainly by writing explicit linguistic rules: grammars, dictionaries, and hand-coded transformations from one language to another. The IBM team took a different stance. They treated translation as a problem of probability, asking which sentence in the target language was most likely given a sentence in the source language, and estimating those probabilities from large collections of already-translated text.
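Written as a formula, the question "which target sentence is most likely given the source sentence" looks roughly like the following. The symbols are illustrative notation rather than anything taken from this text: f stands for the source sentence, e for a candidate translation, and ê for the translation finally chosen.

```latex
\hat{e} \;=\; \arg\max_{e} P(e \mid f)
        \;=\; \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
        \;=\; \arg\max_{e} P(e)\,P(f \mid e)
```

The second step is Bayes' rule, and P(f) can be dropped because it does not depend on e. The result is two pieces that can each be estimated from text: P(e), a language model that scores how plausible a sentence is in the target language, and P(f | e), a translation model that scores how well the two sentences correspond.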
This was a sharp break from the rule-based tradition. Instead of encoding human knowledge of language by hand, the method learned its associations from data: it counted how words and phrases lined up across millions of translated sentence pairs and used those statistics to drive translation. The approach drew directly on ideas from speech recognition, where Jelinek and colleagues had already shown that statistical models trained on data could beat rule-based systems.
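The idea of counting how words line up can be made concrete with a toy example. The sketch below is not the paper's actual model or data. It is a minimal, simplified version of the expectation-maximization procedure commonly associated with the simplest IBM word-alignment model, run on three made-up English-German sentence pairs. The function names `train_word_probs` and `best_translation` are hypothetical, chosen only for this illustration.

```python
from collections import defaultdict


def train_word_probs(pairs, iterations=20):
    """Estimate P(target word | source word) from sentence pairs.

    Each pass spreads every target word's "credit" across the source
    words in the same sentence pair, in proportion to the current
    probabilities, then re-estimates the probabilities from those
    fractional counts.
    """
    target_vocab = {t for _, tgt in pairs for t in tgt}
    uniform = 1.0 / len(target_vocab)
    # prob[(t, s)] approximates P(target word t | source word s).
    prob = defaultdict(lambda: uniform)

    for _ in range(iterations):
        count = defaultdict(float)   # fractional co-occurrence counts
        total = defaultdict(float)   # total credit given to each source word
        for src, tgt in pairs:
            for t in tgt:
                norm = sum(prob[(t, s)] for s in src)
                for s in src:
                    share = prob[(t, s)] / norm
                    count[(t, s)] += share
                    total[s] += share
        # Re-estimate: normalize counts per source word.
        prob = defaultdict(
            float, {(t, s): c / total[s] for (t, s), c in count.items()}
        )
    return prob


def best_translation(prob, source_word):
    """Return the target word with the highest learned probability."""
    candidates = {t: p for (t, s), p in prob.items() if s == source_word}
    return max(candidates, key=candidates.get)


# Made-up toy corpus: (English sentence, German sentence) as word lists.
pairs = [
    ("the house".split(), "das Haus".split()),
    ("the book".split(), "das Buch".split()),
    ("a book".split(), "ein Buch".split()),
]

prob = train_word_probs(pairs)
for word in ["the", "house", "book", "a"]:
    print(word, "->", best_translation(prob, word))
```

No dictionary is supplied anywhere. The program sees only that "the" appears with "das" in two pairs and that "book" appears with "Buch" in two pairs, and it uses those regularities to sort out the remaining words. Real systems add much more (word order, words that translate to several words or to nothing, and a language model), but the underlying bet is the same.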
For business readers, this 1990 paper is best understood as an intellectual root of much of what followed in language AI. The core bet, that you can learn language behavior from enough data rather than spelling out the rules, runs straight through statistical translation to word embeddings, sequence-to-sequence models, and today’s large language models.