Whisper (OpenAI speech-recognition model family)

Whisper is OpenAI’s open-source family of automatic speech recognition models. The official repository describes Whisper as “a general-purpose speech recognition model” that “is trained on a large dataset of diverse audio and is also a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.” Unlike OpenAI’s flagship commercial models, Whisper’s “code and model weights are released under the MIT License,” making it freely usable and self-hostable.

The technical foundation is the paper “Robust Speech Recognition via Large-Scale Weak Supervision” (arXiv 2212.04356) by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. The key idea is that training a Transformer sequence-to-sequence model on a very large and varied set of audio, using weak supervision rather than carefully hand-labeled data, produces a model that generalizes well across many languages and recording conditions without task-specific fine-tuning.

Whisper has shipped in multiple sizes, from small, fast models to larger, more accurate ones, and later added turbo variants tuned for speed. The repository is the authoritative reference for the current set of checkpoints and sizes; this entry does not attempt to list them. Because the weights are open, the community has also widely re-implemented and optimized Whisper.

Whisper is distributed as open weights under a permissive license. That sets it apart from OpenAI’s API-only model lines and helped make it a default building block for transcription.

Why business readers should care: Whisper sharply lowered the cost of building transcription, captioning, and voice-interface features. Because it is open and permissively licensed, organizations can run high-quality speech recognition on their own infrastructure without per-call API fees or sending audio to a third party.
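To make the self-hosting point concrete, here is a minimal Python sketch of a captioning step. It assumes the `openai-whisper` package from the official repository is installed, along with ffmpeg for audio decoding. The helper names (`format_timestamp`, `segments_to_srt`, `transcribe_to_srt`) and the default checkpoint name `"base"` are illustrative choices, not part of Whisper itself; check the repository for the checkpoints currently available. The sketch relies on the segment list that `transcribe` returns, where each segment carries `start`, `end`, and `text` fields.

```python
def format_timestamp(seconds: float) -> str:
    """Render a time offset in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Turn Whisper-style segments (dicts with start, end, text) into SRT text."""
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # A blank line separates consecutive caption blocks.
    return "\n".join(blocks)


def transcribe_to_srt(audio_path: str, model_name: str = "base") -> str:
    """Transcribe a local audio file and return captions in SRT format.

    Hypothetical wrapper: runs entirely on the local machine, so no audio
    leaves your infrastructure. The checkpoint name is an assumption.
    """
    import whisper  # imported lazily so the helpers above work without it

    model = whisper.load_model(model_name)
    result = model.transcribe(audio_path)
    return segments_to_srt(result["segments"])
```

With the package installed, calling `transcribe_to_srt("meeting.mp3")` on a local file (the file name is a placeholder) would return caption text ready to save as an `.srt` file. Keeping the formatting helpers separate from the model call means the caption logic can be checked without downloading any weights.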
