On September 21, 2022, OpenAI released Whisper, described on its announcement page as a system for automatic speech recognition. The official code repository describes it as a general-purpose speech recognition model trained on a large dataset of diverse audio, and as a multitasking model that can perform multilingual speech recognition, speech translation, and language identification.
The technical details appear in the paper “Robust Speech Recognition via Large-Scale Weak Supervision” by Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, submitted to arXiv on December 6, 2022. The paper shows that training a Transformer sequence-to-sequence model on 680,000 hours of varied audio collected from the web, using weak supervision rather than carefully hand-labeled data, produces a model that generalizes well across many languages and recording conditions.
Whisper made high-quality speech recognition freely available to developers and businesses. Released with its code and model weights under the permissive MIT license, it lowered the cost of building transcription, captioning, and voice-driven applications, and it extended the generative-AI era beyond text and images into audio.