“Listen, Attend and Spell,” submitted to arXiv on August 5, 2015 by William Chan of Carnegie Mellon University and Navdeep Jaitly, Quoc Le, and Oriol Vinyals of Google Brain, introduced an attention-based encoder-decoder for speech recognition. A “listener” encoder, a pyramidal recurrent network, condenses filter-bank audio features into a shorter sequence of higher-level features, and a “speller” decoder emits the transcript one character at a time, using an attention mechanism to focus on the relevant parts of the audio for each character.
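To make the attention step concrete, here is a minimal sketch of what the speller does before emitting one character: score every encoded audio frame against the current decoder state, normalize the scores with a softmax, and take the weighted average as a context vector. This is an illustration under stated assumptions, not the paper's implementation: the function name `attend`, the toy vectors, and the plain dot-product scoring are all hypothetical choices (the paper passes both vectors through small learned networks before comparing them).

```python
import math

def attend(decoder_state, encoder_states):
    """Return (weights, context) for one decoding step (illustrative only)."""
    # Score each encoder frame by its dot product with the decoder state.
    # Assumption: raw dot product; the paper learns a transform of each side first.
    scores = [sum(s * h for s, h in zip(decoder_state, frame))
              for frame in encoder_states]
    # Softmax, shifted by the max score for numerical stability.
    peak = max(scores)
    exps = [math.exp(x - peak) for x in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Context vector: attention-weighted average of the encoder frames.
    dim = len(encoder_states[0])
    context = [sum(w * frame[d] for w, frame in zip(weights, encoder_states))
               for d in range(dim)]
    return weights, context

# Toy example: three encoded audio frames with two features each.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
weights, context = attend([2.0, 0.0], frames)
```

In this toy run the first and third frames align equally well with the decoder state, so they receive equal weight and the second frame is largely ignored. In the full model the context vector feeds the prediction of the next character, and the weights are recomputed at every step.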
Crucially, the model conditions each character on the ones already emitted rather than assuming independence between outputs, an advance over earlier CTC-based systems, and it learns every component jointly rather than stitching together separate acoustic, pronunciation, and language modules. On a subset of the Google voice search task, the paper reported a 14.1 percent word error rate without a dictionary or language model and 10.3 percent with language model rescoring of the top 32 beams.
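The independence point can be written out. As a sketch, with notation assumed here for illustration ($x$ for the audio features, $y_i$ for the $i$-th output character, $\pi_t$ for the CTC label at audio frame $t$):

```latex
% Attention-based encoder-decoder: each character is conditioned
% on the audio and on all characters emitted before it.
P(y \mid x) = \prod_{i} P(y_i \mid x, y_{<i})

% CTC: frame-level labels are conditionally independent given the audio.
P(\pi \mid x) = \prod_{t} P(\pi_t \mid x)
```

CTC then sums this probability over every frame-level labeling that collapses to the same transcript, so dependencies between characters have to come from elsewhere, typically a separate language model.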
Why business readers should care: Listen, Attend and Spell helped establish the attention-based sequence-to-sequence template that underlies much of modern speech recognition and, more broadly, the encoder-decoder architectures later generalized by the Transformer. It is a key step from rigid pipelines toward flexible models that learn directly from audio to text.