Conformer: Convolution-augmented Transformer for Speech Recognition

“Conformer: Convolution-augmented Transformer for Speech Recognition,” submitted to arXiv on May 16, 2020 by Anmol Gulati, Ruoming Pang, and colleagues at Google, combines two complementary tools. Self-attention, from the Transformer, is good at capturing long-range, global structure, while convolutions excel at local patterns. Each Conformer block places a self-attention module and a convolution module between two feed-forward layers, so the model can attend to the whole utterance and to fine-grained acoustic detail at once.
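
To make the local-versus-global distinction concrete, here is a minimal toy sketch in plain Python. It assumes the module ordering described in the Conformer paper (half-weighted feed-forward, self-attention, convolution, half-weighted feed-forward, each added back through a residual connection). Everything else is a simplification for illustration: each frame is a single number rather than a feature vector, the function names and the fixed smoothing kernel are hypothetical, and learned weights, multiple heads, positional encoding, gating, and normalization are all omitted.

```python
import math

def self_attention(xs):
    """Global mixing: each output is a softmax-weighted average of ALL frames."""
    out = []
    for q in xs:
        scores = [q * k for k in xs]                 # toy scalar dot-product scores
        m = max(scores)                              # subtract max for numerical stability
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, xs)) / total)
    return out

def local_conv(xs, kernel=(0.25, 0.5, 0.25)):
    """Local mixing: each output sees only a small, zero-padded window of frames."""
    half = len(kernel) // 2
    out = []
    for i in range(len(xs)):
        acc = 0.0
        for j, w in enumerate(kernel):
            idx = i + j - half
            if 0 <= idx < len(xs):
                acc += w * xs[idx]
        out.append(acc)
    return out

def feed_forward(xs):
    """Toy stand-in for a position-wise feed-forward module (a plain ReLU)."""
    return [max(0.0, x) for x in xs]

def residual(xs, ys, weight=1.0):
    """Add a (possibly down-weighted) module output back onto its input."""
    return [x + weight * y for x, y in zip(xs, ys)]

def conformer_block(xs):
    """Toy block: half-step FFN, attention, convolution, half-step FFN."""
    xs = residual(xs, feed_forward(xs), 0.5)
    xs = residual(xs, self_attention(xs))
    xs = residual(xs, local_conv(xs))
    xs = residual(xs, feed_forward(xs), 0.5)
    return xs

if __name__ == "__main__":
    quiet = [1.0, 0.0, 0.0, 0.0, 0.0]
    spike = [1.0, 0.0, 0.0, 0.0, 5.0]   # differs only in the last frame
    # The convolution at frame 0 cannot see the distant change; attention can.
    print(local_conv(quiet)[0] == local_conv(spike)[0])
    print(self_attention(quiet)[0] == self_attention(spike)[0])
    print(len(conformer_block(spike)))
```

The two inputs differ only in their final frame. The convolution output at the first frame is identical for both, because its window never reaches that far, while the attention output at the first frame changes. The block keeps both behaviors by stacking the two modules.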

At publication, the model set state-of-the-art accuracy on the LibriSpeech benchmark, reaching word error rates of 2.1 percent on the test-clean set and 4.3 percent on the harder test-other set without an external language model (1.9 percent and 3.9 percent with one). The architecture quickly became a default backbone for production speech recognition.
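
For readers unfamiliar with the metric, word error rate is the number of word substitutions, deletions, and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The sketch below computes it with a standard word-level edit distance. The function name and the whitespace tokenization are illustrative assumptions. Benchmark scoring tools typically also normalize case and punctuation, which this sketch does not.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

if __name__ == "__main__":
    # One dropped word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

On this scale, a word error rate of 2.1 percent corresponds to roughly 21 word errors per 1,000 reference words.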

Why business readers should care: Conformer is one of the most widely deployed speech recognition architectures, underpinning many transcription and voice products. Its design lesson, that combining local and global modeling beats either alone, recurs across modern audio and sequence models.

Last verified June 7, 2026