Encoder-Decoder Architecture

The encoder-decoder architecture splits a sequence-to-sequence task into two stages. An encoder reads the entire input (a sentence, an audio clip, an image) and compresses it into an internal representation. A decoder then takes that representation and generates an output sequence one element at a time, stopping when it emits an end-of-sequence marker. The design was popularized for machine translation in 2014, most influentially in “Sequence to Sequence Learning with Neural Networks” by Sutskever, Vinyals, and Le, which used one recurrent network to read a source sentence and a second to emit the translation.
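The two-stage flow can be sketched in a few lines of Python. This is a toy illustration only: the lookup tables `CONCEPTS` and `FRENCH`, and the function names `encode`, `decode_step`, and `translate`, are hypothetical stand-ins for trained neural networks, chosen to make the control flow visible rather than to show how a real model computes anything.

```python
# Toy illustration: lookup tables stand in for trained neural networks.
EOS = "<eos>"  # hypothetical end-of-sequence marker

# Hypothetical source-side knowledge: English word -> concept id.
CONCEPTS = {"the": 0, "potato": 1}
# Hypothetical target-side knowledge: concept id -> French words.
FRENCH = {0: ["la"], 1: ["pomme", "de", "terre"]}


def encode(source_tokens):
    """Stage 1: read the whole input and return an internal representation."""
    return [CONCEPTS[token] for token in source_tokens]


def decode_step(representation, generated):
    """Choose the next output token from the representation and the output so far."""
    target = [word for concept in representation for word in FRENCH[concept]]
    if len(generated) < len(target):
        return target[len(generated)]
    return EOS


def translate(source_tokens, max_len=20):
    representation = encode(source_tokens)  # the encoder runs once
    generated = []
    for _ in range(max_len):                # the decoder runs once per output token
        token = decode_step(representation, generated)
        if token == EOS:
            break
        generated.append(token)
    return generated


print(translate(["the", "potato"]))  # two words in, four words out
```

The point of the sketch is the shape: the encoder is called once on the full input, while the decoder is called repeatedly, each time seeing what it has already written.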

The split is natural because the input and output can have different lengths and even different vocabularies: a five-word English sentence might become an eight-word French one. Keeping a dedicated reader and a dedicated writer lets each specialize. The early versions had a famous weakness: the encoder had to cram everything into one fixed-length vector, however long the input. The attention mechanism was introduced to relieve exactly this bottleneck, by letting the decoder look back at all of the encoder’s intermediate states at each output step.
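The difference between the fixed-length bottleneck and attention can be shown with a small numeric sketch. It assumes dot-product scoring, which is one common choice picked here for brevity and not necessarily the scoring function used in the early attention papers. The vectors and the names `attend`, `encoder_states`, and `query` are illustrative assumptions.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_states):
    """Weight every encoder state by its similarity to the decoder's query."""
    scores = [sum(q * h for q, h in zip(query, state)) for state in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [
        sum(w * state[i] for w, state in zip(weights, encoder_states))
        for i in range(dim)
    ]
    return weights, context


# One made-up vector per input position.
encoder_states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Without attention: the decoder sees only a single summary vector.
bottleneck = encoder_states[-1]

# With attention: the decoder builds a fresh mix of all states at each step.
query = [0.0, 2.0]  # hypothetical decoder state at one output step
weights, context = attend(query, encoder_states)
print(weights, context)
```

With a different `query` at the next output step, the weights shift to other input positions, which is what "looking back" means in practice.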

The original Transformer was itself an encoder-decoder, and the pattern remains common in translation and speech models. Many modern large language models, by contrast, are decoder-only: they drop the separate encoder and treat the prompt and the response as one continuous sequence. Some models, such as T5, deliberately kept the full encoder-decoder form and framed every task as text-in, text-out.

Why business readers should care: encoder-decoder is the shape behind translation, summarization, and speech-to-text systems. Recognizing whether a model is encoder-decoder or decoder-only helps explain what it is built to do and how it handles the relationship between input and output.
