Encoder-Decoder Models

Transformer-based encoder-decoder models combine elements of both BERT-like bidirectional encoders and GPT-like autoregressive decoders to create architectures optimized for sequence-to-sequence tasks like translation, summarization, and question answering.

In these models, the encoder processes the entire input sequence bidirectionally to create context-rich representations, while the decoder generates the output sequence autoregressively, attending to previously generated tokens through masked self-attention and to the encoded input through cross-attention.
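A minimal sketch of this wiring, using PyTorch's generic Transformer layers rather than any specific pretrained model: the encoder reads the full (embedded) input without a mask, while the decoder combines a causal mask over the target prefix with cross-attention over the encoder's output. The dimensions and layer counts below are illustrative, not taken from any particular model.

```python
import torch
import torch.nn as nn

d_model, nhead = 512, 8

encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model, nhead, batch_first=True), num_layers=6)
decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model, nhead, batch_first=True), num_layers=6)

src = torch.randn(1, 10, d_model)   # embedded source sequence (batch, len, dim)
tgt = torch.randn(1, 4, d_model)    # embedded target prefix generated so far

# Bidirectional encoding: every source position attends to every other.
memory = encoder(src)

# Causal mask: each target position may only attend to earlier target positions.
causal_mask = nn.Transformer.generate_square_subsequent_mask(tgt.size(1))

# Masked self-attention over the target prefix plus cross-attention over `memory`.
out = decoder(tgt, memory, tgt_mask=causal_mask)
print(out.shape)  # torch.Size([1, 4, 512])
```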

T5 (Text-to-Text Transfer Transformer) exemplifies this approach by framing all NLP tasks as text-to-text problems, using the same encoder-decoder architecture regardless of whether the task involves classification, translation, or question answering. This unified approach allows a single model to handle diverse tasks, distinguished only by task-specific text prefixes such as "translate English to German:" or "summarize:".
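The snippet below is a sketch of that text-to-text interface using the Hugging Face transformers library, assuming the publicly available t5-small checkpoint; only the input prefix changes between tasks.

```python
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("t5-small")
model = T5ForConditionalGeneration.from_pretrained("t5-small")

# Different tasks are distinguished only by the text prefix of the input.
prompts = [
    "translate English to German: The house is wonderful.",
    "summarize: The encoder reads the whole input; the decoder writes the output.",
]

for prompt in prompts:
    inputs = tokenizer(prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=40)
    print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```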

BART (Bidirectional and Auto-Regressive Transformers) similarly combines a bidirectional encoder with an autoregressive decoder, but is pre-trained as a denoising autoencoder using corruption strategies such as text infilling, where spans of text are replaced with a single mask token and the decoder must reconstruct the original sequence. This pre-training makes the model particularly effective for tasks requiring both understanding and generation capabilities.
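A simplified sketch of text-infilling corruption on a whitespace-tokenized example; the actual BART recipe samples span lengths from a Poisson distribution (lambda = 3) and corrupts multiple spans per document, and the function and span limit below are illustrative choices.

```python
import random

def text_infill(tokens, mask_token="<mask>", max_span=3):
    # Replace one contiguous span with a single mask token.
    # A 0-length span simply inserts a mask without removing anything.
    span_len = random.randint(0, max_span)
    start = random.randint(0, max(0, len(tokens) - span_len))
    return tokens[:start] + [mask_token] + tokens[start + span_len:]

source = "the quick brown fox jumps over the lazy dog".split()
corrupted = text_infill(source)
print("encoder input :", " ".join(corrupted))
print("decoder target:", " ".join(source))
```

Because a single mask token can stand in for a span of any length, the model must also learn how many tokens are missing, which is part of what makes this objective effective for generation.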