GPT (Generative Pre-trained Transformer) models, developed by OpenAI beginning in 2018, applied the transformer architecture to autoregressive language modeling—predicting the next token given previous tokens. This approach created powerful text generation capabilities while still enabling effective transfer learning to downstream tasks.
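To make the training objective concrete, the sketch below shows next-token prediction as a cross-entropy loss over shifted sequences. It assumes a PyTorch setting; the random logits stand in for a model's output, and the toy vocabulary and sequence sizes are illustrative, not GPT's actual configuration.

```python
import torch
import torch.nn.functional as F

# Toy dimensions for illustration only.
vocab_size, seq_len, batch = 100, 8, 2

# In practice these logits would come from model(tokens); here they are random.
logits = torch.randn(batch, seq_len, vocab_size)
tokens = torch.randint(0, vocab_size, (batch, seq_len))

# Autoregressive objective: predict token t+1 from positions up to t,
# so predictions are taken at positions 0..T-2 and targets at 1..T-1.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),  # predictions for each prefix
    tokens[:, 1:].reshape(-1),               # the next token at each position
)
print(loss.item())
```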

Unlike BERT's bidirectional approach, GPT models use a causal attention mask that prevents tokens from attending to future positions, preserving the left-to-right generation capability. This unidirectional approach creates models that excel at text generation tasks while still developing useful representations for understanding.
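A minimal sketch of this causal masking, again assuming PyTorch with toy dimensions: scores for future positions are set to negative infinity before the softmax, so each position can only attend to itself and earlier positions.

```python
import torch
import torch.nn.functional as F

seq_len, d = 5, 16
q = torch.randn(seq_len, d)
k = torch.randn(seq_len, d)
v = torch.randn(seq_len, d)

# Scaled dot-product attention scores.
scores = q @ k.T / d ** 0.5

# Upper-triangular entries correspond to future tokens; masking them keeps
# attention strictly left-to-right, which is what allows generation.
causal_mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(causal_mask, float("-inf"))

attn_output = F.softmax(scores, dim=-1) @ v
print(attn_output.shape)  # (5, 16)
```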

The GPT series has followed a clear scaling trajectory, with each generation substantially increasing parameter count, training data, and capabilities. GPT-1 (2018) demonstrated the potential of the approach, GPT-2 (2019) showed surprisingly coherent text generation, and GPT-3 (2020) revealed emergent capabilities such as few-shot learning that the models were never explicitly trained for.

GPT models have profoundly influenced NLP by demonstrating that a single architecture and training objective can develop general-purpose language capabilities applicable across diverse tasks. Rather than creating specialized architectures for different problems, GPT showed that scaled autoregressive models can perform translation, summarization, question-answering, and even reasoning through simple text prompting.
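The prompting pattern is simple enough to show directly. The snippet below sketches a few-shot translation prompt in the style popularized by GPT-3: the task is described in plain text, a handful of examples follow, and the model's continuation of the final line supplies the answer. The final query sentence is made up for illustration.

```python
# Few-shot prompt: task description, example pairs, then an unfinished query.
prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"
    "cheese => fromage\n"
    "plush giraffe => girafe peluche\n"
    "the book is on the table =>"
)

# Feeding this prompt to an autoregressive model and sampling the continuation
# yields the translation, with no task-specific fine-tuning or architecture change.
print(prompt)
```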