Language models are computational systems that learn statistical patterns of language in order to predict, generate, and interpret text. Their evolution shows remarkable growth in capability.

Early language models such as n-grams were simple probability distributions over short word sequences, with no deeper grasp of meaning. Neural networks brought RNNs and LSTMs, which could track longer-range dependencies, and then transformers, which revolutionized the field by processing entire sequences in parallel while attending to relationships between distant words.
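To make the contrast concrete, here is a minimal sketch of a bigram model, the simplest kind of n-gram: it estimates the probability of the next word purely from counts of adjacent word pairs, with no notion of meaning or long-range context. The toy corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up toy corpus, purely for illustration.
corpus = "the cat sat on the mat the cat slept".split()

# Count how often each word follows each preceding word.
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(prev):
    # Maximum-likelihood estimate: P(next | prev) = count(prev, next) / count(prev).
    total = sum(counts[prev].values())
    return {word: c / total for word, c in counts[prev].items()}

print(next_word_probs("the"))  # e.g. {'cat': 0.667, 'mat': 0.333}
```

Everything such a model "knows" is a table of local co-occurrence frequencies, which is exactly the limitation that recurrent networks and, later, transformers were designed to overcome.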

The scaling revolution, in which larger models with more parameters were trained on more data, revealed emergent abilities. As models grew, they developed qualitatively new capabilities such as few-shot learning, code generation, and multi-step reasoning that were never explicitly programmed.
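Few-shot learning here usually means in-context learning: the task is demonstrated inside the prompt, and a sufficiently large model continues the pattern without any weight updates. The prompt below is a hypothetical illustration; the translation pairs and the expected completion are assumptions, not output from any particular model.

```python
# Hypothetical few-shot prompt: a handful of demonstrations followed by an
# unfinished example that the model is expected to complete.
few_shot_prompt = """Translate English to French.

English: cheese
French: fromage

English: bicycle
French: vélo

English: library
French:"""

# Fed to a large model's text-completion interface, this prompt would
# typically elicit a continuation such as " bibliothèque".
print(few_shot_prompt)
```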

These models learn by predicting masked or next tokens in vast text corpora, but in doing so, they absorb knowledge about the world, logical relationships, and reasoning patterns. The boundary between learning language statistics and developing broader intelligence has blurred, raising profound questions about understanding and cognition.
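As a sketch of the next-token objective, the snippet below computes the average cross-entropy of predicting each token from its prefix. The token ids, vocabulary size, and random logits are toy stand-ins for what a real model would produce from learned weights.

```python
import numpy as np

# Toy sequence and vocabulary; in practice these come from a tokenizer.
token_ids = np.array([4, 7, 2, 9])
vocab_size = 12

# Stand-in for model output: one logit vector per prefix position.
rng = np.random.default_rng(0)
logits = rng.normal(size=(len(token_ids) - 1, vocab_size))

def cross_entropy(logits, targets):
    # Softmax over the vocabulary, then the negative log-probability
    # assigned to each target token, averaged over positions.
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Each position predicts the *next* token: inputs are token_ids[:-1],
# targets are token_ids[1:]. Training minimizes this loss over a corpus.
loss = cross_entropy(logits, token_ids[1:])
print(f"average next-token cross-entropy: {loss:.3f}")
```

Minimizing this single objective at scale is what forces the model to pick up facts, syntax, and regularities of reasoning, since all of them help predict what comes next.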