Self-attention Mechanisms
Self-attention is the revolutionary mechanism at the heart of transformer models, enabling them to weigh the importance of different words in relation to each other when processing language. Unlike earlier recurrent approaches, which compress everything seen so far into a fixed-size hidden state, attention dynamically focuses on relevant pieces of information regardless of their position in the sequence.
To understand self-attention, imagine reading a sentence where the meaning of one word depends on another word far away. For example, in 'The trophy didn't fit in the suitcase because it was too big,' what does 'it' refer to? A human reader knows 'it' means the trophy, not the suitcase—because trophies can be 'big' in a way that prevents fitting. Self-attention gives neural networks this same ability to connect related words and resolve such ambiguities.
The mechanism works through a brilliant mathematical formulation. For each position in a sequence, the model creates three vectors: a query, a key, and a value. You can think of the query as a question being asked by a word: "Which other words should I pay attention to?" Each key represents a potential answer to that question. By computing the dot product between a word's query and every key, the model obtains relevance scores; these are scaled by the square root of the key dimension and passed through a softmax, producing attention weights that sum to one. The weights are then used to form a weighted sum of the value vectors, yielding a context-aware representation of the word.
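As a concrete illustration, here is a minimal NumPy sketch of scaled dot-product self-attention for a single sequence. It omits batching, masking, and multiple heads, and the variable names, shapes, and random projection matrices are illustrative assumptions rather than taken from any particular implementation.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention for one sequence.

    X            : (seq_len, d_model) input embeddings, one row per token
    W_q, W_k, W_v: (d_model, d_k) projection matrices (learned in a real
                   model, random here purely for illustration)
    Returns the context-aware outputs and the attention weights.
    """
    Q = X @ W_q                      # queries: "what am I looking for?"
    K = X @ W_k                      # keys:    "what do I contain?"
    V = X @ W_v                      # values:  "what do I contribute?"

    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)  # (seq_len, seq_len) relevance scores

    # Softmax over each row turns scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)

    return weights @ V, weights      # weighted sum of values, plus the weights


# Toy usage: 5 tokens, 8-dimensional embeddings, 4-dimensional queries/keys.
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))
W_q, W_k, W_v = (rng.normal(size=(8, 4)) for _ in range(3))
output, weights = self_attention(X, W_q, W_k, W_v)
print(output.shape)   # (5, 4): one context-aware vector per token
print(weights[0])     # how strongly token 0 attends to each of the 5 tokens
```

Because each row of the weight matrix sums to one, the i-th row can be read directly as a distribution over which tokens position i is attending to.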
This approach offers several key advantages: it operates in parallel across the entire sequence (enabling efficient training), captures relationships regardless of distance (solving the long-range dependency problem), and provides interpretable attention weights that show which words the model is focusing on when making predictions.
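To make the parallelism point concrete, the short sketch below (again with illustrative names and random data, not drawn from the source) computes attention scores two ways: one position at a time, and as a single matrix product over all positions at once. The results match, which is why the whole sequence can be processed simultaneously rather than step by step.

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_k = 6, 4
Q = rng.normal(size=(seq_len, d_k))   # one query per position
K = rng.normal(size=(seq_len, d_k))   # one key per position

# Sequential view: score each position's query against all keys, one at a time.
scores_loop = np.stack([Q[i] @ K.T for i in range(seq_len)])

# Parallel view: the same scores fall out of one matrix multiplication,
# so every position can be handled at once on parallel hardware.
scores_matmul = Q @ K.T

print(np.allclose(scores_loop, scores_matmul))  # True
```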
Beyond its technical elegance, self-attention represents a profound shift in how neural networks process sequential data—from the rigid, distance-penalizing approaches of the past to a flexible, content-based mechanism that better mirrors human understanding. This paradigm shift unlocked capabilities in language understanding that had remained elusive for decades.