Self-Attention Mechanism

Multi-head self-attention is the core innovation that lets the model assess every token's relevance to every other token when building contextualized representations. For each position, attention computes a weighted sum over the representations at all positions, with the weights determined by learned compatibility scores between tokens.
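
As a rough illustration of that weighted-sum computation, here is a minimal single-head sketch in NumPy. The function name, the toy shapes, and the use of the same vectors as queries, keys, and values are assumptions made for brevity, not a reference implementation.

```python
import numpy as np

def attention(Q, K, V):
    """Each output row is a weighted sum over all positions' value vectors."""
    d_k = Q.shape[-1]
    # Compatibility score between every query position and every key position,
    # scaled by sqrt(d_k) to keep the softmax well-behaved.
    scores = Q @ K.T / np.sqrt(d_k)
    # Softmax over the key axis turns scores into weights that sum to 1.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)
    # Weighted sum of value vectors gives the contextualized representation.
    return weights @ V, weights

# Toy input: 5 token vectors of dimension 8, used as queries, keys, and values.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))
out, w = attention(x, x, x)
print(out.shape, w.shape)  # (5, 8) (5, 5)
```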

Multiple attention heads operate in parallel, each able to specialize in a different type of relationship: some focus on syntax, others on semantics, entity relationships, or discourse patterns. This lets the model capture diverse kinds of information simultaneously.
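
The sketch below, which reuses the `attention` function from the previous example, shows one plausible way to run several heads in parallel. The head count, dimensions, and random weight matrices are arbitrary assumptions chosen for illustration.

```python
def multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads):
    """Split the model dimension into heads, attend per head, then recombine."""
    seq_len, d_model = x.shape
    d_head = d_model // num_heads
    # Project the input into queries, keys, and values, then split into heads.
    Q = (x @ W_q).reshape(seq_len, num_heads, d_head)
    K = (x @ W_k).reshape(seq_len, num_heads, d_head)
    V = (x @ W_v).reshape(seq_len, num_heads, d_head)
    # Each head attends independently, so each can learn a different relation type.
    heads = [attention(Q[:, h], K[:, h], V[:, h])[0] for h in range(num_heads)]
    # Concatenate the per-head outputs and mix them with an output projection.
    return np.concatenate(heads, axis=-1) @ W_o

# Toy setup: 5 tokens, model dimension 8, 2 heads of size 4, random weights.
W_q, W_k, W_v, W_o = (rng.normal(size=(8, 8)) * 0.1 for _ in range(4))
print(multi_head_attention(x, W_q, W_k, W_v, W_o, num_heads=2).shape)  # (5, 8)
```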

Self-attention represents each token as a query, a key, and a value, conceptually similar to information retrieval: the query expresses what a position is looking for, the keys advertise what each position offers, and the values carry the content that gets retrieved. When analyzing "bank" in "I deposited money in the bank," the model can attend directly to "money" and "deposited" regardless of their distance, recognizing the financial sense of the word.
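
To make that retrieval analogy concrete, the sketch below projects toy embeddings for the sentence into queries, keys, and values and inspects the attention row for "bank". The embeddings and projection matrices are untrained random stand-ins, so the printed weights are arbitrary; the point is only that this row is a distribution over every token, which is where a trained model would shift mass toward "money" and "deposited".

```python
import numpy as np

tokens = ["I", "deposited", "money", "in", "the", "bank"]

# Random stand-ins for learned embeddings and projection matrices.
rng = np.random.default_rng(0)
emb = rng.normal(size=(len(tokens), 8))
W_q, W_k, W_v = (rng.normal(size=(8, 8)) * 0.1 for _ in range(3))

Q, K, V = emb @ W_q, emb @ W_k, emb @ W_v      # queries, keys, values
scores = Q @ K.T / np.sqrt(Q.shape[-1])        # pairwise compatibility
weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

# The row for "bank" assigns a weight to every token in the sentence, so
# "money" and "deposited" are reachable in one step despite the distance.
bank_idx = tokens.index("bank")
for tok, w in zip(tokens, weights[bank_idx]):
    print(f"{tok:>10}: {w:.3f}")
```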

Because every position attends to every other position in a single step, self-attention captures long-range dependencies more effectively than earlier sequential approaches, and the computation maps naturally onto hardware built for parallel workloads, addressing a fundamental limitation of those earlier models.