Self-attention is the mechanism that makes contextual understanding possible, and it is worth understanding in concrete terms. When a transformer processes the sentence 'She deposited her check at the bank,' each word is first converted to an initial vector (the token embedding). Then, in each attention layer, every token's representation is updated by looking at the whole sequence.

Mechanically, each token produces three vectors: a query (what am I looking for?), a key (what do I offer?), and a value (what information do I carry?). For every pair of tokens, the model takes the dot product of one token's query with the other's key, scales it, and normalizes the scores with a softmax; the results are the attention weights. A higher weight means 'pay more attention to this token.' The new representation of each token is then the weighted average of the values of all tokens in the sequence, including itself.

For 'bank' in our sentence, the attention mechanism discovers through training that 'deposited' and 'check' are the relevant context signals and weights them heavily, pulling 'bank' toward its financial meaning.

Multi-head attention runs several attention operations in parallel, each potentially capturing a different type of relationship: syntactic dependencies, semantic similarity, coreference. Stacking many attention layers lets the model build increasingly abstract, context-aware representations.

The limit of self-attention is computational: comparing every query with every key scales quadratically with sequence length, which is why long-context models rely on optimizations such as Flash Attention, sliding window attention, or approximate attention methods to remain tractable.
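The query/key/value arithmetic described above can be sketched in a few lines of NumPy. This is a minimal single-head illustration under simplifying assumptions (random weight matrices and inputs, no masking, no learned parameters), not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention.
    X has shape (seq_len, d_model): one embedding per token."""
    Q = X @ Wq  # queries: what each token is looking for
    K = X @ Wk  # keys: what each token offers
    V = X @ Wv  # values: the information each token carries
    d_k = K.shape[-1]
    # Dot product of every query with every key, scaled by sqrt(d_k).
    # The resulting (seq_len, seq_len) matrix is where the quadratic
    # cost in sequence length comes from.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    # Each output row is a weighted average of ALL value vectors,
    # including the token's own value.
    return weights @ V, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_k = 7, 16, 8  # e.g. the 7 tokens of our example sentence
X = rng.normal(size=(seq_len, d_model))          # toy embeddings (made up)
Wq, Wk, Wv = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out, weights = self_attention(X, Wq, Wk, Wv)
print(out.shape, weights.shape)  # (7, 8) (7, 7)
```

Multi-head attention would simply run several such heads in parallel (typically with d_k = d_model / num_heads) and concatenate their outputs; with trained rather than random weights, the row of `weights` for 'bank' is where high scores on 'deposited' and 'check' would appear.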
How Self-Attention Powers Contextual Understanding: A Mechanism Walkthrough
Contextual understanding in transformers comes from self-attention, a mechanism where every token attends to every other token in the input simultaneously. Understanding how attention produces context-sensitive word meanings reveals why transformers dominate NLP and where their capabilities have inherent limits.
Tags: contextual-understanding-analogy, self-attention, transformers