Multi-Head Attention and Positional Encoding: Inside the Transformer
Advanced · AI & ML · Deep Learning · Knowledge


Multi-head attention runs several attention operations in parallel, letting the model simultaneously capture syntactic structure, semantic relationships, and coreference. Positional encoding solves a key problem: since attention is order-agnostic, position information must be explicitly injected. These two mechanisms together define transformer expressiveness.

Single-head attention is powerful but limited: it can only represent one type of token-to-token relationship per layer. Multi-head attention runs h attention heads in parallel, each initialized differently and learning to specialize in a different relationship type. One head might capture syntactic dependencies (subject-verb agreement), another semantic similarity, another coreference resolution (connecting 'she' to 'Marie Curie' earlier in the text). The outputs from all heads are concatenated and projected back to the model's hidden dimension. In practice, head specialization emerges from training rather than being prescribed, and ablation studies show that different heads learn genuinely different functions, with some heads consistently more important than others for specific task types.

Positional encoding addresses a fundamental property of self-attention: because every position attends to every other position simultaneously, the operation is permutation-invariant; it doesn't know that token 5 comes after token 4. The original transformer used fixed sinusoidal positional encodings added to the token embeddings; learned absolute position embeddings were adopted later. Relative positional encodings (RoPE, ALiBi) represent position as the distance between tokens rather than as an absolute index. This generalizes better to sequences longer than those seen during training, which is why RoPE is used in most modern long-context models, including LLaMA and Mistral.

Flash Attention is the engineering optimization that makes attention computationally feasible at scale: by reordering the computation to minimize memory bandwidth, it reduces attention's memory requirement from O(n²) to O(n) (the full n×n score matrix is never materialized) and runs 2–4x faster on GPU hardware, even though the arithmetic cost remains O(n²).
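The split-attend-concatenate-project flow described above can be sketched in plain NumPy. This is a minimal illustration with made-up weight matrices and no masking or batching, not a production implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, h):
    """Split d_model into h heads, attend per head, concat, project."""
    n, d_model = x.shape
    d_head = d_model // h
    q, k, v = x @ Wq, x @ Wk, x @ Wv                  # each (n, d_model)
    # Reshape so each head sees its own d_head-wide slice: (h, n, d_head)
    q = q.reshape(n, h, d_head).transpose(1, 0, 2)
    k = k.reshape(n, h, d_head).transpose(1, 0, 2)
    v = v.reshape(n, h, d_head).transpose(1, 0, 2)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)  # (h, n, n)
    out = softmax(scores) @ v                            # (h, n, d_head)
    out = out.transpose(1, 0, 2).reshape(n, d_model)     # concat heads
    return out @ Wo                                      # final projection

rng = np.random.default_rng(0)
n, d_model, h = 6, 16, 4
x = rng.standard_normal((n, d_model))
W = [rng.standard_normal((d_model, d_model)) * 0.1 for _ in range(4)]
y = multi_head_attention(x, *W, h)
print(y.shape)  # (6, 16) — output has the same hidden dimension as the input
```

Because each head works on a d_model/h slice, the total cost is comparable to one full-width head, while the model gets h independent attention patterns.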
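The fixed sinusoidal scheme from the original transformer can be written in a few lines; this is a minimal sketch of the standard formula (sin on even dimensions, cos on odd, frequencies falling geometrically with a base of 10000):

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos(same)."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # broadcast to (n, d/2)
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims
    pe[:, 1::2] = np.cos(angles)                   # odd dims
    return pe

pe = sinusoidal_encoding(50, 8)
print(pe.shape)  # (50, 8)
print(pe[0])     # position 0: all sin terms are 0, all cos terms are 1
```

These vectors are simply added to the token embeddings, so the same word at different positions enters the network with a slightly different representation.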
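RoPE's defining property, that the attention score between a query and key depends only on their relative offset, can be checked numerically. The sketch below is a simplified single-sequence version, not the per-head implementation used in LLaMA or Mistral:

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate each (even, odd) dimension pair of x[pos] by pos * theta_i."""
    n, d = x.shape
    pos = np.arange(n)[:, None]                       # (n, 1)
    theta = base ** (-np.arange(0, d, 2) / d)[None, :]  # (1, d/2) frequencies
    ang = pos * theta
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin                # 2D rotation per pair
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Relative-position check: put the SAME query vector at every position and
# the SAME key vector at every position. After RoPE, the dot product
# q[i]·k[j] should depend only on the offset j - i.
rng = np.random.default_rng(1)
q = np.tile(rng.standard_normal(8), (10, 1))
k = np.tile(rng.standard_normal(8), (10, 1))
qr, kr = rope(q), rope(k)
print(np.allclose(qr[2] @ kr[5], qr[4] @ kr[7]))  # True: both offsets are 3
```

Because the rotation composes as R(i)ᵀR(j) = R(j − i), absolute position cancels out of the score, which is exactly why RoPE extrapolates to longer sequences better than absolute-index schemes.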
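The reordering at the heart of Flash Attention is the online (streaming) softmax: key/value blocks are processed one at a time while a running max and normalizer are maintained, so the full score row never has to exist in memory at once. A minimal single-query NumPy sketch of that idea (the real kernel also tiles over queries and runs fused in GPU on-chip memory):

```python
import numpy as np

def online_softmax_weighted_sum(scores, v, block=4):
    """softmax(scores) @ v computed one block at a time.

    m = running max, l = running normalizer, acc = running weighted sum.
    Earlier partial results are rescaled by exp(m_old - m_new) whenever
    a new block raises the running max, keeping everything numerically stable.
    """
    n = scores.shape[0]
    m = -np.inf
    l = 0.0
    acc = np.zeros(v.shape[1])
    for s in range(0, n, block):
        blk = scores[s:s + block]
        m_new = max(m, blk.max())
        scale = np.exp(m - m_new)          # rescale old accumulators
        p = np.exp(blk - m_new)            # this block's unnormalized weights
        l = l * scale + p.sum()
        acc = acc * scale + p @ v[s:s + block]
        m = m_new
    return acc / l

rng = np.random.default_rng(2)
scores = rng.standard_normal(12)
v = rng.standard_normal((12, 4))
w = np.exp(scores - scores.max())
ref = (w / w.sum()) @ v                    # ordinary full-row softmax
print(np.allclose(online_softmax_weighted_sum(scores, v), ref))  # True
```

The result is bit-for-bit equivalent in exact arithmetic to the standard softmax, which is why Flash Attention changes memory traffic, not model behavior.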

Tags: attention-mechanism, multi-head-attention, rope-encoding
