The Transformer ditched RNNs' sequential processing and introduced **self-attention**: every token can attend to every other token simultaneously.
**Key components:**
- **Self-attention**: weighs how relevant each word is to every other word in the sequence
- **Multi-head attention**: multiple attention heads look for different relationship types
- **Position encoding**: since attention itself is order-agnostic, position information is injected as signals added to the token embeddings
- **Feed-forward layers**: process each token independently after attention
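The core of the list above, scaled dot-product self-attention, can be sketched in a few lines of NumPy. This is an illustrative single-head version with made-up weight matrices (`Wq`, `Wk`, `Wv` are assumptions, not values from any real model); real implementations add batching, masking, and multiple heads:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the same input into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token scores every other token; scale by sqrt(d_k)
    # to keep the dot products from growing with dimension.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))          # 4 toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

Multi-head attention just runs several smaller copies of this in parallel and concatenates their outputs.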
**Why it matters:** Attention over all tokens at once is parallelisable, so training runs fast on GPUs, and the architecture scales remarkably well: performance keeps improving as models, data, and compute grow.