The Transformer ditched RNNs' sequential processing and introduced **self-attention**: every token can attend to every other token simultaneously.
**Key components:**
- **Self-attention**: weighs how relevant each word is to every other word in the sequence
- **Multi-head attention**: multiple attention heads look for different relationship types
- **Position encoding**: since attention itself is order-agnostic, position information is injected as signals added to the token embeddings
- **Feed-forward layers**: process each token independently after attention
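The core of the list above, scaled dot-product self-attention, can be sketched in a few lines of NumPy. This is an illustrative single-head version with made-up weight matrices (`Wq`, `Wk`, `Wv` are assumptions, not values from any real model); real implementations add batching, masking, and multiple heads:

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Project the same input into queries, keys, and values.
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # Every token scores every other token; scale by sqrt(d_k)
    # to keep the dot products from growing with dimension.
    scores = Q @ K.T / np.sqrt(d_k)
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # weighted mix of value vectors

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
X = rng.normal(size=(seq_len, d_model))          # 4 toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)
print(out.shape)  # (4, 8): one output vector per input token
```

Multi-head attention just runs several smaller copies of this in parallel and concatenates their outputs.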
**Why it matters:** Attention over all tokens at once is parallelisable, so training runs fast on GPUs, and the architecture scales remarkably well: performance keeps improving as models, data, and compute grow.