The Scaling Laws That Shaped LLM Development

Between 2020 and 2024, LLM capabilities grew predictably with model size, training data, and compute; these relationships were formalized as scaling laws. The laws guided billions in AI investment, and their apparent limits in 2024–2026 triggered the shift to reasoning models that scale inference compute instead.

In 2020, OpenAI researchers published a landmark paper (Kaplan et al., "Scaling Laws for Neural Language Models") showing that language model performance improves predictably as you scale three factors: model size (parameters), training data (tokens), and training compute (FLOPs). The relationships were power laws: doubling any one factor produced a specific, measurable reduction in loss (the sketches below make this concrete). This finding shaped the next five years of AI. Companies raised billions to buy more GPUs and train larger models because the scaling laws predicted better performance for more compute. GPT-3 at 175B parameters validated the approach; GPT-4, Claude, and Gemini pushed further.

In 2022, DeepMind's Chinchilla paper refined the laws, showing that earlier models were undertrained: for optimal compute efficiency, model size and training tokens should scale roughly equally, which works out to about 20 training tokens per parameter. This changed how frontier models were built.

The surprising twist came in 2024: scaling started hitting limits. Returns on simply making models bigger diminished, training data approached the frontier of available high-quality text on the internet, and the marginal cost of each capability gain rose sharply. The field pivoted to a new scaling dimension: inference-time compute. Reasoning models (o1, o3, R1, Claude with extended thinking) spend more compute at inference time by generating extensive chain-of-thought before responding. This opened a new scaling frontier, the amount of thinking applied per query, which appears to have its own power-law improvements.

The practical implication for 2026: raw model size matters less than it did; how much the model can think matters more.
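To make the first claim concrete, here is a minimal sketch of the power-law form from the 2020 paper, using the exponent and scale constant reported there for the model-size law. The constants are illustrative rather than exact, and the function and variable names are my own.

```python
# A minimal sketch of the Kaplan et al. (2020) power-law form for loss
# versus parameter count. ALPHA_N and N_C are the values reported in
# that paper; treat them as illustrative.

ALPHA_N = 0.076   # reported exponent of the model-size power law
N_C = 8.8e13      # reported scale constant, in parameters

def loss_from_params(n_params: float) -> float:
    """Predicted cross-entropy loss (nats/token) when model size is the
    binding constraint: L(N) = (N_c / N) ** alpha_N."""
    return (N_C / n_params) ** ALPHA_N

if __name__ == "__main__":
    for n in (1.75e11, 3.5e11, 7.0e11):   # 175B, 350B, 700B parameters
        print(f"N = {n:.2e} -> L = {loss_from_params(n):.4f}")
    # Doubling N multiplies loss by 2 ** -alpha_N, about 0.949, i.e. a
    # roughly 5% loss reduction per doubling: the "specific, measurable
    # improvement" described above.
    print(f"per-doubling factor: {2 ** -ALPHA_N:.4f}")
```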
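The Chinchilla allocation can be sketched the same way. This assumes the common approximations C ≈ 6·N·D training FLOPs and the roughly 20-tokens-per-parameter optimum; the paper's fitted exponents differ slightly from an exact even split, so treat this as a rule of thumb, and the function name is hypothetical.

```python
import math

# A minimal sketch of Chinchilla-style compute-optimal allocation
# (Hoffmann et al., 2022). Assumes C ~= 6 * N * D training FLOPs and
# the approximate 20-tokens-per-parameter optimum; the paper's fitted
# exponents differ slightly, so this is a rule of thumb.

TOKENS_PER_PARAM = 20.0  # approximate compute-optimal ratio

def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
    """Split a training FLOP budget into model size N and token count D
    so that D / N ~= 20, under C = 6 * N * D. Both scale as sqrt(C),
    which is the "scale roughly equally" result described above."""
    n_params = math.sqrt(compute_flops / (6.0 * TOKENS_PER_PARAM))
    n_tokens = TOKENS_PER_PARAM * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # Sanity check: Chinchilla's own budget (~5.9e23 FLOPs) recovers its
    # published shape of ~70B parameters trained on ~1.4T tokens.
    for c in (5.88e23, 1e25):
        n, d = chinchilla_optimal(c)
        print(f"C = {c:.2e} FLOPs -> N ~= {n:.2e} params, D ~= {d:.2e} tokens")
```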
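Finally, a toy sketch of one simple form of inference-time scaling: parallel sampling with a majority vote (self-consistency). This is a different axis from the serial long chain-of-thought the card describes, and the solve_once stub is hypothetical, standing in for a real reasoning model; it only illustrates the shared idea that more inference compute per query buys more reliable answers.

```python
import random

# Hypothetical sketch of inference-time scaling via self-consistency:
# sample several independent rollouts and keep the majority answer.
# Nothing here reflects any specific model's internals.

def solve_once(question: str, rng: random.Random) -> str:
    """Stub for a single chain-of-thought rollout. A real system would
    decode reasoning tokens from an LLM; this fakes a noisy solver that
    returns the right answer ("42") 60% of the time."""
    return "42" if rng.random() < 0.6 else str(rng.randint(0, 99))

def solve_with_budget(question: str, n_samples: int, seed: int = 0) -> str:
    """More inference compute means more rollouts, and the majority
    answer becomes correct with increasing probability."""
    rng = random.Random(seed)
    answers = [solve_once(question, rng) for _ in range(n_samples)]
    return max(set(answers), key=answers.count)   # majority vote

if __name__ == "__main__":
    for n in (1, 5, 25):
        print(f"{n} rollouts -> answer {solve_with_budget('toy question', n)}")
```

Serial scaling (longer chains of thought) and parallel scaling (more samples) are distinct knobs, but published reasoning-model results suggest both show smooth returns to extra inference compute.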

Tags: history-of-large-language-models · scaling-laws · llm-research