RLHF, DPO, and the Evolution of Alignment Training

Pretraining produces capable models, but raw pretrained models are not useful assistants. Alignment training is what shapes them into the helpful, honest, and harmless systems users actually interact with. The techniques have evolved rapidly, from RLHF to DPO to constitutional AI, each addressing limitations of the previous approach.

Pretraining a language model on text gives it broad capability but no specific behavior. The model can write helpful responses or harmful ones, follow instructions or ignore them, refuse harmful requests or comply with them. Alignment training shapes raw capability into desired behavior.

The dominant technique from 2022 to 2023 was RLHF (reinforcement learning from human feedback). Human raters compared pairs of model outputs, a reward model learned to predict which outputs humans preferred, and the language model was then fine-tuned with reinforcement learning to maximize the reward model's predicted preference. RLHF shaped the helpful behavior of ChatGPT, Claude, and Gemini, but it came with downsides: a complex multi-stage training pipeline, instability during the RL phase, and emergent failure modes like sycophancy, where models learn to flatter users rather than serve them.

DPO (Direct Preference Optimization), introduced in 2023, simplified the pipeline dramatically. Instead of training a separate reward model and then running RL, DPO fine-tunes the language model directly on preference pairs using a closed-form objective. The math is elegant: under the Bradley-Terry preference model, DPO optimizes the same KL-regularized objective as RLHF, but without a separate reward model or RL infrastructure. Most open-source alignment training in 2024 and 2025 shifted to DPO and its variants (IPO, KTO, ORPO).

Constitutional AI, developed at Anthropic, takes a different approach: rather than relying purely on human feedback, the model uses a "constitution" of principles to critique and revise its own outputs, with human feedback supplementing rather than driving the process. This scales better as model capability grows and makes it more transparent what behavior is being shaped.

The frontier in 2026 includes process supervision (rewarding correct reasoning steps, not just correct answers), online learning from production user feedback, and, increasingly, AI feedback rather than purely human feedback for alignment at scale. Each technique addresses real limitations of its predecessor, and the rapid evolution shows that alignment is far from a solved problem.
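To make the RLHF reward-modeling stage concrete, here is a minimal sketch of the pairwise Bradley-Terry loss a reward model is typically trained with. The function and tensor names are illustrative assumptions, not any particular library's API.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(chosen_rewards: torch.Tensor,
                      rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise Bradley-Terry loss for reward-model training (a sketch).

    chosen_rewards / rejected_rewards: scalar rewards the model assigns to
    the human-preferred and human-dispreferred response, each shape (batch,).
    Minimizing -log sigmoid(r_chosen - r_rejected) pushes the reward of the
    preferred response above the rejected one.
    """
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```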
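The closed-form DPO objective really does fit in a few lines. Below is a sketch of the loss from the original DPO formulation; it assumes you have already computed the summed log-probability of each response under the current policy and under a frozen reference model (the variable names are mine, not a library's).

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss (after Rafailov et al., 2023).

    Each input is the total log-probability of a response given its prompt,
    shape (batch,). beta scales the implicit KL penalty that keeps the
    policy close to the reference model.
    """
    # Log-ratio of policy to reference for the preferred response...
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    # ...and for the dispreferred response.
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    # The implicit reward margin; the loss increases the margin in favor
    # of the chosen response: -log sigmoid(beta * margin).
    margin = chosen_logratio - rejected_logratio
    return -F.logsigmoid(beta * margin).mean()
```

Note how the reward model has disappeared: the policy's own log-probabilities, measured against the reference model, play the role of the reward, which is what makes the single-stage pipeline possible.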
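The critique-and-revision loop at the core of constitutional AI's supervised phase can be sketched as follows. `generate` is a hypothetical stand-in for whatever model-inference call you have, and the two principles are illustrative only, not Anthropic's actual constitution.

```python
# A minimal sketch of the supervised critique-and-revision phase of
# constitutional AI. `generate(prompt)` is a hypothetical stand-in for a
# model-inference call; the principles below are illustrative examples.

PRINCIPLES = [
    "Choose the response that is least likely to cause harm.",
    "Choose the response that is most honest about uncertainty.",
]

def critique_and_revise(generate, user_prompt: str, rounds: int = 1) -> str:
    response = generate(user_prompt)
    for _ in range(rounds):
        for principle in PRINCIPLES:
            # The model critiques its own output against one principle...
            critique = generate(
                f"Principle: {principle}\n"
                f"Prompt: {user_prompt}\nResponse: {response}\n"
                "Critique the response against the principle."
            )
            # ...then rewrites the output to address its own critique.
            response = generate(
                f"Prompt: {user_prompt}\nResponse: {response}\n"
                f"Critique: {critique}\n"
                "Rewrite the response to address the critique."
            )
    # Revised responses become fine-tuning targets; in the full recipe an
    # AI preference model (RLAIF) largely replaces human raters afterwards.
    return response
```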
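Finally, the difference between outcome and process supervision comes down to where the reward signal attaches. A toy sketch, assuming step-level correctness labels are available from a human or AI grader:

```python
# A toy contrast between outcome and process supervision. `steps` is a
# hypothetical list of (reasoning_step, is_correct) labels such as a
# human or AI grader might assign.

def outcome_reward(final_answer_correct: bool) -> float:
    # Outcome supervision: one sparse signal for the whole solution.
    return 1.0 if final_answer_correct else 0.0

def process_reward(steps: list[tuple[str, bool]]) -> float:
    # Process supervision: dense credit for each correct reasoning step,
    # so a solution that is right for the wrong reasons scores lower.
    if not steps:
        return 0.0
    return sum(ok for _, ok in steps) / len(steps)
```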
Tags: rlhf, dpo, constitutional-ai, training-pipelines