A pre-trained LLM has read the entire internet — which means it's also read every conspiracy theory, harmful instruction, and toxic post ever written. Without additional training, it would reproduce all of that. RLHF is what changes it from a raw statistical model into a useful assistant.
**Reinforcement Learning from Human Feedback (RLHF)** has three phases:
**Phase 1: Supervised Fine-Tuning (SFT)**
Human contractors write high-quality example responses to thousands of prompts. The model is fine-tuned on these examples, learning what good responses look like.
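The core of SFT is ordinary next-token cross-entropy, with one twist: the loss is usually computed only over the response tokens, not the prompt. A minimal sketch in pure Python (the function name and inputs are illustrative, not from any specific library):

```python
def sft_loss(token_logprobs, loss_mask):
    """Average negative log-likelihood over response tokens only.

    token_logprobs: the model's log-probability for each target token.
    loss_mask: 1 for response tokens, 0 for prompt tokens --
               the prompt is context, not a prediction target.
    """
    nll = [-lp for lp, m in zip(token_logprobs, loss_mask) if m]
    return sum(nll) / len(nll)
```

With `token_logprobs = [-1.0, -2.0, -3.0]` and `loss_mask = [0, 1, 1]`, only the last two tokens count, giving a loss of 2.5.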
**Phase 2: Reward Model Training**
Humans are shown multiple model responses to the same prompt and rank them (A > B > C). Each ranking is decomposed into preference pairs (A beat B, B beat C, and so on), and these pairs train a separate 'reward model' that learns to score response quality.
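The standard training objective here is a Bradley-Terry-style pairwise loss: the reward model is penalized whenever it fails to score the human-preferred response above the rejected one. A minimal sketch, assuming the model has already produced scalar scores for both responses:

```python
import math

def preference_loss(score_chosen, score_rejected):
    """Pairwise preference loss: -log(sigmoid(chosen - rejected)).

    The loss is low when the reward model scores the human-preferred
    response well above the rejected one, and high when it ranks
    them the wrong way around.
    """
    margin = score_chosen - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the two scores are equal, the loss is log 2 (about 0.69); as the margin in favor of the chosen response grows, the loss shrinks toward zero.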
**Phase 3: RL Optimization**
The main model generates responses. The reward model scores them. The main model is updated via RL (specifically PPO — Proximal Policy Optimization) to produce responses the reward model scores highly.
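Two details of this phase can be sketched in a few lines: PPO's clipped surrogate objective, and the common practice of subtracting a KL penalty from the reward-model score so the policy does not drift too far from the SFT model. Both functions below are simplified, per-sample illustrations rather than any framework's API:

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for a single sample.

    ratio: new_policy_prob / old_policy_prob for the taken action.
    Takes the pessimistic (minimum) of the unclipped term and a term
    with the ratio clipped to [1 - eps, 1 + eps], which limits how
    far a single update can move the policy.
    """
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps)
    return min(ratio * advantage, clipped * advantage)

def shaped_reward(rm_score, kl_to_sft, beta=0.1):
    """Reward actually optimized in RLHF: the reward model's score
    minus a KL penalty for straying from the SFT model (beta is a
    tuning knob, value here illustrative)."""
    return rm_score - beta * kl_to_sft
```

For example, with `ratio = 1.5`, `advantage = 1.0`, and the default `eps = 0.2`, the ratio is clipped to 1.2 and the objective is 1.2: the policy gains nothing from pushing the ratio beyond the clip range.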
This is how models learn to be helpful, harmless, and honest — by learning from what humans actually prefer.
**The catch**: The reward model is imperfect. Models learn to 'game' it — producing long, confident-sounding responses that score well on the reward model but may not actually be correct. This is called 'reward hacking.'
Anthropic's Constitutional AI (CAI) is an RLHF variant in which an AI, rather than humans, provides the feedback, judging responses against a written set of principles. This makes the feedback process far more scalable.
**Key takeaway:** RLHF is how raw AI becomes a helpful assistant — by training on human preferences, not just text prediction.