WeeBytes
RLHF: How ChatGPT Learned to Be Helpful


Pre-training gives a model knowledge. RLHF (Reinforcement Learning from Human Feedback) gives it alignment — teaching it to be helpful, harmless, and honest.

**The problem:** A pre-trained LLM just predicts text. It will complete 'How do I make a bomb?' as readily as 'Explain photosynthesis'. Not ideal.

**RLHF solution (3 steps):**

1. **SFT (Supervised Fine-Tuning)**: Fine-tune on human-written ideal responses

2. **Reward Model**: Train a separate model to score response quality, learned from human preference comparisons (response A vs response B)

3. **PPO (Proximal Policy Optimization) training**: Use reinforcement learning to optimise the LLM to maximise reward-model scores, with a KL penalty keeping it close to the SFT model
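Step 2 above is usually trained with a pairwise (Bradley-Terry) objective: the reward model should score the human-preferred response higher than the rejected one. A minimal sketch in plain Python (the function name and example reward values are illustrative, not from any specific library):

```python
import math

def preference_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected).

    The loss is small when the reward model already scores the
    human-preferred response higher, and large when it gets the
    ordering wrong -- so minimising it teaches the ranking."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# The loss shrinks as the scoring margin grows in the right direction:
low  = preference_loss(2.0, 0.0)   # correct ordering, confident
tie  = preference_loss(0.0, 0.0)   # undecided: loss = ln 2
high = preference_loss(0.0, 2.0)   # wrong ordering: penalised hard
```

In real systems the rewards come from a neural network head on top of the LLM, but the loss on each preference pair has exactly this shape.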

The result: a model that prefers helpful, safe responses — because those get high reward.

Current frontier: RLAIF (AI feedback), Constitutional AI, DPO (Direct Preference Optimization — simpler than PPO).
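DPO's appeal is that it skips the reward model and PPO loop entirely: it trains directly on preference pairs, comparing the policy's log-probabilities against a frozen reference (SFT) model. A minimal sketch of the per-pair loss, assuming summed token log-probs are already available (the function name and beta value are illustrative):

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    logp_*     : policy log-probabilities of each full response
    ref_logp_* : same quantities under the frozen reference model
    beta       : how strongly to penalise drifting from the reference"""
    # Implied reward margin: how much the policy has shifted toward
    # the chosen response, relative to the reference model.
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Policy has moved toward the chosen and away from the rejected
# response relative to the reference, so the loss is below ln 2:
loss = dpo_loss(logp_chosen=-4.0, logp_rejected=-9.0,
                ref_logp_chosen=-6.0, ref_logp_rejected=-7.0)
```

Note the same sigmoid shape as the reward-model loss: DPO folds reward modelling and policy optimisation into one supervised-looking objective.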

Tags: llm · nlp · deep-learning · large-language-model
