The refinement phase that transforms raw language models into helpful assistants
Core Idea: Post-training is the process of aligning pre-trained language models with human preferences through supervised fine-tuning and reinforcement learning, giving them their assistant-like persona and capabilities.
Key Elements
Supervised Fine-Tuning (SFT)
- Models are trained on human-generated examples of helpful responses
- Datasets consist of prompt-response pairs (often multi-turn conversations) written or curated by human labelers
- Training shifts the model from predicting internet text to producing assistant-like responses
- This phase establishes the basic "personality" and response style of the model
- SFT typically requires far less data than pre-training (thousands to a few million examples); a minimal training-step sketch follows this list
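A minimal sketch of a single SFT update, assuming a Hugging Face causal LM; the model name, example texts, and learning rate are placeholders rather than settings from any particular pipeline:

```python
# Minimal single SFT update with a Hugging Face causal LM.
# "gpt2", the example texts, and the learning rate are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "User: What causes rainbows?\nAssistant:"
response = " Sunlight is refracted and reflected inside water droplets, splitting into colors."

# Concatenate prompt and response; supervise only the response tokens by
# masking the prompt positions in the labels with -100 (ignored by the loss).
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Masking the prompt tokens is what shifts the objective from "predict internet text" to "produce the assistant's reply given the conversation so far."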
Reinforcement Learning from Human Feedback (RLHF)
- Human labelers rank multiple model outputs from best to worst
- A reward model is trained to predict human preferences
- The language model is then optimized (commonly with PPO) to maximize this learned reward, with a KL penalty keeping it close to the SFT model
- This process improves helpfulness, harmlessness, and honesty
- Multiple rounds of preference collection, reward-model retraining, and policy optimization may be run; a sketch of the pairwise reward-model loss follows this list
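The reward model is typically trained with a pairwise (Bradley-Terry style) objective on the labelers' rankings. A minimal sketch, assuming `reward_model` is a callable that maps tokenized text to a scalar score:

```python
# Pairwise (Bradley-Terry style) reward-model loss used in RLHF pipelines.
# `reward_model` is an assumed callable mapping tokenized text to a scalar score.
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Push the score of the human-preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # scalar score per example
    r_rejected = reward_model(rejected_ids)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Pairwise comparisons are used because labelers are far more consistent at ranking two responses than at assigning absolute quality scores; the policy is then tuned to maximize the learned score.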
Constitutional AI (CAI)
- Alternative to RLHF that replaces direct human feedback with a written set of principles (a "constitution")
- Models critique their own outputs based on constitutional principles
- Self-critique and AI-generated preference labels reduce dependence on human labelers
- Can be combined with RLHF for better results
- Helps establish ethical boundaries and response quality; a sketch of the critique-and-revision loop follows this list
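A rough sketch of the critique-and-revision loop from the supervised phase of Constitutional AI; `generate` stands in for any call that samples from the current model, and the prompt wording and principles are illustrative, not the paper's exact templates:

```python
# Rough sketch of Constitutional AI's critique-and-revision loop (supervised phase).
# `generate(text)` stands in for any call that samples from the current model;
# the prompt wording and principles are illustrative, not the paper's templates.
def critique_and_revise(generate, user_prompt, principles):
    response = generate(user_prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Revised responses become SFT targets; in the RL phase, AI-generated
    # preference labels over response pairs train the reward model (RLAIF).
    return response
```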
Thinking/Reasoning Enhancement
- Models are specifically trained to show step-by-step reasoning
- "Chain of thought" patterns are encouraged through reinforcement
- Rather than imitating worked solutions, models discover effective reasoning strategies through trial and error on problems with verifiable answers
- This improves performance on math, coding, and complex reasoning tasks
- Allows models to "think before answering," spending more tokens on intermediate reasoning for higher accuracy; a sketch of outcome-reward RL follows this list
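A rough sketch of outcome-based RL for reasoning with a simple group baseline (in the spirit of GRPO-style training); `sample`, `logprob`, and `check_answer` are assumed interfaces rather than a real library's API:

```python
# Rough sketch of outcome-reward RL for reasoning, with a simple group baseline.
# `sample`, `logprob`, and `check_answer` are assumed interfaces, not a real library API;
# `logprob` is assumed to return a differentiable log-probability of the completion.
def reasoning_rl_loss(sample, logprob, prompt, check_answer, num_samples=4):
    completions = [sample(prompt + "\nThink step by step.") for _ in range(num_samples)]
    rewards = [1.0 if check_answer(c) else 0.0 for c in completions]
    baseline = sum(rewards) / len(rewards)  # group-mean baseline
    # Policy gradient: raise the log-probability of above-average completions,
    # lower it for below-average ones.
    loss = -sum((r - baseline) * logprob(prompt, c)
                for r, c in zip(rewards, completions)) / num_samples
    return loss
```

Because the reward depends only on whether the final answer checks out, the model is free to develop whatever intermediate reasoning gets it there.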
Connections
- Related Concepts: LLM Pre-training (the foundational phase preceding post-training), LLM Thinking Models (models optimized for reasoning capabilities)
- Broader Context: AI Alignment (ensuring AI systems act in accordance with human values)
- Applications: LLM Tool Use (enhanced through post-training for specific capabilities)
- Components: Reinforcement Learning from AI Feedback (RLAIF, a variant of RLHF using AI evaluators)
References
- Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback" (Anthropic's Constitutional AI paper)
- Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (OpenAI's InstructGPT paper describing RLHF)
- DeepSeek-AI (2025), "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
#LLM #post-training #RLHF #alignment #fine-tuning