The refinement phase that transforms raw language models into helpful assistants
Core Idea: Post-training is the process of aligning pre-trained language models with human preferences through supervised fine-tuning and reinforcement learning, giving them their assistant-like persona and capabilities.
Key Elements
Supervised Fine-Tuning (SFT)
- Models are trained on human-generated examples of helpful responses
- Datasets consist of prompt-response pairs (often multi-turn conversations) written or curated by human labelers
- Training shifts the model from predicting internet text to producing assistant-like responses
- This phase establishes the basic "personality" and response style of the model
- SFT typically requires far less data than pre-training (thousands to a few million examples); a minimal training-step sketch follows this list
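A minimal sketch of a single SFT update, assuming a Hugging Face causal LM; the model name, example texts, and learning rate are placeholders rather than settings from any particular pipeline:

```python
# Minimal single SFT update with a Hugging Face causal LM.
# "gpt2", the example texts, and the learning rate are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

prompt = "User: What causes rainbows?\nAssistant:"
response = " Sunlight is refracted and reflected inside water droplets, splitting into colors."

# Concatenate prompt and response; supervise only the response tokens by
# masking the prompt positions in the labels with -100 (ignored by the loss).
prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
full_ids = tokenizer(prompt + response, return_tensors="pt").input_ids
labels = full_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids=full_ids, labels=labels).loss  # next-token cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Masking the prompt tokens is what shifts the objective from "predict internet text" to "produce the assistant's reply given the conversation so far."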
Reinforcement Learning from Human Feedback (RLHF)
- Human labelers rank multiple model outputs from best to worst
- A reward model is trained to predict human preferences
- The language model is then optimized (commonly with PPO) to maximize this learned reward, with a KL penalty keeping it close to the SFT model
- This process improves helpfulness, harmlessness, and honesty
- Multiple rounds of preference collection, reward-model retraining, and policy optimization may be run; a sketch of the pairwise reward-model loss follows this list
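The reward model is typically trained with a pairwise (Bradley-Terry style) objective on the labelers' rankings. A minimal sketch, assuming `reward_model` is a callable that maps tokenized text to a scalar score:

```python
# Pairwise (Bradley-Terry style) reward-model loss used in RLHF pipelines.
# `reward_model` is an assumed callable mapping tokenized text to a scalar score.
import torch.nn.functional as F

def reward_model_loss(reward_model, chosen_ids, rejected_ids):
    """Push the score of the human-preferred response above the rejected one."""
    r_chosen = reward_model(chosen_ids)      # scalar score per example
    r_rejected = reward_model(rejected_ids)
    # -log sigmoid(r_chosen - r_rejected) is minimized when chosen outranks rejected
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Pairwise comparisons are used because labelers are far more consistent at ranking two responses than at assigning absolute quality scores; the policy is then tuned to maximize the learned score.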
Constitutional AI (CAI)
- Alternative to RLHF that replaces direct human feedback with a written set of principles (a "constitution")
- Models critique their own outputs based on constitutional principles
- Self-critique and AI-generated preference labels reduce dependence on human labelers
- Can be combined with RLHF for better results
- Helps establish ethical boundaries and response quality; a sketch of the critique-and-revision loop follows this list
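A rough sketch of the critique-and-revision loop from the supervised phase of Constitutional AI; `generate` stands in for any call that samples from the current model, and the prompt wording and principles are illustrative, not the paper's exact templates:

```python
# Rough sketch of Constitutional AI's critique-and-revision loop (supervised phase).
# `generate(text)` stands in for any call that samples from the current model;
# the prompt wording and principles are illustrative, not the paper's templates.
def critique_and_revise(generate, user_prompt, principles):
    response = generate(user_prompt)
    for principle in principles:
        critique = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\n"
            f"Critique the response according to this principle: {principle}"
        )
        response = generate(
            f"Prompt: {user_prompt}\nResponse: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    # Revised responses become SFT targets; in the RL phase, AI-generated
    # preference labels over response pairs train the reward model (RLAIF).
    return response
```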
Thinking/Reasoning Enhancement
- Models are specifically trained to show step-by-step reasoning
- "Chain of thought" patterns are encouraged through reinforcement
- Rather than imitating worked solutions, models discover effective reasoning strategies through trial and error on problems with verifiable answers
- This improves performance on math, coding, and complex reasoning tasks
- Allows models to "think before answering," spending more tokens on intermediate reasoning for higher accuracy; a sketch of outcome-reward RL follows this list
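A rough sketch of outcome-based RL for reasoning with a simple group baseline (in the spirit of GRPO-style training); `sample`, `logprob`, and `check_answer` are assumed interfaces rather than a real library's API:

```python
# Rough sketch of outcome-reward RL for reasoning, with a simple group baseline.
# `sample`, `logprob`, and `check_answer` are assumed interfaces, not a real library API;
# `logprob` is assumed to return a differentiable log-probability of the completion.
def reasoning_rl_loss(sample, logprob, prompt, check_answer, num_samples=4):
    completions = [sample(prompt + "\nThink step by step.") for _ in range(num_samples)]
    rewards = [1.0 if check_answer(c) else 0.0 for c in completions]
    baseline = sum(rewards) / len(rewards)  # group-mean baseline
    # Policy gradient: raise the log-probability of above-average completions,
    # lower it for below-average ones.
    loss = -sum((r - baseline) * logprob(prompt, c)
                for r, c in zip(rewards, completions)) / num_samples
    return loss
```

Because the reward depends only on whether the final answer checks out, the model is free to develop whatever intermediate reasoning gets it there.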
Connections
- Related Concepts: LLM Pre-training (the foundational phase preceding post-training), LLM Thinking Models (models optimized for reasoning capabilities)
- Broader Context: AI Alignment (ensuring AI systems act in accordance with human values)
- Applications: LLM Tool Use (enhanced through post-training for specific capabilities)
- Components: Reinforcement Learning from AI Feedback (RLAIF, a variant of RLHF using AI evaluators)
References
- Bai et al. (2022), "Constitutional AI: Harmlessness from AI Feedback" (Anthropic's Constitutional AI paper)
- Ouyang et al. (2022), "Training language models to follow instructions with human feedback" (OpenAI's InstructGPT paper describing RLHF)
- DeepSeek-AI (2025), "DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning"
#LLM #post-training #RLHF #alignment #fine-tuning