What is Post-Training, Really?¶
Imagine you’ve trained a fancy autocomplete machine to predict what word comes next in any sentence. You feed it the entire internet, and now it can complete any text you throw at it. Impressive!
But it doesn’t know it’s supposed to help you. Ask it “What’s the capital of France?” and it might just continue with “What’s the capital of Germany? What’s the capital of Italy?” Because that’s what text on the internet looks like — lists of similar questions.
This is the problem with pre-trained models. They’re brilliant at language, but they don’t know they’re supposed to be assistants.
Post-training (also called fine-tuning or alignment) is how we fix this. It’s the process of teaching a pre-trained language model to:
Follow instructions — When you ask a question, it should answer (not just complete your sentence)
Align with human preferences — Generate responses humans actually like
Refuse harmful requests — Say “no” to dangerous or unethical tasks
Be truthful — Admit when it doesn’t know something
This is what transforms a base model (fancy autocomplete) into an assistant (actually helpful).
Think of it like this: pre-training teaches you grammar and vocabulary by reading every book in the library. Post-training is like going to charm school to learn how to have a conversation.
The Post-Training Pipeline¶
Modern AI assistants like GPT-4, Claude, and Llama all go through the same basic journey. It’s a three-stage process, and each stage builds on the last:
┌──────────────────────────────────────────────────────────────┐
│ STAGE 1: PRE-TRAINING                                        │
│ Train on massive text corpus (like, the whole internet)      │
│ Goal: Learn to predict the next word                         │
│ Result: A really good autocomplete                           │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 2: SUPERVISED FINE-TUNING (SFT)                        │
│ Train on thousands of (instruction → response) examples      │
│ Goal: Learn to follow instructions                           │
│ Result: A model that acts like an assistant                  │
└──────────────────────────────────────────────────────────────┘
                               │
                               ▼
┌──────────────────────────────────────────────────────────────┐
│ STAGE 3: PREFERENCE ALIGNMENT                                │
│ Option A — RLHF: Train reward model, optimize with PPO       │
│ Option B — DPO: Directly learn from preference pairs         │
│ Goal: Match what humans actually want                        │
│ Result: A model that's helpful, harmless, and honest         │
└──────────────────────────────────────────────────────────────┘
Let’s break down what each stage actually does (we’ll go deep on all of these later, don’t worry).
What You’ll Learn¶
We’re going to implement the complete post-training pipeline, from scratch, with real code you can run and modify. Here’s the journey:
Why Post-Training Matters — See the difference between base models and aligned models (it’s dramatic)
Supervised Fine-Tuning (SFT) — Train a model to follow instructions using example conversations
Reward Modeling — Teach a model to predict which responses humans prefer
RLHF with PPO — Use Reinforcement Learning from Human Feedback with Proximal Policy Optimization (yeah, it’s a mouthful — we’ll explain)
Direct Preference Optimization (DPO) — A simpler, more stable alternative to RLHF (this is the hot new thing)
Advanced Topics — Memory optimization, hyperparameter tuning, and how to actually evaluate these models
By the end, you’ll understand exactly how models like GPT-4 and Claude are built. Not just conceptually — you’ll have working code.
# First things first — let's check our environment
# (Making sure we have the tools we need)
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModelForCausalLM
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
    print("(Nice! We've got a GPU. Training will be much faster.)")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
    print("(Apple Silicon! Also great for training.)")
else:
    print("(No GPU detected — training will be slower but still works!)")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
(Nice! We've got a GPU. Training will be much faster.)
The Three Stages, Side by Side¶
Here’s a quick reference for what we’re building. Don’t worry if the math looks scary — we’ll explain every symbol when we get there.
| Stage | Training Data | What We’re Optimizing | What We Get |
|---|---|---|---|
| SFT | (instruction, response) pairs | Maximize P(response \| instruction) | Model that follows instructions |
| Reward Model | (prompt, chosen, rejected) triples | Predict: chosen > rejected | Model that scores responses |
| RLHF/DPO | Prompts + preference data | Maximize expected reward | Model aligned with human values |
A few notes on that table:
P(response | instruction) just means “the probability of generating this response, given this instruction” — in other words, we’re teaching the model to imitate good examples
“chosen > rejected” means we’re training the model to give higher scores to responses humans prefer
“expected reward” is the average score the model gets — we want responses that the reward model thinks are good (but we’ll see why this gets tricky)
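To make the first row of that table concrete, here’s a tiny sketch of how you could actually compute P(response | instruction) with a causal language model. The model choice (gpt2) and the raw plain-text prompt format are placeholders for illustration, not what any production assistant uses.

# A tiny sketch: computing log P(response | instruction) with a causal LM.
# (gpt2 and the plain-text prompt format are illustrative placeholders.)
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

instruction = "What is the capital of France?\n"
response = "The capital of France is Paris."

# Tokenize separately and concatenate, so we know exactly where the response starts
prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
response_ids = tokenizer(response, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

with torch.no_grad():
    logits = model(input_ids).logits  # (1, seq_len, vocab_size)

# Log-probability of each actual token, given everything before it
log_probs = torch.log_softmax(logits[:, :-1, :], dim=-1)
targets = input_ids[:, 1:]
token_log_probs = log_probs.gather(2, targets.unsqueeze(-1)).squeeze(-1)

# Sum only over the response tokens -> log P(response | instruction)
response_start = prompt_ids.shape[1] - 1  # shift by one for next-token prediction
log_p = token_log_probs[:, response_start:].sum()
print(f"log P(response | instruction) = {log_p.item():.2f}")

SFT (the first row of the table) is just training the model to make this number as large as possible across thousands of good examples.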
The Three Training Methods (In Plain English)¶
Let’s talk about what each method actually does, without the jargon.
Supervised Fine-Tuning (SFT)¶
This is the simplest approach: show the model good examples and train it to imitate them.
You give it thousands of pairs like:
Instruction: “What’s the capital of France?”
Response: “The capital of France is Paris.”
And the model learns: “Oh, when someone asks me a question, I should answer it directly.” Simple! Effective! But limited by the quality and diversity of your examples.
It’s like learning to cook by following recipes. You’ll get good at the dishes you practiced, but you might struggle with variations.
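If you want to see what that looks like in code, here’s a stripped-down sketch of a single SFT training step: one hand-written example, a tiny placeholder model (gpt2), no batching and no chat template. One common way to implement “maximize P(response | instruction)” is to mask the prompt tokens so that only the response contributes to the loss, which is what this sketch does.

# A stripped-down sketch of one SFT training step (not the full loop).
# gpt2, the learning rate, and the plain-text format are placeholder choices.
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.train()
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

instruction = "What is the capital of France?\n"
response = "The capital of France is Paris."

prompt_ids = tokenizer(instruction, return_tensors="pt").input_ids
response_ids = tokenizer(response + tokenizer.eos_token, return_tensors="pt").input_ids
input_ids = torch.cat([prompt_ids, response_ids], dim=1)

# Labels of -100 are ignored by the cross-entropy loss, so only the response
# tokens are imitated (that's "maximize P(response | instruction)")
labels = input_ids.clone()
labels[:, : prompt_ids.shape[1]] = -100

loss = model(input_ids, labels=labels).loss  # next-token cross-entropy
loss.backward()
optimizer.step()
optimizer.zero_grad()
print(f"SFT loss on this example: {loss.item():.3f}")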
Reinforcement Learning from Human Feedback (RLHF)¶
This is where things get interesting (and complicated).
First, you train a reward model to predict which responses humans prefer. You show it pairs like:
Prompt: “Explain quantum computing”
Response A: “Quantum computers use qubits...” ✓ (humans prefer this)
Response B: “Idk lol” ✗ (humans reject this)
Then you use PPO (Proximal Policy Optimization — a reinforcement learning algorithm) to train your language model to generate responses that score highly on the reward model.
It’s like learning to cook by having a food critic taste everything and give you feedback. You experiment, get scored, and gradually learn what people like.
The downside? This is complicated. You need multiple models at once (your language model, the reward model, and in practice a frozen reference copy and a value network too), and PPO is notoriously finicky to tune.
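Under the hood, the reward model part is less scary than it sounds: it’s usually a language model with a scalar “score” head, trained with a pairwise loss that pushes the chosen response’s score above the rejected one’s. Here’s that loss in isolation, with made-up scores standing in for what a real reward model would output.

# The core of reward-model training: a pairwise (Bradley-Terry style) loss.
# The scores below are made-up scalars standing in for the outputs of a real
# reward model (a language model with a scalar head).
import torch
import torch.nn.functional as F

reward_chosen = torch.tensor([1.7, 0.3])    # scores for preferred responses
reward_rejected = torch.tensor([0.2, 0.9])  # scores for rejected responses

# We want reward_chosen > reward_rejected, so we maximize
# log sigmoid(chosen - rejected), i.e. minimize its negative.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"Pairwise reward loss: {loss.item():.3f}")

# Probability the model assigns to "chosen beats rejected" for each pair
print(torch.sigmoid(reward_chosen - reward_rejected))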
Direct Preference Optimization (DPO)¶
DPO is the new kid on the block, and it’s elegant as hell.
The key insight: you can skip the reward model entirely! Instead of:
Train reward model on preferences
Use RL to optimize language model against reward model
You just:
Directly optimize the language model on preference pairs
It reformulates the whole RLHF objective as a simple classification-style loss. Comparable results, way less machinery, and more stable training.
It’s like learning to cook by comparing your dishes to reference examples: “My version should taste more like the good example and less like the bad example.” No critic needed — you learn directly from the comparisons.
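Concretely, that “classification-style loss” looks like this. The numbers below are made-up log-probabilities standing in for sums of per-token log-probs from the policy (the model being trained) and a frozen reference model (typically the SFT checkpoint); we’ll compute the real ones later.

# The DPO loss, in miniature. The log-probabilities are made-up placeholders
# for real per-response log-probs under the policy and a frozen reference model.
import torch
import torch.nn.functional as F

beta = 0.1  # how strongly the policy is kept close to the reference

# log P(response | prompt) under the policy being trained
policy_logp_chosen = torch.tensor([-12.0])
policy_logp_rejected = torch.tensor([-14.0])

# log P(response | prompt) under the frozen reference model
ref_logp_chosen = torch.tensor([-13.0])
ref_logp_rejected = torch.tensor([-13.5])

# How much more (or less) the policy likes each response than the reference does
chosen_logratio = policy_logp_chosen - ref_logp_chosen
rejected_logratio = policy_logp_rejected - ref_logp_rejected

# Classification-style loss: push the chosen log-ratio above the rejected one
loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
print(f"DPO loss: {loss.item():.3f}")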
(We’ll implement all three methods, so you can see the tradeoffs yourself.)
# Let me show you the difference in action
# (This is simulated, but it's exactly what you'd see with real models)
prompt = "What is the capital of France?"
# What a BASE MODEL does (just autocomplete)
base_completion = """What is the capital of France? What is the capital of Germany?
What is the capital of Italy? These are common geography questions that students
often struggle with. Let's explore the capitals of European countries..."""
# What an INSTRUCTION-TUNED MODEL does (actually helpful)
instruct_response = """The capital of France is Paris. It's located in the
north-central part of the country along the Seine River."""
print("═" * 70)
print("BASE MODEL (just completes text):")
print("═" * 70)
print(f"Input: {prompt}")
print(f"\nOutput: {base_completion}")
print()
print("═" * 70)
print("INSTRUCTION-TUNED MODEL (answers questions):")
print("═" * 70)
print(f"Input: {prompt}")
print(f"\nOutput: {instruct_response}")
print()
print("See the difference? The base model treats your question like")
print("the beginning of an article. The tuned model actually helps you.")
print("That's the magic of post-training!")══════════════════════════════════════════════════════════════════════
BASE MODEL (just completes text):
══════════════════════════════════════════════════════════════════════
Input: What is the capital of France?
Output: What is the capital of France? What is the capital of Germany?
What is the capital of Italy? These are common geography questions that students
often struggle with. Let's explore the capitals of European countries...
══════════════════════════════════════════════════════════════════════
INSTRUCTION-TUNED MODEL (answers questions):
══════════════════════════════════════════════════════════════════════
Input: What is the capital of France?
Output: The capital of France is Paris. It's located in the
north-central part of the country along the Seine River.
See the difference? The base model treats your question like
the beginning of an article. The tuned model actually helps you.
That's the magic of post-training!
Let’s Begin!¶
In the notebooks that follow, we’ll implement each component of the post-training pipeline with real, runnable code. You’ll see exactly how it works, not just in theory, but in practice.
We’ll start small (fine-tuning a tiny model on a simple task) and build up to the full pipeline (SFT → Reward Model → RLHF/DPO). By the end, you’ll understand how modern AI assistants are built, from first principles.
Ready? Let’s go.
Next up: Why post-training matters (with examples that’ll make it click).