The Complete Pipeline¶
Alright, let’s talk about what happens after you train a base model.
You’ve got a language model that can complete text. Great! But it doesn’t follow instructions. It doesn’t know what you actually want when you ask it a question. It’s like a really smart autocomplete machine that just continues whatever pattern you started.
So how do we go from “autocomplete machine” to “helpful assistant”?
Three stages. Each builds on the last:
┌─────────────────────────────────────────────────────────────────────┐
│ SUPERVISED FINE-TUNING (SFT) │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model to follow instructions │
│ │
│ • Show it examples: "When asked X, respond with Y" │
│ • Use chat templates to format conversations │
│ • Only train on the responses (not the questions) │
│ • LoRA keeps this efficient (we'll explain later) │
│ │
│ Analogy: Teaching someone the *format* of good answers │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ REWARD MODELING │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model what "good" means │
│ │
│ • Show it pairs: "This answer is better than that one" │
│ • Train it to score responses (higher = better) │
│ • Bradley-Terry loss (fancy ranking math) │
│ • Evaluation: does it rank things the way humans would? │
│ │
│ Analogy: Teaching a judge to score gymnastics routines │
└─────────────────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ RLHF │ │ DPO │
├─────────────────────────┤ ├─────────────────────────┤
│ Two-stage approach │ │ One-stage shortcut │
│ │ │ │
│ • Train reward model │ │ • Skip reward model │
│ • Use it to train │ │ • Optimize preferences │
│ policy with PPO │ │ directly │
│ • Complex but powerful │ │ • Simpler, faster │
│ • Needs 4 models (!) │ │ • Only needs 2 models │
│ │ │ │
│ Classic method │ │ Modern alternative │
└─────────────────────────┘    └─────────────────────────┘

RLHF = Reinforcement Learning from Human Feedback
DPO = Direct Preference Optimization
PPO = Proximal Policy Optimization (the RL algorithm RLHF uses)
The end goal? A model that doesn’t just follow instructions, but follows them well. Helpful, harmless, and honest (as the saying goes).
Module Organization¶
Our code is split into clean modules. Each one handles a different stage of the pipeline.
| Module | Purpose | Key Functions |
|---|---|---|
| sft/ | Supervised Fine-Tuning | SFTTrainer, format_instruction |
| reward/ | Reward Model Training | RewardModel, RewardModelTrainer |
| rlhf/ | RLHF with PPO | PPOTrainer, ValueNetwork, RolloutBuffer |
| dpo/ | Direct Preference Optimization | DPOTrainer, compute_dpo_loss |
| utils/ | Shared Utilities | load_model_and_tokenizer, setup_device |
Think of these as separate kitchens in a restaurant. Each one specializes in a different course of the meal. You don’t make dessert where you’re grilling steaks (though I suppose you could...probably shouldn’t).
Data Formats¶
In this pipeline, each stage speaks a different language. Not literally, but in terms of the data format it expects.
Let’s break them down.
SFT (Supervised Fine-Tuning) Data¶
Simple input-output pairs. Question and answer. Instruction and response.
{
"instruction": "What is the capital of France?",
"response": "The capital of France is Paris."
}

Dead simple. The model learns “when you see this format of question, generate this format of answer.”
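Under the hood, that pair gets flattened into one token sequence, and we only compute loss on the response tokens (that’s the “only train on the responses” bullet from the overview). Here’s a minimal sketch, assuming a simple hand-rolled template rather than any particular chat template:

# Sketch: one SFT example -> input_ids + labels, with the prompt tokens masked out
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

example = {
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}

# Illustrative template (the real code uses a proper chat template)
prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"
prompt_ids = tokenizer(prompt)["input_ids"]
response_ids = tokenizer(example["response"] + tokenizer.eos_token)["input_ids"]

input_ids = prompt_ids + response_ids
# -100 is the "ignore this position" label for causal LM loss in PyTorch/transformers
labels = [-100] * len(prompt_ids) + response_ids

print(f"{len(input_ids)} tokens total, loss computed on the last {len(response_ids)}")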
Preference Data (for Reward Model & DPO)¶
Now it gets spicy. Instead of just one answer, we show the model two answers to the same prompt. One good, one bad.
{
"prompt": "Explain quantum computing simply.",
"chosen": "Imagine a coin spinning in the air—it's both heads and tails until it lands. Quantum computers work with information in that 'spinning' state, processing multiple possibilities simultaneously.",
"rejected": "Quantum computers use qubits which leverage quantum superposition and entanglement to perform computations exponentially faster than classical computers by exploiting quantum mechanical phenomena."
}

See the difference? The “chosen” response is simple, clear, uses an analogy. The “rejected” one? Technically accurate but sounds like it swallowed a physics textbook.
The model learns: “When comparing these two, rank the first one higher.”
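That “rank the first one higher” objective is the Bradley-Terry loss from the overview diagram, and it’s shorter than it sounds. A minimal sketch, assuming the reward model spits out one scalar score per response:

import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_scores, rejected_scores):
    """Pairwise ranking loss: shrinks as chosen scores beat rejected scores."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch of pairs
    return -F.logsigmoid(chosen_scores - rejected_scores).mean()

# Toy example: the chosen responses already score higher, so the loss is small
chosen = torch.tensor([1.2, 0.8])
rejected = torch.tensor([-0.3, 0.1])
print(bradley_terry_loss(chosen, rejected))  # roughly 0.30

If the model ranks a pair backwards, that pair’s loss grows quickly, which is exactly the pressure we want.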
Prompt Data (for RLHF)¶
Once we have a reward model, we can just give it prompts and let it generate responses, then score them.
{
"prompt": "Write a haiku about programming."
}

The model generates completions, the reward model scores them, and we use those scores to improve the policy. Rinse and repeat.
(We’ll see this in action much later.)
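In the meantime, here’s a toy sketch of the shape of that loop: generate, score, update. The reward below is a made-up stand-in (it just prefers roughly 12-word completions) so the cell runs on its own; in the real pipeline a trained reward model does the scoring and PPO does the updating.

# Toy RLHF data flow: generate a completion, score it, (pretend to) update
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
policy = AutoModelForCausalLM.from_pretrained("gpt2")

prompt = "Write a haiku about programming."
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    output = policy.generate(
        **inputs, max_new_tokens=30, do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
completion = tokenizer.decode(output[0, inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Stand-in reward: a real reward model would score (prompt, completion) here
reward = -abs(len(completion.split()) - 12) / 12.0

print(f"reward={reward:.2f}  completion={completion!r}")
# A real PPO step would now use this reward (minus a KL penalty) to nudge the policy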
# Let's make these concrete with actual Python data structures
# SFT data: instruction-response pairs
sft_data = [
{
"instruction": "What is Python?",
"response": "Python is a high-level programming language known for its readability and simplicity. It's great for beginners and powerful enough for experts."
},
{
"instruction": "Translate 'hello' to French",
"response": "'Hello' in French is 'Bonjour'."
}
]
# Preference data: one prompt, two competing responses
preference_data = [
{
"prompt": "Explain artificial intelligence briefly.",
"chosen": "AI is technology that enables machines to simulate human intelligence—learning from experience, recognizing patterns, and making decisions.",
"rejected": "AI." # Technically correct but...useless
}
]
# Let's look at what these actually contain
print("SFT Data Format:")
print(f" Keys: {list(sft_data[0].keys())}")
print(f" Example instruction: \"{sft_data[0]['instruction']}\"")
print()
print("Preference Data Format:")
print(f" Keys: {list(preference_data[0].keys())}")
print(f" Chosen response length: {len(preference_data[0]['chosen'])} chars")
print(f" Rejected response length: {len(preference_data[0]['rejected'])} chars")
print()
print("(Notice how the rejected response is way shorter? Sometimes bad answers are just...lazy.)")SFT Data Format:
Keys: ['instruction', 'response']
Example instruction: "What is Python?"
Preference Data Format:
Keys: ['prompt', 'chosen', 'rejected']
Chosen response length: 139 chars
Rejected response length: 3 chars
(Notice how the rejected response is way shorter? Sometimes bad answers are just...lazy.)
Training Progression¶
So you’ve got a base model. Now what?
Here’s the typical path from “raw autocomplete” to “helpful assistant”:
| Step | Input Model | Output Model | What Happens | Training Time* |
|---|---|---|---|---|
| 1. SFT | Base (GPT-2) | SFT Model | Learn to follow instructions | ~1 hour |
| 2. Reward | SFT Model | Reward Model | Learn to judge quality | ~30 min |
| 3a. RLHF | SFT Model + RM | RLHF Model | Optimize for high rewards | ~2 hours |
| 3b. DPO | SFT Model | DPO Model | Optimize preferences directly | ~1 hour |
*Approximate times for GPT-2 (124M params) on a single GPU. Your mileage may vary.
Notice step 3 splits? That’s the fork in the road. You can either:
Go the RLHF route: Train a reward model first, then use reinforcement learning (PPO) to optimize your policy against it. More complex, more moving parts, but this is what OpenAI used for GPT-4.
Go the DPO route: Skip the reward model entirely and optimize preferences directly. Simpler, faster, and honestly? Often just as good. This is the new hotness.
We’ll implement both so you understand the tradeoffs. (Because understanding > blindly following trends.)
Key Hyperparameters¶
Hyperparameters are the dials you turn to make training work. Each stage has different sweet spots.
Let me explain the reasoning behind these numbers (instead of just throwing them at you).
SFT (Supervised Fine-Tuning)¶
Learning rate: 2e-4 (that’s 0.0002)
Higher than pre-training! We’re making bigger updates because we’re teaching a new skill (see the optimizer sketch after this list)
But not too high or we’ll destroy what the model already knows
Batch size: 4-8
Small because we’re fine-tuning, not pre-training
Larger batches = more stable but more memory
Epochs: 3-5
A few passes through the data is usually enough
Too many and you overfit (model memorizes instead of generalizes)
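Here’s roughly how those SFT numbers get wired into an optimizer, assuming the usual Hugging Face AdamW-plus-linear-warmup setup (the total step count is just illustrative):

# Sketch: SFT learning rate + warmup wired into an optimizer and scheduler
import torch
from transformers import AutoModelForCausalLM, get_linear_schedule_with_warmup

model = AutoModelForCausalLM.from_pretrained("gpt2")
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)

num_training_steps = 1_000  # illustrative: roughly len(dataloader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=100,  # ramp the learning rate up gently at the start
    num_training_steps=num_training_steps,
)
# Inside the training loop: loss.backward(); optimizer.step(); scheduler.step()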
Reward Model¶
Learning rate: 1e-5 (that’s 0.00001)
Much lower! Reward models are sensitive
We want careful, stable learning of the preference ranking
Batch size: 4 (but each sample has 2 sequences)
We’re comparing pairs, so the effective batch size is 8 sequences (see the sketch after this list)
Epochs: 1
Just one pass! Reward models overfit easily
If you train too long, they memorize specific preferences instead of learning general quality
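To make the “2 sequences per sample” point concrete, here’s a sketch of how one preference batch turns into 8 sequences. It reuses the preference_data list from the earlier cell:

# Sketch: one preference batch = chosen sequences + rejected sequences
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default

pairs = preference_data * 4  # pretend we have a batch of 4 preference pairs

chosen_texts = [p["prompt"] + "\n" + p["chosen"] for p in pairs]
rejected_texts = [p["prompt"] + "\n" + p["rejected"] for p in pairs]

# The reward model scores all 8 sequences; the loss then compares them pairwise
encoded = tokenizer(chosen_texts + rejected_texts, return_tensors="pt", padding=True)
print(encoded["input_ids"].shape)  # torch.Size([8, seq_len])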
RLHF (with PPO)¶
Learning rate: 1e-6 (that’s 0.000001)
Tiny! RL is unstable, we need baby steps
Too high and training collapses (you’ll see divergence, mode collapse, gibberish)
KL coefficient: 0.1
This keeps the model close to the original SFT model
Prevents it from going off the rails chasing reward (both the KL penalty and PPO clipping are sketched after this list)
PPO epochs: 4
How many times we update on each batch of rollouts
Classic PPO sweet spot
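Those knobs show up in two small formulas: the KL-shaped reward and PPO’s clipped update. A minimal sketch of both, with plain tensors standing in for log-probabilities and advantages:

import torch

def kl_shaped_reward(reward_score, policy_logprob, ref_logprob, kl_coef=0.1):
    """Reward model score minus a penalty for drifting from the reference (SFT) model."""
    kl_estimate = policy_logprob - ref_logprob  # simple per-token KL estimate
    return reward_score - kl_coef * kl_estimate

def ppo_clipped_loss(new_logprobs, old_logprobs, advantages, clip_ratio=0.2):
    """PPO's clipped surrogate objective: big policy jumps get no extra credit."""
    ratio = torch.exp(new_logprobs - old_logprobs)  # how much the policy changed
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantages
    return -torch.min(unclipped, clipped).mean()  # take the pessimistic option

# Toy numbers: a policy that moved a lot gets its update clipped at 1 + 0.2
print(ppo_clipped_loss(torch.tensor([0.0]), torch.tensor([-1.0]), torch.tensor([2.0])))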
DPO (Direct Preference Optimization)¶
Learning rate: 1e-6
Same as RLHF—we’re doing preference learning, gotta be gentle
Beta (β): 0.1
Controls how strongly we optimize preferences
Higher = more aggressive, lower = more conservative (the sketch after this list shows where beta enters the loss)
Epochs: 1-3
DPO is more stable than PPO, can train a bit longer
But still, don’t overdo it
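And here’s where beta actually lives. A minimal sketch of the DPO loss, where each input is the summed log-probability of a whole response under either the policy or the frozen reference model:

import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO: make the policy prefer chosen over rejected more than the reference does."""
    policy_margin = policy_chosen_logp - policy_rejected_logp
    ref_margin = ref_chosen_logp - ref_rejected_logp
    # Beta scales how hard we push the policy's margin past the reference's
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy numbers: the policy already prefers "chosen" a bit more than the reference does
loss = dpo_loss(torch.tensor([-12.0]), torch.tensor([-15.0]),
                torch.tensor([-13.0]), torch.tensor([-14.0]))
print(loss)  # a bit under log(2) ≈ 0.69, since the margin points the right way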
The pattern? As we get further from standard supervised learning, we get more conservative. RL is temperamental.
# Here are those configurations in code
# (so you can see them all in one place)
sft_config = {
"learning_rate": 2e-4, # 0.0002 - higher for teaching new skills
"batch_size": 4,
"num_epochs": 3,
"max_length": 512, # truncate long sequences here
"warmup_steps": 100, # gradually increase LR at start
}
reward_config = {
"learning_rate": 1e-5, # 0.00001 - much lower, reward models are sensitive
"batch_size": 4, # but remember: 2 sequences per sample!
"num_epochs": 1, # just one pass to avoid overfitting
"max_length": 512,
}
ppo_config = {
"learning_rate": 1e-6, # 0.000001 - tiny! RL is unstable
"batch_size": 4,
"ppo_epochs": 4, # how many times to update per rollout batch
"kl_coef": 0.1, # keeps us close to reference model
"clip_ratio": 0.2, # PPO clipping (prevents huge updates)
}
dpo_config = {
"learning_rate": 1e-6, # same as PPO
"batch_size": 4,
"num_epochs": 1, # conservative - can go up to 3 if needed
"beta": 0.1, # preference optimization strength
}
print("Configuration summary:")
print(f" SFT learning rate: {sft_config['learning_rate']:.6f}")
print(f" Reward learning rate: {reward_config['learning_rate']:.6f}")
print(f" PPO learning rate: {ppo_config['learning_rate']:.6f}")
print(f" DPO learning rate: {dpo_config['learning_rate']:.6f}")
print()
print("Notice the pattern? Learning rates get smaller as training gets trickier.")Configuration summary:
SFT learning rate: 0.000200
Reward learning rate: 0.000010
PPO learning rate: 0.000001
DPO learning rate: 0.000001
Notice the pattern? Learning rates get smaller as training gets trickier.
Memory Considerations¶
Here’s the dirty secret about post-training: it’s expensive.
Not money expensive (well, also that), but memory expensive. Let me break down why.
| Stage | Models in Memory | Memory Factor | What’s Loaded |
|---|---|---|---|
| SFT | 1 model | 1x | Just the model we’re training |
| Reward | 1 model | 1x | Just the reward model (but 2 sequences/batch) |
| RLHF | 4 models | 4x | Policy, value net, reward model, reference model |
| DPO | 2 models | 2x | Policy, reference model |
See why RLHF is so painful? Four models in memory at once:
Policy model - the one we’re actually training
Value network - estimates expected future reward (RL thing)
Reward model - scores our generations
Reference model - the original SFT model we’re trying not to drift too far from
DPO cuts this in half by skipping the reward model and value network. Just policy + reference.
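Some back-of-the-envelope math makes those factors concrete. This sketch only counts fp32 weights and treats every copy as full GPT-2 size (in practice the value net and reward model can share a backbone, and activations add more on top):

# Rough memory math for GPT-2 (124M params), fp32 weights only
params = 124_000_000
weights_gb = params * 4 / 1e9  # 4 bytes per fp32 parameter, ~0.5 GB per copy

print(f"SFT  (1 model):  ~{1 * weights_gb:.1f} GB of weights")
print(f"DPO  (2 models): ~{2 * weights_gb:.1f} GB of weights")
print(f"RLHF (4 models): ~{4 * weights_gb:.1f} GB of weights")

# Every model you actually *train* also needs gradients + Adam states,
# which is roughly another 3x its weights
print(f"Extra per trained model: ~{3 * weights_gb:.1f} GB")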
How to fit this on a single GPU?¶
We’ve got tricks:
LoRA (Low-Rank Adaptation)
Instead of updating all parameters, we add small trainable adapters
Massively reduces memory for gradients and optimizer states
Like teaching someone by giving them a cheat sheet instead of rewriting their brain
Gradient checkpointing
Trade computation for memory
Recompute activations during backward pass instead of storing them
Slower but fits in VRAM
Mixed precision (fp16/bf16)
Use 16-bit floats instead of 32-bit
Cuts memory in half (roughly)
Modern GPUs are built for this
Gradient accumulation
Simulate larger batches by accumulating gradients over multiple steps
Peak memory stays the same as with small batches, but you get the stability of a larger effective batch
“I can’t lift 100 pounds at once, but I can make four trips with 25 pounds each”
Without these tricks? You’d need a data center. With them? You can do this on a consumer GPU.
(Well, for GPT-2 scale. If you want to fine-tune Llama-70B...get your credit card ready.)
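To make that list concrete, here’s a sketch of the four tricks wired together. It assumes the Hugging Face transformers and peft libraries, a GPU with bf16 support (or CPU), and illustrative numbers for the LoRA rank and accumulation steps; the real training code in later chapters is more careful about all of this.

# Sketch: LoRA + gradient checkpointing + bf16 autocast + gradient accumulation
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("gpt2")

# 1. LoRA: freeze the base model, train small low-rank adapters instead
lora_config = LoraConfig(r=8, lora_alpha=16, target_modules=["c_attn"],
                         lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # a fraction of a percent of 124M

# 2. Gradient checkpointing: recompute activations during the backward pass
model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # needed when checkpointing a mostly-frozen model

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
accumulation_steps = 8  # batch of 4, accumulated 8 times = effective batch of 32

# Dummy batch so the sketch runs end to end
texts = ["### Instruction:\nWhat is Python?\n\n### Response:\nA programming language."] * 4
batch = tokenizer(texts, return_tensors="pt", padding=True).to(device)
batch["labels"] = batch["input_ids"].clone()

for step in range(accumulation_steps):
    # 3. Mixed precision: bf16 autocast (no loss scaler needed, unlike fp16)
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(**batch).loss / accumulation_steps  # 4. gradient accumulation
    loss.backward()

optimizer.step()
optimizer.zero_grad()
print("one accumulated optimizer step done")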
Next Steps¶
Alright, you’ve got the bird’s eye view of the entire pipeline.
We’re going from base model → instruction-following → preference-aligned. Three stages (or four, if you count the RLHF/DPO fork).
Now let’s get our hands dirty.
Next up: Supervised Fine-Tuning (SFT). We’ll teach our model to follow instructions, format responses properly, and actually be useful.
It’s the foundation everything else builds on. Get this right, and the rest flows naturally. Get it wrong, and...well, you’ll be debugging reward models for a week.
Let’s go.