The Complete Pipeline¶
Alright, so we’re going to build something like this:
┌─────────────────────────────────────────────────────────────────────┐
│ SUPERVISED FINE-TUNING (SFT) │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model to follow instructions │
│ │
│ • Show it examples: "When asked X, respond with Y" │
│ • Use chat templates to format conversations │
│ • Only train on the responses (not the questions) │
│ • LoRA keeps this efficient (we'll explain later) │
│ │
│ Analogy: Teaching someone the *format* of good answers │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ REWARD MODELING │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model what "good" means │
│ │
│ • Show it pairs: "This answer is better than that one" │
│ • Train it to score responses (higher = better) │
│ • Bradley-Terry loss (fancy ranking math) │
│ • Evaluation: does it rank things the way humans would? │
│ │
│ Analogy: Teaching a judge to score gymnastics routines │
└─────────────────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ RLHF │ │ DPO │
├─────────────────────────┤ ├─────────────────────────┤
│ Two-stage approach │ │ One-stage shortcut │
│ │ │ │
│ • Train reward model │ │ • Skip reward model │
│ • Use it to train │ │ • Optimize preferences │
│ policy with PPO │ │ directly │
│ • Complex but powerful │ │ • Simpler, faster │
│ • Needs 4 models (!) │ │ • Only needs 2 models │
│ │ │ │
│ Classic method │ │ Modern alternative │
└─────────────────────────┘ └─────────────────────────┘
RLHF = Reinforcement Learning from Human Feedback
DPO = Direct Preference Optimization
PPO = Proximal Policy Optimization (the RL algorithm RLHF uses)
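One detail from the SFT box worth making concrete right away: "only train on the responses" usually means masking the prompt tokens out of the loss. Here's a minimal sketch of that idea; the token ids are made up, and -100 is PyTorch's standard ignore index for cross-entropy.
import torch
import torch.nn.functional as F

# Made-up token ids: the first prompt_len positions are the question, the rest the answer
prompt_len = 5
input_ids = torch.tensor([[101, 2054, 2003, 1029, 102, 3000, 2003, 1996, 3007, 102]])

# Labels are a copy of the inputs with prompt positions set to -100 (PyTorch's ignore
# index), so only response tokens contribute to the loss.
labels = input_ids.clone()
labels[:, :prompt_len] = -100

# Stand-in for model output: logits of shape (batch, seq_len, vocab_size)
vocab_size = 50257
logits = torch.randn(1, input_ids.shape[1], vocab_size)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),   # predict the next token at each position
    labels[:, 1:].reshape(-1),                # shifted targets, prompt masked out
    ignore_index=-100,
)
print(f"Response-only loss: {loss.item():.4f}")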
Module Organization¶
Our code is split into clean modules. Each one handles a different stage of the pipeline.
| Module | Purpose | Key Functions |
|---|---|---|
| sft/ | Supervised Fine-Tuning | SFTTrainer, format_instruction |
| reward/ | Reward Model Training | RewardModel, RewardModelTrainer |
| rlhf/ | RLHF with PPO | PPOTrainer, ValueNetwork, RolloutBuffer |
| dpo/ | Direct Preference Optimization | DPOTrainer, compute_dpo_loss |
| utils/ | Shared Utilities | load_model_and_tokenizer, setup_device |
Think of these as separate kitchens in a restaurant. Each one specializes in a different course of the meal.
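If you want a mental picture of how this maps to code, the imports would look roughly like this. The names come straight from the table above; treat the exact paths and layout as illustrative.
# Illustrative import layout based on the table above (exact paths may differ)
from utils import load_model_and_tokenizer, setup_device   # shared helpers
from sft import SFTTrainer, format_instruction              # stage 1: SFT
from reward import RewardModel, RewardModelTrainer          # stage 2: reward model
from rlhf import PPOTrainer, ValueNetwork, RolloutBuffer    # stage 3a: RLHF
from dpo import DPOTrainer, compute_dpo_loss                # stage 3b: DPO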
Data Formats¶
Each training stage expects a different data format.
SFT (Supervised Fine-Tuning) Data¶
Simple input-output pairs. Question and answer. Instruction and response.
{
"instruction": "What is the capital of France?",
"response": "The capital of France is Paris."
}
The model learns "when you see this format of question, generate this format of answer."
Preference Data (for Reward Model & DPO)¶
Instead of just one answer, we show the model two answers to the same prompt. One good, one bad.
{
"prompt": "Explain quantum computing simply.",
"chosen": "Imagine a coin spinning in the air. It's both heads and tails until it lands. Quantum computers work with information in that 'spinning' state, processing multiple possibilities simultaneously.",
"rejected": "Quantum computers use qubits which leverage quantum superposition and entanglement to perform computations exponentially faster than classical computers by exploiting quantum mechanical phenomena."
}
The "chosen" response is simple, clear, uses an analogy. The "rejected" one is technically accurate but sounds like it swallowed a physics textbook.
The model learns: “When comparing these two, rank the first one higher.”
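Peeking ahead at the math: the reward model turns this ranking into a training signal with the Bradley-Terry loss from the diagram. Here's a minimal sketch with made-up reward values (a real reward model would produce these scores):
import torch
import torch.nn.functional as F

# Made-up scalar rewards a reward model might assign to each response
reward_chosen = torch.tensor([1.3])
reward_rejected = torch.tensor([-0.4])

# Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
# Minimizing it pushes the chosen score above the rejected score.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"Pairwise ranking loss: {loss.item():.4f}")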
Prompt Data (for RLHF)¶
Once we have a reward model, we can just give it prompts and let it generate responses, then score them.
{
"prompt": "Write a haiku about programming."
}
The model generates completions, the reward model scores them, and we use those scores to improve the policy.
# Let's make these concrete with actual Python data structures
# SFT data: instruction-response pairs
sft_data = [
{
"instruction": "What is Python?",
"response": "Python is a high-level programming language known for its readability and simplicity. It's great for beginners and powerful enough for experts."
},
{
"instruction": "Translate 'hello' to French",
"response": "'Hello' in French is 'Bonjour'."
}
]
# Preference data: one prompt, two competing responses
preference_data = [
{
"prompt": "Explain artificial intelligence briefly.",
"chosen": "AI is technology that enables machines to simulate human intelligence. Learning from experience, recognizing patterns, and making decisions.",
"rejected": "AI." # Technically correct but useless
}
]
# Let's look at what these actually contain
print("SFT Data Format:")
print(f" Keys: {list(sft_data[0].keys())}")
print(f" Example instruction: \"{sft_data[0]['instruction']}\"")
print()
print("Preference Data Format:")
print(f" Keys: {list(preference_data[0].keys())}")
print(f" Chosen response length: {len(preference_data[0]['chosen'])} chars")
print(f" Rejected response length: {len(preference_data[0]['rejected'])} chars")SFT Data Format:
Keys: ['instruction', 'response']
Example instruction: "What is Python?"
Preference Data Format:
Keys: ['prompt', 'chosen', 'rejected']
Chosen response length: 140 chars
Rejected response length: 3 chars
Training Progression¶
Here’s the typical path from “raw autocomplete” to “helpful assistant”:
| Step | Input Model | Output Model | What Happens | Training Time* |
|---|---|---|---|---|
| 1. SFT | Base (GPT-2) | SFT Model | Learn to follow instructions | ~1 hour |
| 2. Reward | SFT Model | Reward Model | Learn to judge quality | ~30 min |
| 3a. RLHF | SFT Model + RM | RLHF Model | Optimize for high rewards | ~2 hours |
| 3b. DPO | SFT Model | DPO Model | Optimize preferences directly | ~1 hour |
*Approximate times for GPT-2 (124M params) on a single GPU. Your mileage may vary.
Notice step 3 splits? That’s the fork in the road. You can either:
Go the RLHF route: Train a reward model first, then use reinforcement learning (PPO) to optimize your policy against it. More complex, more moving parts, but this is what OpenAI used for GPT-4.
Go the DPO route: Skip the reward model entirely and optimize preferences directly. Simpler, faster, and often just as good. This is more recent.
We’ll implement both so you understand the tradeoffs.
Key Hyperparameters¶
Hyperparameters are the dials you turn to make training work. Each stage has different sweet spots.
SFT (Supervised Fine-Tuning)¶
- Learning rate: 2e-4 (that's 0.0002)
  - Higher than pre-training! We're making bigger updates because we're teaching a new skill
  - But not too high, or we'll destroy what the model already knows
- Batch size: 4-8
  - Small because we're fine-tuning, not pre-training
  - Larger batches = more stable but more memory
- Epochs: 3-5
  - A few passes through the data is usually enough
  - Too many and you overfit (the model memorizes instead of generalizes)
Reward Model¶
- Learning rate: 1e-5 (that's 0.00001)
  - Much lower! Reward models are sensitive
  - We want careful, stable learning of the preference ranking
- Batch size: 4 (but each sample has 2 sequences)
  - We're comparing pairs, so the effective batch size is 8 sequences
- Epochs: 1
  - Just one pass! Reward models overfit easily
  - If you train too long, they memorize specific preferences instead of learning general quality
RLHF (with PPO)¶
- Learning rate: 1e-6 (that's 0.000001)
  - Tiny! RL is unstable, we need baby steps
  - Too high and training collapses (you'll see divergence, mode collapse, gibberish)
- KL coefficient: 0.1
  - This keeps the model close to the original SFT model
  - Prevents it from going off the rails chasing reward
- PPO epochs: 4
  - How many times we update on each batch of rollouts
  - Classic PPO sweet spot
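To see how the KL coefficient and PPO clipping interact, here's a toy single-batch sketch of the PPO objective with a KL penalty folded into the reward. The log-probabilities and rewards are made up, and the advantages are just mean-centered shaped rewards; the real implementation uses a value network for this.
import torch

# Toy tensors standing in for one batch of rollouts (all values are made up)
logp_new = torch.tensor([-1.0, -2.1, -0.7])    # log-probs under the current policy
logp_old = torch.tensor([-1.2, -2.0, -0.9])    # log-probs at rollout time
logp_ref = torch.tensor([-1.1, -2.3, -0.8])    # log-probs under the frozen SFT reference
reward = torch.tensor([0.5, 1.2, -0.3])         # reward model scores

kl_coef, clip_ratio = 0.1, 0.2

# KL penalty keeps the policy close to the reference model
shaped_reward = reward - kl_coef * (logp_new - logp_ref)

# PPO clipped surrogate objective (advantages = mean-centered shaped rewards, for simplicity)
advantage = shaped_reward - shaped_reward.mean()
ratio = torch.exp(logp_new - logp_old)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantage
policy_loss = -torch.min(unclipped, clipped).mean()
print(f"PPO policy loss: {policy_loss.item():.4f}")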
DPO (Direct Preference Optimization)¶
- Learning rate: 1e-6
  - Same as RLHF. We're doing preference learning, gotta be gentle
- Beta (β): 0.1
  - Controls how strongly we optimize preferences
  - Higher = more aggressive, lower = more conservative
- Epochs: 1-3
  - DPO is more stable than PPO, so you can train a bit longer
  - But still, don't overdo it
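And here's what beta actually does, in a minimal sketch of the DPO loss with made-up log-probabilities (summed over each response). Larger beta amplifies the gap between the policy's preference margin and the reference model's.
import torch
import torch.nn.functional as F

beta = 0.1

# Made-up summed log-probs of the chosen/rejected responses under each model
policy_chosen, policy_rejected = torch.tensor([-20.0]), torch.tensor([-24.0])
ref_chosen, ref_rejected = torch.tensor([-21.0]), torch.tensor([-23.0])

# DPO loss: -log sigmoid( beta * (policy log-ratio - reference log-ratio) )
pi_logratio = policy_chosen - policy_rejected
ref_logratio = ref_chosen - ref_rejected
loss = -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
print(f"DPO loss: {loss.item():.4f}")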
# Here are those configurations in code
# (so you can see them all in one place)
sft_config = {
"learning_rate": 2e-4, # 0.0002 - higher for teaching new skills
"batch_size": 4,
"num_epochs": 3,
"max_length": 512, # truncate long sequences here
"warmup_steps": 100, # gradually increase LR at start
}
reward_config = {
"learning_rate": 1e-5, # 0.00001 - much lower, reward models are sensitive
"batch_size": 4, # but remember: 2 sequences per sample!
"num_epochs": 1, # just one pass to avoid overfitting
"max_length": 512,
}
ppo_config = {
"learning_rate": 1e-6, # 0.000001 - tiny! RL is unstable
"batch_size": 4,
"ppo_epochs": 4, # how many times to update per rollout batch
"kl_coef": 0.1, # keeps us close to reference model
"clip_ratio": 0.2, # PPO clipping (prevents huge updates)
}
dpo_config = {
"learning_rate": 1e-6, # same as PPO
"batch_size": 4,
"num_epochs": 1, # conservative - can go up to 3 if needed
"beta": 0.1, # preference optimization strength
}
print("Configuration summary:")
print(f" SFT learning rate: {sft_config['learning_rate']:.6f}")
print(f" Reward learning rate: {reward_config['learning_rate']:.6f}")
print(f" PPO learning rate: {ppo_config['learning_rate']:.6f}")
print(f" DPO learning rate: {dpo_config['learning_rate']:.6f}")
print()
print("Notice the pattern? Learning rates get smaller as training gets trickier.")Configuration summary:
SFT learning rate: 0.000200
Reward learning rate: 0.000010
PPO learning rate: 0.000001
DPO learning rate: 0.000001
Notice the pattern? Learning rates get smaller as training gets trickier.
Memory Considerations¶
Post-training is expensive. In both money and memory.
| Stage | Models in Memory | Memory Factor | What’s Loaded |
|---|---|---|---|
| SFT | 1 model | 1x | Just the model we’re training |
| Reward | 1 model | 1x | Just the reward model (but 2 sequences/batch) |
| RLHF | 4 models | 4x | Policy, value net, reward model, reference model |
| DPO | 2 models | 2x | Policy, reference model |
RLHF requires four models in memory at once:
1. Policy model - the one we're actually training
2. Value network - estimates expected future reward (RL thing)
3. Reward model - scores our generations
4. Reference model - the original SFT model we're trying not to drift too far from
DPO cuts this in half by skipping the reward model and value network. Just policy + reference.
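A quick back-of-the-envelope calculation makes the table concrete. This sketch assumes the common 16-bytes-per-parameter rule of thumb for mixed-precision Adam training, full-size copies of every model (in practice the value network often shares a backbone), and ignores activation memory entirely.
# Rough memory estimate per model copy (rule-of-thumb numbers, activations not included)
params = 124e6                     # GPT-2 small

train_bytes = 16 * params          # fp16 weights + grads + fp32 master weights + Adam states
frozen_bytes = 2 * params          # frozen models only need fp16 weights

def gb(num_bytes):
    return num_bytes / 1e9

print(f"SFT   (1 trainable):             {gb(train_bytes):.1f} GB")
print(f"DPO   (1 trainable + 1 frozen):  {gb(train_bytes + frozen_bytes):.1f} GB")
print(f"RLHF  (2 trainable + 2 frozen):  {gb(2*train_bytes + 2*frozen_bytes):.1f} GB")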
How to fit this on a single GPU?¶
We've got tricks (sketched in code right after this list):
- LoRA (Low-Rank Adaptation)
  - Instead of updating all parameters, we add small trainable adapters
  - Massively reduces memory for gradients and optimizer states
  - Like teaching someone by giving them a cheat sheet instead of rewriting their brain
- Gradient checkpointing
  - Trade computation for memory
  - Recompute activations during the backward pass instead of storing them
  - Slower, but it fits in VRAM
- Mixed precision (fp16/bf16)
  - Use 16-bit floats instead of 32-bit
  - Cuts memory roughly in half
  - Modern GPUs are built for this
- Gradient accumulation
  - Simulate larger batches by accumulating gradients over multiple steps
  - Doesn't reduce peak memory, but improves training stability
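Here's a minimal sketch of how these four tricks are typically wired up with Hugging Face transformers and peft. The LoRA settings, learning rate, and dummy batch are illustrative, not the exact configuration we'll use later.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Mixed precision: load the base model in 16-bit (bf16) instead of fp32
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

# LoRA: freeze the base weights and add small trainable adapter matrices
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["c_attn"])
model = get_peft_model(model, lora_config)

# Gradient checkpointing: recompute activations during backward instead of storing them
model.enable_input_require_grads()     # lets checkpointed blocks receive gradients with LoRA
model.gradient_checkpointing_enable()

# Gradient accumulation: take one optimizer step every N micro-batches
accumulation_steps = 8
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
dummy_batch = {"input_ids": torch.randint(0, 50257, (2, 16))}
dummy_batch["labels"] = dummy_batch["input_ids"].clone()   # stand-in for real SFT data

for step in range(accumulation_steps):
    loss = model(**dummy_batch).loss / accumulation_steps
    loss.backward()
optimizer.step()
optimizer.zero_grad()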
Without these tricks? You'd need a data center. With them? You can run the whole pipeline at GPT-2 scale on a single GPU.
Next Steps¶
You’ve got the bird’s eye view of the entire pipeline.
We’re going from base model → instruction-following → preference-aligned. Three stages (or four, if you count the RLHF/DPO fork).
Next up: Supervised Fine-Tuning (SFT). We’ll teach our model to follow instructions, format responses properly, and actually be useful.
It’s the foundation everything else builds on.
