Project Overview

The Complete Pipeline

Alright, so we’re going to build something like this:

┌─────────────────────────────────────────────────────────────────────┐
│                     SUPERVISED FINE-TUNING (SFT)                    │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model to follow instructions                           │
│                                                                     │
│ • Show it examples: "When asked X, respond with Y"                  │
│ • Use chat templates to format conversations                        │
│ • Only train on the responses (not the questions)                   │
│ • LoRA keeps this efficient (we'll explain later)                   │
│                                                                     │
│ Analogy: Teaching someone the *format* of good answers              │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      REWARD MODELING                                │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model what "good" means                                │
│                                                                     │
│ • Show it pairs: "This answer is better than that one"              │
│ • Train it to score responses (higher = better)                     │
│ • Bradley-Terry loss (fancy ranking math)                           │
│ • Evaluation: does it rank things the way humans would?             │
│                                                                     │
│ Analogy: Teaching a judge to score gymnastics routines              │
└─────────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│         RLHF            │     │          DPO            │
├─────────────────────────┤     ├─────────────────────────┤
│ Two-stage approach      │     │ One-stage shortcut      │
│                         │     │                         │
│ • Train reward model    │     │ • Skip reward model     │
│ • Use it to train       │     │ • Optimize preferences  │
│   policy with PPO       │     │   directly              │
│ • Complex but powerful  │     │ • Simpler, faster       │
│ • Needs 4 models (!)    │     │ • Only needs 2 models   │
│                         │     │                         │
│ Classic method          │     │ Modern alternative      │
└─────────────────────────┘     └─────────────────────────┘

RLHF = Reinforcement Learning from Human Feedback
DPO = Direct Preference Optimization
PPO = Proximal Policy Optimization (the RL algorithm RLHF uses)

Module Organization

Our code is split into clean modules. Each one handles a different stage of the pipeline.

| Module  | Purpose                        | Key Functions                           |
|---------|--------------------------------|-----------------------------------------|
| sft/    | Supervised Fine-Tuning         | SFTTrainer, format_instruction          |
| reward/ | Reward Model Training          | RewardModel, RewardModelTrainer         |
| rlhf/   | RLHF with PPO                  | PPOTrainer, ValueNetwork, RolloutBuffer |
| dpo/    | Direct Preference Optimization | DPOTrainer, compute_dpo_loss            |
| utils/  | Shared Utilities               | load_model_and_tokenizer, setup_device  |

Think of these as separate kitchens in a restaurant. Each one specializes in a different course of the meal.
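
If you’re wondering how the kitchens connect, here’s a rough import sketch. The module and function names come straight from the table above; the comments mapping each one to its data are ours, and the exact call signatures are pinned down in later chapters.

# Rough map of the modules above to the pipeline stages.
# Names are from the table; how you call them is spelled out later.
from utils import load_model_and_tokenizer, setup_device   # shared setup helpers
from sft import SFTTrainer                                  # stage 1: instruction data -> SFT model
from reward import RewardModel, RewardModelTrainer          # stage 2: preference data -> reward model
from rlhf import PPOTrainer                                  # stage 3a: prompts + reward model -> RLHF model
from dpo import DPOTrainer                                   # stage 3b: preference data -> DPO model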

Data Formats

Each training stage expects a different data format.

SFT (Supervised Fine-Tuning) Data

Simple input-output pairs. Question and answer. Instruction and response.

{
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris."
}

The model learns “when you see this format of question, generate this format of answer.”
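
One wrinkle worth seeing early: the diagram said “only train on the responses (not the questions).” In code, that means masking the prompt tokens out of the loss. Here’s a minimal sketch using the standard PyTorch convention that a label of -100 means “skip this position.” The prompt template is just illustrative; we’ll switch to proper chat templates in the SFT chapter.

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

example = {
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris.",
}

# Illustrative prompt format (not the final chat template)
prompt = f"### Instruction:\n{example['instruction']}\n\n### Response:\n"

prompt_ids = tokenizer(prompt)["input_ids"]
response_ids = tokenizer(example["response"] + tokenizer.eos_token)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids   # -100 = no loss on the prompt tokens

print(f"{len(input_ids)} tokens total, loss computed on {len(response_ids)} of them")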

Preference Data (for Reward Model & DPO)

Instead of just one answer, we show the model two answers to the same prompt. One good, one bad.

{
    "prompt": "Explain quantum computing simply.",
    "chosen": "Imagine a coin spinning in the air. It's both heads and tails until it lands. Quantum computers work with information in that 'spinning' state, processing multiple possibilities simultaneously.",
    "rejected": "Quantum computers use qubits which leverage quantum superposition and entanglement to perform computations exponentially faster than classical computers by exploiting quantum mechanical phenomena."
}

The “chosen” response is simple, clear, uses an analogy. The “rejected” one is technically accurate but sounds like it swallowed a physics textbook.

The model learns: “When comparing these two, rank the first one higher.”

Prompt Data (for RLHF)

Once we have a reward model, we can just give it prompts and let it generate responses, then score them.

{
    "prompt": "Write a haiku about programming."
}

The model generates completions, the reward model scores them, and we use those scores to improve the policy.
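
Here’s that loop as a toy sketch. The two functions below are stand-ins for the real policy and reward model (they just return canned values so this runs); the thing to notice is the shape of what flows through: prompt in, response and score out, and those triples feed the PPO update.

# Stand-ins for the real models, so the loop runs on its own
def generate_response(prompt: str) -> str:
    return "Null pointer again / the stack trace scrolls forever / coffee solves nothing"

def score_response(prompt: str, response: str) -> float:
    # toy heuristic in place of the reward model: prefer medium-length responses
    return 1.0 / (1.0 + abs(len(response.split()) - 12))

prompt_data = [{"prompt": "Write a haiku about programming."}]

rollouts = []
for item in prompt_data:
    response = generate_response(item["prompt"])
    reward = score_response(item["prompt"], response)
    rollouts.append({"prompt": item["prompt"], "response": response, "reward": reward})

print(rollouts[0])   # in real RLHF, batches of these feed the PPO update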

# Let's make these concrete with actual Python data structures

# SFT data: instruction-response pairs
sft_data = [
    {
        "instruction": "What is Python?", 
        "response": "Python is a high-level programming language known for its readability and simplicity. It's great for beginners and powerful enough for experts."
    },
    {
        "instruction": "Translate 'hello' to French", 
        "response": "'Hello' in French is 'Bonjour'."
    }
]

# Preference data: one prompt, two competing responses
preference_data = [
    {
        "prompt": "Explain artificial intelligence briefly.",
        "chosen": "AI is technology that enables machines to simulate human intelligence. Learning from experience, recognizing patterns, and making decisions.",
        "rejected": "AI."  # Technically correct but useless
    }
]

# Let's look at what these actually contain
print("SFT Data Format:")
print(f"  Keys: {list(sft_data[0].keys())}")
print(f"  Example instruction: \"{sft_data[0]['instruction']}\"")
print()

print("Preference Data Format:")
print(f"  Keys: {list(preference_data[0].keys())}")
print(f"  Chosen response length: {len(preference_data[0]['chosen'])} chars")
print(f"  Rejected response length: {len(preference_data[0]['rejected'])} chars")
SFT Data Format:
  Keys: ['instruction', 'response']
  Example instruction: "What is Python?"

Preference Data Format:
  Keys: ['prompt', 'chosen', 'rejected']
  Chosen response length: 140 chars
  Rejected response length: 3 chars

Training Progression

Here’s the typical path from “raw autocomplete” to “helpful assistant”:

| Step      | Input Model    | Output Model | What Happens                  | Training Time* |
|-----------|----------------|--------------|-------------------------------|----------------|
| 1. SFT    | Base (GPT-2)   | SFT Model    | Learn to follow instructions  | ~1 hour        |
| 2. Reward | SFT Model      | Reward Model | Learn to judge quality        | ~30 min        |
| 3a. RLHF  | SFT Model + RM | RLHF Model   | Optimize for high rewards     | ~2 hours       |
| 3b. DPO   | SFT Model      | DPO Model    | Optimize preferences directly | ~1 hour        |

*Approximate times for GPT-2 (124M params) on a single GPU. Your mileage may vary.

Notice step 3 splits? That’s the fork in the road. You can either:

  • Go the RLHF route: Train a reward model first, then use reinforcement learning (PPO) to optimize your policy against it. More complex, more moving parts, but this is what OpenAI used for GPT-4.

  • Go the DPO route: Skip the reward model entirely and optimize preferences directly. Simpler, faster, and often just as good. This is more recent.

We’ll implement both so you understand the tradeoffs.

Key Hyperparameters

Hyperparameters are the dials you turn to make training work. Each stage has different sweet spots.

SFT (Supervised Fine-Tuning)

  • Learning rate: 2e-4 (that’s 0.0002)

    • Higher than you’d use for full fine-tuning! With LoRA we’re only updating small adapters, so they can afford bigger steps

    • But not too high or we’ll destroy what the model already knows

  • Batch size: 4-8

    • Small because we’re fine-tuning, not pre-training

    • Larger batches = more stable but more memory

  • Epochs: 3-5

    • A few passes through the data is usually enough

    • Too many and you overfit (model memorizes instead of generalizes)

Reward Model

  • Learning rate: 1e-5 (that’s 0.00001)

    • Much lower! Reward models are sensitive

    • We want careful, stable learning of the preference ranking

  • Batch size: 4 (but each sample has 2 sequences)

    • We’re comparing pairs, so the effective batch size is 8 sequences (see the loss sketch after this list)

  • Epochs: 1

    • Just one pass! Reward models overfit easily

    • If you train too long, they memorize specific preferences instead of learning general quality
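
Here’s the “Bradley-Terry loss (fancy ranking math)” from the diagram, stripped down to its core. A minimal sketch, not the full RewardModelTrainer: score both responses, then push the chosen score above the rejected one.

import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor, rejected_rewards: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry: P(chosen beats rejected) = sigmoid(r_chosen - r_rejected).
    # Maximizing that probability = minimizing -logsigmoid of the margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Pretend the reward model scored a batch of 4 preference pairs (8 sequences total)
chosen = torch.tensor([1.2, 0.3, 2.0, -0.1])
rejected = torch.tensor([0.4, 0.5, -1.0, -0.2])

print(bradley_terry_loss(chosen, rejected))   # shrinks as chosen scores beat rejected ones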

RLHF (with PPO)

  • Learning rate: 1e-6 (that’s 0.000001)

    • Tiny! RL is unstable, we need baby steps

    • Too high and training collapses (you’ll see divergence, mode collapse, gibberish)

  • KL coefficient: 0.1

    • This keeps the model close to the original SFT model

    • Prevents it from going off the rails chasing reward (see the sketch after this list)

  • PPO epochs: 4

    • How many times we update on each batch of rollouts

    • Classic PPO sweet spot
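
Here’s roughly where the KL coefficient and the clip ratio (you’ll see clip_ratio in the config below) show up. A simplified single-response sketch with made-up numbers, not the full PPOTrainer: the KL term drags the policy back toward the reference model, and the clipped objective caps how far one update can move it.

import torch

kl_coef, clip_ratio = 0.1, 0.2

# Per-token log-probs of the sampled response under three snapshots (toy numbers)
logp_policy = torch.tensor([-1.00, -0.80, -1.20])   # policy being trained right now
logp_ref    = torch.tensor([-1.10, -0.90, -1.00])   # frozen SFT reference model
logp_old    = torch.tensor([-1.05, -0.85, -1.15])   # policy snapshot that generated the rollout
reward      = torch.tensor(0.9)                     # reward model's score for the response
advantage   = torch.tensor([0.5, 0.5, 0.5])         # comes from the value network in the real thing

# KL penalty: subtract kl_coef * (how far we've drifted from the reference) from the reward
kl = (logp_policy - logp_ref).sum()
shaped_reward = reward - kl_coef * kl

# PPO clipped surrogate: don't let one update move too far from the rollout policy
ratio = torch.exp(logp_policy - logp_old)
policy_loss = -torch.min(ratio * advantage,
                         torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantage).mean()

print(f"shaped reward: {shaped_reward:.3f}, policy loss: {policy_loss:.3f}")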

DPO (Direct Preference Optimization)

  • Learning rate: 1e-6

    • Same as RLHF. We’re doing preference learning, gotta be gentle

  • Beta (β): 0.1

    • Controls how strongly we optimize preferences (see the loss sketch after this list)

    • Higher = more aggressive, lower = more conservative

  • Epochs: 1-3

    • DPO is more stable than PPO, can train a bit longer

    • But still, don’t overdo it
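
And here’s what that beta actually does. A minimal sketch of the DPO loss (the real compute_dpo_loss has more plumbing, but the heart of it is this one line): widen the gap between chosen and rejected more than the reference model already does, with beta setting how hard to push.

import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    policy_margin = logp_chosen - logp_rejected          # how much the policy prefers chosen
    ref_margin = ref_logp_chosen - ref_logp_rejected     # how much the reference already did
    # Higher beta = push harder to beat the reference margin; lower beta = stay conservative
    return -F.logsigmoid(beta * (policy_margin - ref_margin)).mean()

# Toy sequence-level log-probs for a single preference pair
loss = dpo_loss(
    logp_chosen=torch.tensor([-12.0]), logp_rejected=torch.tensor([-11.0]),
    ref_logp_chosen=torch.tensor([-12.5]), ref_logp_rejected=torch.tensor([-11.2]),
)
print(loss)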

# Here are those configurations in code
# (so you can see them all in one place)

sft_config = {
    "learning_rate": 2e-4,      # 0.0002 - higher for teaching new skills
    "batch_size": 4,
    "num_epochs": 3,
    "max_length": 512,          # truncate long sequences here
    "warmup_steps": 100,        # gradually increase LR at start
}

reward_config = {
    "learning_rate": 1e-5,      # 0.00001 - much lower, reward models are sensitive
    "batch_size": 4,            # but remember: 2 sequences per sample!
    "num_epochs": 1,            # just one pass to avoid overfitting
    "max_length": 512,
}

ppo_config = {
    "learning_rate": 1e-6,      # 0.000001 - tiny! RL is unstable
    "batch_size": 4,
    "ppo_epochs": 4,            # how many times to update per rollout batch
    "kl_coef": 0.1,             # keeps us close to reference model
    "clip_ratio": 0.2,          # PPO clipping (prevents huge updates)
}

dpo_config = {
    "learning_rate": 1e-6,      # same as PPO
    "batch_size": 4,
    "num_epochs": 1,            # conservative - can go up to 3 if needed
    "beta": 0.1,                # preference optimization strength
}

print("Configuration summary:")
print(f"  SFT learning rate:    {sft_config['learning_rate']:.6f}")
print(f"  Reward learning rate: {reward_config['learning_rate']:.6f}")
print(f"  PPO learning rate:    {ppo_config['learning_rate']:.6f}")
print(f"  DPO learning rate:    {dpo_config['learning_rate']:.6f}")
print()
print("Notice the pattern? Learning rates get smaller as training gets trickier.")
Configuration summary:
  SFT learning rate:    0.000200
  Reward learning rate: 0.000010
  PPO learning rate:    0.000001
  DPO learning rate:    0.000001

Notice the pattern? Learning rates get smaller as training gets trickier.

Memory Considerations

Post-training is expensive. In both money and memory.

| Stage  | Models in Memory | Memory Factor | What’s Loaded                                    |
|--------|------------------|---------------|--------------------------------------------------|
| SFT    | 1 model          | 1x            | Just the model we’re training                    |
| Reward | 1 model          | 1x            | Just the reward model (but 2 sequences/batch)    |
| RLHF   | 4 models         | 4x            | Policy, value net, reward model, reference model |
| DPO    | 2 models         | 2x            | Policy, reference model                          |

RLHF requires four models in memory at once:

  1. Policy model - the one we’re actually training

  2. Value network - estimates expected future reward (RL thing)

  3. Reward model - scores our generations

  4. Reference model - the original SFT model we’re trying not to drift too far from

DPO cuts this in half by skipping the reward model and value network. Just policy + reference.
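
Some back-of-envelope numbers make the 4x vs 2x concrete. Rough fp32 math for GPT-2 small, ignoring activations, KV caches, and framework overhead, and treating the value network as a full trained copy:

params = 124_000_000                  # GPT-2 small
copy_gb = params * 4 / 1e9            # ~0.5 GB of fp32 weights per model copy

# A copy being trained also carries gradients + Adam's two moment buffers:
# roughly three extra weight-sized allocations.
train_extra_gb = 3 * copy_gb

print(f"Weights per copy:               ~{copy_gb:.1f} GB")
print(f"SFT  (1 trained):               ~{copy_gb + train_extra_gb:.1f} GB")
print(f"DPO  (1 trained + 1 frozen):    ~{2 * copy_gb + train_extra_gb:.1f} GB")
print(f"RLHF (2 trained + 2 frozen):    ~{4 * copy_gb + 2 * train_extra_gb:.1f} GB")
# LoRA shrinks the gradient/optimizer part dramatically, which is exactly why we use it (see below).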

How to fit this on a single GPU?

We’ve got tricks:

  • LoRA (Low-Rank Adaptation)

    • Instead of updating all parameters, we add small trainable adapters

    • Massively reduces memory for gradients and optimizer states

    • Like teaching someone by giving them a cheat sheet instead of rewriting their brain

  • Gradient checkpointing

    • Trade computation for memory

    • Recompute activations during backward pass instead of storing them

    • Slower but fits in VRAM

  • Mixed precision (fp16/bf16)

    • Use 16-bit floats instead of 32-bit

    • Cuts memory in half (roughly)

    • Modern GPUs are built for this

  • Gradient accumulation

    • Simulate larger batches by accumulating gradients over multiple steps

    • Doesn’t reduce peak memory on its own, but gives you the stability of a bigger effective batch from micro-batches that do fit (see the sketch below)

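Gradient accumulation in particular is only a few lines. A self-contained toy (swap in the real model and dataloader):

import torch
import torch.nn as nn

# Toy model and micro-batches so this runs standalone
model = nn.Linear(16, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
micro_batches = [(torch.randn(2, 16), torch.randn(2, 1)) for _ in range(8)]

accumulation_steps = 4   # 4 micro-batches of 2 ≈ one effective batch of 8

optimizer.zero_grad()
for step, (x, y) in enumerate(micro_batches):
    loss = nn.functional.mse_loss(model(x), y) / accumulation_steps   # scale so gradients average
    loss.backward()                                                   # gradients accumulate in .grad
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()        # one real update per effective batch
        optimizer.zero_grad()
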
Without these tricks? You’d need a data center. With them? The whole pipeline fits at GPT-2 scale on a single GPU.

Next Steps

You’ve got the bird’s eye view of the entire pipeline.

We’re going from base model → instruction-following → preference-aligned. Three stages (or four, if you count the RLHF/DPO fork).

Next up: Supervised Fine-Tuning (SFT). We’ll teach our model to follow instructions, format responses properly, and actually be useful.

It’s the foundation everything else builds on.