
Project Overview

The Complete Pipeline

Alright, let’s talk about what happens after you train a base model.

You’ve got a language model that can complete text. Great! But it doesn’t follow instructions. It doesn’t know what you actually want when you ask it a question. It’s like a really smart autocomplete machine that just continues whatever pattern you started.

So how do we go from “autocomplete machine” to “helpful assistant”?

Three stages. Each builds on the last:

┌─────────────────────────────────────────────────────────────────────┐
│                     SUPERVISED FINE-TUNING (SFT)                    │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model to follow instructions                           │
│                                                                      │
│ • Show it examples: "When asked X, respond with Y"                  │
│ • Use chat templates to format conversations                        │
│ • Only train on the responses (not the questions)                   │
│ • LoRA keeps this efficient (we'll explain later)                   │
│                                                                      │
│ Analogy: Teaching someone the *format* of good answers              │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                      REWARD MODELING                                │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model what "good" means                                │
│                                                                      │
│ • Show it pairs: "This answer is better than that one"              │
│ • Train it to score responses (higher = better)                     │
│ • Bradley-Terry loss (fancy ranking math)                           │
│ • Evaluation: does it rank things the way humans would?             │
│                                                                      │
│ Analogy: Teaching a judge to score gymnastics routines              │
└─────────────────────────────────────────────────────────────────────┘
                              │
              ┌───────────────┴───────────────┐
              ▼                               ▼
┌─────────────────────────┐     ┌─────────────────────────┐
│         RLHF            │     │          DPO            │
├─────────────────────────┤     ├─────────────────────────┤
│ Two-stage approach      │     │ One-stage shortcut      │
│                         │     │                         │
│ • Train reward model    │     │ • Skip reward model     │
│ • Use it to train       │     │ • Optimize preferences  │
│   policy with PPO       │     │   directly              │
│ • Complex but powerful  │     │ • Simpler, faster       │
│ • Needs 4 models (!)    │     │ • Only needs 2 models   │
│                         │     │                         │
│ Classic method          │     │ Modern alternative      │
└─────────────────────────┘     └─────────────────────────┘

RLHF = Reinforcement Learning from Human Feedback
DPO = Direct Preference Optimization
PPO = Proximal Policy Optimization (the RL algorithm RLHF uses)

The end goal? A model that doesn’t just follow instructions, but follows them well. Helpful, harmless, and honest (as the saying goes).

Module Organization

Our code is split into clean modules. Each one handles a different stage of the pipeline.

Module     Purpose                          Key Functions
sft/       Supervised Fine-Tuning           SFTTrainer, format_instruction
reward/    Reward Model Training            RewardModel, RewardModelTrainer
rlhf/      RLHF with PPO                    PPOTrainer, ValueNetwork, RolloutBuffer
dpo/       Direct Preference Optimization   DPOTrainer, compute_dpo_loss
utils/     Shared Utilities                 load_model_and_tokenizer, setup_device

Think of these as separate kitchens in a restaurant. Each one specializes in a different course of the meal. You don’t make dessert where you’re grilling steaks (though I suppose you could...probably shouldn’t).

Data Formats

Each stage of the training pipeline speaks a different language. Not literally, but in terms of the data format it expects.

Let’s break them down.

SFT (Supervised Fine-Tuning) Data

Simple input-output pairs. Question and answer. Instruction and response.

{
    "instruction": "What is the capital of France?",
    "response": "The capital of France is Paris."
}

Dead simple. The model learns “when you see this format of question, generate this format of answer.”
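
One detail from the SFT box above is worth making concrete: we only train on the responses, not the instructions. Here’s a minimal sketch of how that masking is commonly done, assuming a Hugging Face tokenizer and the convention that label -100 is ignored by the cross-entropy loss (the exact formatting is up to the chat template):

# Sketch: compute loss only on the response tokens by masking the instruction.
# -100 is the ignore index used by PyTorch cross-entropy (and transformers' LM loss).
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

instruction = "What is the capital of France?\n"
response = "The capital of France is Paris."

prompt_ids = tokenizer(instruction)["input_ids"]
response_ids = tokenizer(response)["input_ids"]

input_ids = prompt_ids + response_ids
labels = [-100] * len(prompt_ids) + response_ids   # instruction masked, response trained

print(f"Total tokens: {len(input_ids)}, tokens contributing to the loss: {len(response_ids)}")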

Preference Data (for Reward Model & DPO)

Now it gets spicy. Instead of just one answer, we show the model two answers to the same prompt. One good, one bad.

{
    "prompt": "Explain quantum computing simply.",
    "chosen": "Imagine a coin spinning in the air—it's both heads and tails until it lands. Quantum computers work with information in that 'spinning' state, processing multiple possibilities simultaneously.",
    "rejected": "Quantum computers use qubits which leverage quantum superposition and entanglement to perform computations exponentially faster than classical computers by exploiting quantum mechanical phenomena."
}

See the difference? The “chosen” response is simple, clear, uses an analogy. The “rejected” one? Technically accurate but sounds like it swallowed a physics textbook.

The model learns: “When comparing these two, rank the first one higher.”
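
That “rank the first one higher” objective is the Bradley-Terry loss mentioned in the diagram. A minimal PyTorch sketch, with made-up scores standing in for real reward model outputs:

import torch
import torch.nn.functional as F

# Bradley-Terry pairwise loss: push the chosen score above the rejected score.
# The scores below are placeholders for reward model outputs.
chosen_scores = torch.tensor([1.2, 0.3])
rejected_scores = torch.tensor([0.4, 0.9])

# loss = -log(sigmoid(chosen - rejected)), averaged over the batch
loss = -F.logsigmoid(chosen_scores - rejected_scores).mean()
print(f"Pairwise ranking loss: {loss.item():.4f}")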

Prompt Data (for RLHF)

Once we have a reward model, we can just give it prompts and let it generate responses, then score them.

{
    "prompt": "Write a haiku about programming."
}

The model generates completions, the reward model scores them, and we use those scores to improve the policy. Rinse and repeat.

(We’ll see this in action much later.)

# Let's make these concrete with actual Python data structures

# SFT data: instruction-response pairs
sft_data = [
    {
        "instruction": "What is Python?", 
        "response": "Python is a high-level programming language known for its readability and simplicity. It's great for beginners and powerful enough for experts."
    },
    {
        "instruction": "Translate 'hello' to French", 
        "response": "'Hello' in French is 'Bonjour'."
    }
]

# Preference data: one prompt, two competing responses
preference_data = [
    {
        "prompt": "Explain artificial intelligence briefly.",
        "chosen": "AI is technology that enables machines to simulate human intelligence—learning from experience, recognizing patterns, and making decisions.",
        "rejected": "AI."  # Technically correct but...useless
    }
]

# Let's look at what these actually contain
print("SFT Data Format:")
print(f"  Keys: {list(sft_data[0].keys())}")
print(f"  Example instruction: \"{sft_data[0]['instruction']}\"")
print()

print("Preference Data Format:")
print(f"  Keys: {list(preference_data[0].keys())}")
print(f"  Chosen response length: {len(preference_data[0]['chosen'])} chars")
print(f"  Rejected response length: {len(preference_data[0]['rejected'])} chars")
print()
print("(Notice how the rejected response is way shorter? Sometimes bad answers are just...lazy.)")
SFT Data Format:
  Keys: ['instruction', 'response']
  Example instruction: "What is Python?"

Preference Data Format:
  Keys: ['prompt', 'chosen', 'rejected']
  Chosen response length: 139 chars
  Rejected response length: 3 chars

(Notice how the rejected response is way shorter? Sometimes bad answers are just...lazy.)

Training Progression

So you’ve got a base model. Now what?

Here’s the typical path from “raw autocomplete” to “helpful assistant”:

Step        Input Model      Output Model    What Happens                    Training Time*
1. SFT      Base (GPT-2)     SFT Model       Learn to follow instructions    ~1 hour
2. Reward   SFT Model        Reward Model    Learn to judge quality          ~30 min
3a. RLHF    SFT Model + RM   RLHF Model      Optimize for high rewards       ~2 hours
3b. DPO     SFT Model        DPO Model       Optimize preferences directly   ~1 hour

*Approximate times for GPT-2 (124M params) on a single GPU. Your mileage may vary.

Notice step 3 splits? That’s the fork in the road. You can either:

  • Go the RLHF route: Train a reward model first, then use reinforcement learning (PPO) to optimize your policy against it. More complex, more moving parts, but this is what OpenAI used for GPT-4.

  • Go the DPO route: Skip the reward model entirely and optimize preferences directly. Simpler, faster, and honestly? Often just as good. This is the new hotness.

We’ll implement both so you understand the tradeoffs. (Because understanding > blindly following trends.)
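
To make “optimize preferences directly” concrete, here’s a minimal sketch of the DPO loss in PyTorch. The log-probabilities are placeholder numbers standing in for summed token log-probs from the policy and the frozen reference model:

import torch
import torch.nn.functional as F

# DPO loss sketch. Each value is the total log-prob of a response under a model;
# these are placeholders, not real model outputs.
policy_chosen_logp = torch.tensor([-12.0])
policy_rejected_logp = torch.tensor([-15.0])
ref_chosen_logp = torch.tensor([-13.0])
ref_rejected_logp = torch.tensor([-13.5])

beta = 0.1  # same beta that shows up in the DPO config later

# How much more the policy prefers chosen over rejected, relative to the reference
chosen_logratio = policy_chosen_logp - ref_chosen_logp
rejected_logratio = policy_rejected_logp - ref_rejected_logp

loss = -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
print(f"DPO loss: {loss.item():.4f}")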

Key Hyperparameters

Hyperparameters are the dials you turn to make training work. Each stage has different sweet spots.

Let me explain the reasoning behind these numbers (instead of just throwing them at you).

SFT (Supervised Fine-Tuning)

  • Learning rate: 2e-4 (that’s 0.0002)

    • Higher than pre-training! We’re making bigger updates because we’re teaching a new skill

    • But not too high or we’ll destroy what the model already knows

  • Batch size: 4-8

    • Small because we’re fine-tuning, not pre-training

    • Larger batches = more stable but more memory

  • Epochs: 3-5

    • A few passes through the data is usually enough

    • Too many and you overfit (model memorizes instead of generalizes)

Reward Model

  • Learning rate: 1e-5 (that’s 0.00001)

    • Much lower! Reward models are sensitive

    • We want careful, stable learning of the preference ranking

  • Batch size: 4 (but each sample has 2 sequences)

    • We’re comparing pairs, so effective batch size is 8 sequences

  • Epochs: 1

    • Just one pass! Reward models overfit easily

    • If you train too long, they memorize specific preferences instead of learning general quality

RLHF (with PPO)

  • Learning rate: 1e-6 (that’s 0.000001)

    • Tiny! RL is unstable, we need baby steps

    • Too high and training collapses (you’ll see divergence, mode collapse, gibberish)

  • KL coefficient: 0.1

    • This keeps the model close to the original SFT model

    • Prevents it from going off the rails chasing reward

  • PPO epochs: 4

    • How many times we update on each batch of rollouts

    • Classic PPO sweet spot

DPO (Direct Preference Optimization)

  • Learning rate: 1e-6

    • Same as RLHF—we’re doing preference learning, gotta be gentle

  • Beta (β): 0.1

    • Controls how strongly we optimize preferences

    • Higher = more aggressive, lower = more conservative

  • Epochs: 1-3

    • DPO is more stable than PPO, can train a bit longer

    • But still, don’t overdo it

The pattern? As we get further from standard supervised learning, we get more conservative. RL is temperamental.
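
Two of those dials (the KL coefficient and PPO’s clip ratio) are easier to see in code than in prose. A minimal sketch, assuming response log-probs, rewards, and advantages have already been computed (all numbers below are placeholders):

import torch

# Sketch of PPO's two guardrails in RLHF: the KL penalty and the clipped surrogate.
kl_coef, clip_ratio = 0.1, 0.2

logp_policy = torch.tensor([-10.0, -8.0])   # log-prob under the current policy
logp_ref = torch.tensor([-10.5, -9.0])      # log-prob under the frozen reference (SFT) model
logp_old = torch.tensor([-10.2, -8.1])      # log-prob under the policy that generated the rollout
reward = torch.tensor([0.7, 1.3])           # reward model scores
advantage = torch.tensor([0.2, 0.6])        # from the value network

# KL penalty: drifting away from the reference model costs reward
shaped_reward = reward - kl_coef * (logp_policy - logp_ref)

# Clipped surrogate: cap how far one update can push the policy
ratio = torch.exp(logp_policy - logp_old)
clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio)
policy_loss = -torch.min(ratio * advantage, clipped * advantage).mean()

print(f"Shaped rewards: {shaped_reward.tolist()}")
print(f"Policy loss: {policy_loss.item():.4f}")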

# Here are those configurations in code
# (so you can see them all in one place)

sft_config = {
    "learning_rate": 2e-4,      # 0.0002 - higher for teaching new skills
    "batch_size": 4,
    "num_epochs": 3,
    "max_length": 512,          # truncate long sequences here
    "warmup_steps": 100,        # gradually increase LR at start
}

reward_config = {
    "learning_rate": 1e-5,      # 0.00001 - much lower, reward models are sensitive
    "batch_size": 4,            # but remember: 2 sequences per sample!
    "num_epochs": 1,            # just one pass to avoid overfitting
    "max_length": 512,
}

ppo_config = {
    "learning_rate": 1e-6,      # 0.000001 - tiny! RL is unstable
    "batch_size": 4,
    "ppo_epochs": 4,            # how many times to update per rollout batch
    "kl_coef": 0.1,             # keeps us close to reference model
    "clip_ratio": 0.2,          # PPO clipping (prevents huge updates)
}

dpo_config = {
    "learning_rate": 1e-6,      # same as PPO
    "batch_size": 4,
    "num_epochs": 1,            # conservative - can go up to 3 if needed
    "beta": 0.1,                # preference optimization strength
}

print("Configuration summary:")
print(f"  SFT learning rate:    {sft_config['learning_rate']:.6f}")
print(f"  Reward learning rate: {reward_config['learning_rate']:.6f}")
print(f"  PPO learning rate:    {ppo_config['learning_rate']:.6f}")
print(f"  DPO learning rate:    {dpo_config['learning_rate']:.6f}")
print()
print("Notice the pattern? Learning rates get smaller as training gets trickier.")
Configuration summary:
  SFT learning rate:    0.000200
  Reward learning rate: 0.000010
  PPO learning rate:    0.000001
  DPO learning rate:    0.000001

Notice the pattern? Learning rates get smaller as training gets trickier.

Memory Considerations

Here’s the dirty secret about post-training: it’s expensive.

Not money expensive (well, also that), but memory expensive. Let me break down why.

Stage    Models in Memory   Memory Factor   What's Loaded
SFT      1 model            1x              Just the model we're training
Reward   1 model            1x              Just the reward model (but 2 sequences/batch)
RLHF     4 models           4x              Policy, value net, reward model, reference model
DPO      2 models           2x              Policy, reference model

See why RLHF is so painful? Four models in memory at once:

  1. Policy model - the one we’re actually training

  2. Value network - estimates expected future reward (RL thing)

  3. Reward model - scores our generations

  4. Reference model - the original SFT model we’re trying not to drift too far from

DPO cuts this in half by skipping the reward model and value network. Just policy + reference.
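
A rough back-of-envelope for GPT-2 (124M parameters) makes those factors tangible. This counts weights only and ignores gradients, optimizer states, and activations, so treat it as a floor, not a budget:

# Weights-only memory estimate for GPT-2 scale models (a lower bound).
params = 124_000_000
bytes_per_param_fp16 = 2
model_gb = params * bytes_per_param_fp16 / 1024**3

for stage, n_models in [("SFT", 1), ("Reward", 1), ("RLHF", 4), ("DPO", 2)]:
    print(f"{stage:6s}: {n_models} model(s) = {n_models * model_gb:.2f} GB of weights in fp16")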

How to fit this on a single GPU?

We’ve got tricks:

  • LoRA (Low-Rank Adaptation)

    • Instead of updating all parameters, we add small trainable adapters

    • Massively reduces memory for gradients and optimizer states

    • Like teaching someone by giving them a cheat sheet instead of rewriting their brain

  • Gradient checkpointing

    • Trade computation for memory

    • Recompute activations during backward pass instead of storing them

    • Slower but fits in VRAM

  • Mixed precision (fp16/bf16)

    • Use 16-bit floats instead of 32-bit

    • Cuts memory in half (roughly)

    • Modern GPUs are built for this

  • Gradient accumulation

    • Simulate larger batches by accumulating gradients over multiple steps

    • Doesn’t shrink peak memory itself, but lets a small (memory-friendly) batch stand in for a big, stable one

    • “I can’t lift 100 pounds at once, but I can make four trips with 25 pounds each”

Without these tricks? You’d need a data center. With them? You can do this on a consumer GPU.

(Well, for GPT-2 scale. If you want to fine-tune Llama-70B...get your credit card ready.)
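
To show how these tricks fit together, here’s a minimal sketch assuming the Hugging Face transformers and peft libraries, with a tiny dummy batch standing in for a real SFT dataloader. It’s an illustration of the ideas, not the project’s actual training loop:

# Sketch: LoRA + gradient checkpointing + mixed precision + gradient accumulation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Gradient checkpointing: recompute activations in the backward pass instead of storing them
model.gradient_checkpointing_enable()
model.enable_input_require_grads()  # lets checkpointing work when base weights are frozen

# LoRA: train small adapters instead of all 124M parameters
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05,
                         target_modules=["c_attn"], task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

device = "cuda" if torch.cuda.is_available() else "cpu"
model.to(device)

# Optimizer states only for the (few) trainable LoRA parameters
optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=2e-4)

# Dummy data so the sketch runs end to end; real training uses the SFT dataset
enc = tokenizer("What is Python?", return_tensors="pt")
dataloader = [{"input_ids": enc["input_ids"], "labels": enc["input_ids"]}] * 8

accumulation_steps = 4  # simulate a batch 4x larger than what fits per step

for step, batch in enumerate(dataloader):
    batch = {k: v.to(device) for k, v in batch.items()}
    # Mixed precision: run the forward/backward pass in bf16 where supported
    with torch.autocast(device_type=device, dtype=torch.bfloat16):
        loss = model(**batch).loss / accumulation_steps  # scale for accumulation
    loss.backward()
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()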

Next Steps

Alright, you’ve got the bird’s eye view of the entire pipeline.

We’re going from base model → instruction-following → preference-aligned. Three stages (or four, if you count the RLHF/DPO fork).

Now let’s get our hands dirty.

Next up: Supervised Fine-Tuning (SFT). We’ll teach our model to follow instructions, format responses properly, and actually be useful.

It’s the foundation everything else builds on. Get this right, and the rest flows naturally. Get it wrong, and...well, you’ll be debugging reward models for a week.

Let’s go.