The Complete Pipeline¶
Alright, so we’re going to build something like this:
┌─────────────────────────────────────────────────────────────────────┐
│ SUPERVISED FINE-TUNING (SFT) │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model to follow instructions │
│ │
│ • Show it examples: "When asked X, respond with Y" │
│ • Use chat templates to format conversations │
│ • Only train on the responses (not the questions) │
│ • LoRA keeps this efficient (we'll explain later) │
│ │
│ Analogy: Teaching someone the *format* of good answers │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ REWARD MODELING │
├─────────────────────────────────────────────────────────────────────┤
│ Teaching the model what "good" means │
│ │
│ • Show it pairs: "This answer is better than that one" │
│ • Train it to score responses (higher = better) │
│ • Bradley-Terry loss (fancy ranking math) │
│ • Evaluation: does it rank things the way humans would? │
│ │
│ Analogy: Teaching a judge to score gymnastics routines │
└─────────────────────────────────────────────────────────────────────┘
│
┌───────────────┴───────────────┐
▼ ▼
┌─────────────────────────┐ ┌─────────────────────────┐
│ RLHF │ │ DPO │
├─────────────────────────┤ ├─────────────────────────┤
│ Two-stage approach │ │ One-stage shortcut │
│ │ │ │
│ • Train reward model │ │ • Skip reward model │
│ • Use it to train │ │ • Optimize preferences │
│ policy with PPO │ │ directly │
│ • Complex but powerful │ │ • Simpler, faster │
│ • Needs 4 models (!) │ │ • Only needs 2 models │
│ │ │ │
│ Classic method │ │ Modern alternative │
└─────────────────────────┘ └─────────────────────────┘
RLHF = Reinforcement Learning from Human Feedback
DPO = Direct Preference Optimization
PPO = Proximal Policy Optimization (the RL algorithm RLHF uses)
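One detail from the SFT box worth making concrete right away: "only train on the responses" usually means masking the prompt tokens out of the loss. Here's a minimal sketch of that idea; the token ids are made up, and -100 is PyTorch's standard ignore index for cross-entropy.
import torch
import torch.nn.functional as F

# Made-up token ids: the first prompt_len positions are the question, the rest the answer
prompt_len = 5
input_ids = torch.tensor([[101, 2054, 2003, 1029, 102, 3000, 2003, 1996, 3007, 102]])

# Labels are a copy of the inputs with prompt positions set to -100 (PyTorch's ignore
# index), so only response tokens contribute to the loss.
labels = input_ids.clone()
labels[:, :prompt_len] = -100

# Stand-in for model output: logits of shape (batch, seq_len, vocab_size)
vocab_size = 50257
logits = torch.randn(1, input_ids.shape[1], vocab_size)

loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab_size),   # predict the next token at each position
    labels[:, 1:].reshape(-1),                # shifted targets, prompt masked out
    ignore_index=-100,
)
print(f"Response-only loss: {loss.item():.4f}")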
Module Organization¶
Our code is split into clean modules. Each one handles a different stage of the pipeline.
| Module | Purpose | Key Functions |
|---|---|---|
| sft/ | Supervised Fine-Tuning | SFTTrainer, format_instruction |
| reward/ | Reward Model Training | RewardModel, RewardModelTrainer |
| rlhf/ | RLHF with PPO | PPOTrainer, ValueNetwork, RolloutBuffer |
| dpo/ | Direct Preference Optimization | DPOTrainer, compute_dpo_loss |
| utils/ | Shared Utilities | load_model_and_tokenizer, setup_device |
Think of these as separate kitchens in a restaurant. Each one specializes in a different course of the meal.
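If you want a mental picture of how this maps to code, the imports would look roughly like this. The names come straight from the table above; treat the exact paths and layout as illustrative.
# Illustrative import layout based on the table above (exact paths may differ)
from utils import load_model_and_tokenizer, setup_device   # shared helpers
from sft import SFTTrainer, format_instruction              # stage 1: SFT
from reward import RewardModel, RewardModelTrainer          # stage 2: reward model
from rlhf import PPOTrainer, ValueNetwork, RolloutBuffer    # stage 3a: RLHF
from dpo import DPOTrainer, compute_dpo_loss                # stage 3b: DPO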
Data Formats¶
Each training stage expects a different data format.
SFT (Supervised Fine-Tuning) Data¶
Simple input-output pairs. Question and answer. Instruction and response.
{
"instruction": "What is the capital of France?",
"response": "The capital of France is Paris."
}
The model learns "when you see this format of question, generate this format of answer."
Preference Data (for Reward Model & DPO)¶
Instead of just one answer, we show the model two answers to the same prompt. One good, one bad.
{
"prompt": "Explain quantum computing simply.",
"chosen": "Imagine a coin spinning in the air. It's both heads and tails until it lands. Quantum computers work with information in that 'spinning' state, processing multiple possibilities simultaneously.",
"rejected": "Quantum computers use qubits which leverage quantum superposition and entanglement to perform computations exponentially faster than classical computers by exploiting quantum mechanical phenomena."
}
The "chosen" response is simple, clear, uses an analogy. The "rejected" one is technically accurate but sounds like it swallowed a physics textbook.
The model learns: “When comparing these two, rank the first one higher.”
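Peeking ahead at the math: the reward model turns this ranking into a training signal with the Bradley-Terry loss from the diagram. Here's a minimal sketch with made-up reward values (a real reward model would produce these scores):
import torch
import torch.nn.functional as F

# Made-up scalar rewards a reward model might assign to each response
reward_chosen = torch.tensor([1.3])
reward_rejected = torch.tensor([-0.4])

# Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
# Minimizing it pushes the chosen score above the rejected score.
loss = -F.logsigmoid(reward_chosen - reward_rejected).mean()
print(f"Pairwise ranking loss: {loss.item():.4f}")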
Prompt Data (for RLHF)¶
Once we have a reward model, we can just give it prompts and let it generate responses, then score them.
{
"prompt": "Write a haiku about programming."
}
The model generates completions, the reward model scores them, and we use those scores to improve the policy.
# Let's make these concrete with actual Python data structures
# SFT data: instruction-response pairs
sft_data = [
{
"instruction": "What is Python?",
"response": "Python is a high-level programming language known for its readability and simplicity. It's great for beginners and powerful enough for experts."
},
{
"instruction": "Translate 'hello' to French",
"response": "'Hello' in French is 'Bonjour'."
}
]
# Preference data: one prompt, two competing responses
preference_data = [
{
"prompt": "Explain artificial intelligence briefly.",
"chosen": "AI is technology that enables machines to simulate human intelligence. Learning from experience, recognizing patterns, and making decisions.",
"rejected": "AI." # Technically correct but useless
}
]
# Let's look at what these actually contain
print("SFT Data Format:")
print(f" Keys: {list(sft_data[0].keys())}")
print(f" Example instruction: \"{sft_data[0]['instruction']}\"")
print()
print("Preference Data Format:")
print(f" Keys: {list(preference_data[0].keys())}")
print(f" Chosen response length: {len(preference_data[0]['chosen'])} chars")
print(f" Rejected response length: {len(preference_data[0]['rejected'])} chars")SFT Data Format:
Keys: ['instruction', 'response']
Example instruction: "What is Python?"
Preference Data Format:
Keys: ['prompt', 'chosen', 'rejected']
Chosen response length: 140 chars
Rejected response length: 3 chars
Training Progression¶
Here’s the typical path from “raw autocomplete” to “helpful assistant”:
| Step | Input Model | Output Model | What Happens | Training Time* |
|---|---|---|---|---|
| 1. SFT | Base (GPT-2) | SFT Model | Learn to follow instructions | ~1 hour |
| 2. Reward | SFT Model | Reward Model | Learn to judge quality | ~30 min |
| 3a. RLHF | SFT Model + RM | RLHF Model | Optimize for high rewards | ~2 hours |
| 3b. DPO | SFT Model | DPO Model | Optimize preferences directly | ~1 hour |
*Approximate times for GPT-2 (124M params) on a single GPU. Your mileage may vary.
Notice step 3 splits? That’s the fork in the road. You can either:
Go the RLHF route: Train a reward model first, then use reinforcement learning (PPO) to optimize your policy against it. More complex, more moving parts, but this is what OpenAI used for GPT-4.
Go the DPO route: Skip the reward model entirely and optimize preferences directly. Simpler, faster, and often just as good. This is more recent.
We’ll implement both so you understand the tradeoffs.
Key Hyperparameters¶
Hyperparameters are the dials you turn to make training work. Each stage has different sweet spots.
SFT (Supervised Fine-Tuning)¶
- Learning rate: 2e-4 (that's 0.0002)
  - Higher than pre-training! We're making bigger updates because we're teaching a new skill
  - But not too high, or we'll destroy what the model already knows
- Batch size: 4-8
  - Small because we're fine-tuning, not pre-training
  - Larger batches = more stable but more memory
- Epochs: 3-5
  - A few passes through the data is usually enough
  - Too many and you overfit (the model memorizes instead of generalizes)
Reward Model¶
- Learning rate: 1e-5 (that's 0.00001)
  - Much lower! Reward models are sensitive
  - We want careful, stable learning of the preference ranking
- Batch size: 4 (but each sample has 2 sequences)
  - We're comparing pairs, so the effective batch size is 8 sequences
- Epochs: 1
  - Just one pass! Reward models overfit easily
  - If you train too long, they memorize specific preferences instead of learning general quality
RLHF (with PPO)¶
- Learning rate: 1e-6 (that's 0.000001)
  - Tiny! RL is unstable, we need baby steps
  - Too high and training collapses (you'll see divergence, mode collapse, gibberish)
- KL coefficient: 0.1
  - This keeps the model close to the original SFT model
  - Prevents it from going off the rails chasing reward
- PPO epochs: 4
  - How many times we update on each batch of rollouts
  - Classic PPO sweet spot
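To see how the KL coefficient and PPO clipping interact, here's a toy single-batch sketch of the PPO objective with a KL penalty folded into the reward. The log-probabilities and rewards are made up, and the advantages are just mean-centered shaped rewards; the real implementation uses a value network for this.
import torch

# Toy tensors standing in for one batch of rollouts (all values are made up)
logp_new = torch.tensor([-1.0, -2.1, -0.7])    # log-probs under the current policy
logp_old = torch.tensor([-1.2, -2.0, -0.9])    # log-probs at rollout time
logp_ref = torch.tensor([-1.1, -2.3, -0.8])    # log-probs under the frozen SFT reference
reward = torch.tensor([0.5, 1.2, -0.3])         # reward model scores

kl_coef, clip_ratio = 0.1, 0.2

# KL penalty keeps the policy close to the reference model
shaped_reward = reward - kl_coef * (logp_new - logp_ref)

# PPO clipped surrogate objective (advantages = mean-centered shaped rewards, for simplicity)
advantage = shaped_reward - shaped_reward.mean()
ratio = torch.exp(logp_new - logp_old)
unclipped = ratio * advantage
clipped = torch.clamp(ratio, 1 - clip_ratio, 1 + clip_ratio) * advantage
policy_loss = -torch.min(unclipped, clipped).mean()
print(f"PPO policy loss: {policy_loss.item():.4f}")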
DPO (Direct Preference Optimization)¶
- Learning rate: 1e-6
  - Same as RLHF. We're doing preference learning, gotta be gentle
- Beta (β): 0.1
  - Controls how strongly we optimize preferences
  - Higher = more aggressive, lower = more conservative
- Epochs: 1-3
  - DPO is more stable than PPO, so you can train a bit longer
  - But still, don't overdo it
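And here's what beta actually does, in a minimal sketch of the DPO loss with made-up log-probabilities (summed over each response). Larger beta amplifies the gap between the policy's preference margin and the reference model's.
import torch
import torch.nn.functional as F

beta = 0.1

# Made-up summed log-probs of the chosen/rejected responses under each model
policy_chosen, policy_rejected = torch.tensor([-20.0]), torch.tensor([-24.0])
ref_chosen, ref_rejected = torch.tensor([-21.0]), torch.tensor([-23.0])

# DPO loss: -log sigmoid( beta * (policy log-ratio - reference log-ratio) )
pi_logratio = policy_chosen - policy_rejected
ref_logratio = ref_chosen - ref_rejected
loss = -F.logsigmoid(beta * (pi_logratio - ref_logratio)).mean()
print(f"DPO loss: {loss.item():.4f}")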
# Here are those configurations in code
# (so you can see them all in one place)
sft_config = {
"learning_rate": 2e-4, # 0.0002 - higher for teaching new skills
"batch_size": 4,
"num_epochs": 3,
"max_length": 512, # truncate long sequences here
"warmup_steps": 100, # gradually increase LR at start
}
reward_config = {
"learning_rate": 1e-5, # 0.00001 - much lower, reward models are sensitive
"batch_size": 4, # but remember: 2 sequences per sample!
"num_epochs": 1, # just one pass to avoid overfitting
"max_length": 512,
}
ppo_config = {
"learning_rate": 1e-6, # 0.000001 - tiny! RL is unstable
"batch_size": 4,
"ppo_epochs": 4, # how many times to update per rollout batch
"kl_coef": 0.1, # keeps us close to reference model
"clip_ratio": 0.2, # PPO clipping (prevents huge updates)
}
dpo_config = {
"learning_rate": 1e-6, # same as PPO
"batch_size": 4,
"num_epochs": 1, # conservative - can go up to 3 if needed
"beta": 0.1, # preference optimization strength
}
print("Configuration summary:")
print(f" SFT learning rate: {sft_config['learning_rate']:.6f}")
print(f" Reward learning rate: {reward_config['learning_rate']:.6f}")
print(f" PPO learning rate: {ppo_config['learning_rate']:.6f}")
print(f" DPO learning rate: {dpo_config['learning_rate']:.6f}")
print()
print("Notice the pattern? Learning rates get smaller as training gets trickier.")Configuration summary:
SFT learning rate: 0.000200
Reward learning rate: 0.000010
PPO learning rate: 0.000001
DPO learning rate: 0.000001
Notice the pattern? Learning rates get smaller as training gets trickier.
Memory Considerations¶
Post-training is expensive. In both money and memory.
| Stage | Models in Memory | Memory Factor | What’s Loaded |
|---|---|---|---|
| SFT | 1 model | 1x | Just the model we’re training |
| Reward | 1 model | 1x | Just the reward model (but 2 sequences/batch) |
| RLHF | 4 models | 4x | Policy, value net, reward model, reference model |
| DPO | 2 models | 2x | Policy, reference model |
RLHF requires four models in memory at once:
1. Policy model - the one we're actually training
2. Value network - estimates expected future reward (RL thing)
3. Reward model - scores our generations
4. Reference model - the original SFT model we're trying not to drift too far from
DPO cuts this in half by skipping the reward model and value network. Just policy + reference.
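A quick back-of-the-envelope calculation makes the table concrete. This sketch assumes the common 16-bytes-per-parameter rule of thumb for mixed-precision Adam training, full-size copies of every model (in practice the value network often shares a backbone), and ignores activation memory entirely.
# Rough memory estimate per model copy (rule-of-thumb numbers, activations not included)
params = 124e6                     # GPT-2 small

train_bytes = 16 * params          # fp16 weights + grads + fp32 master weights + Adam states
frozen_bytes = 2 * params          # frozen models only need fp16 weights

def gb(num_bytes):
    return num_bytes / 1e9

print(f"SFT   (1 trainable):             {gb(train_bytes):.1f} GB")
print(f"DPO   (1 trainable + 1 frozen):  {gb(train_bytes + frozen_bytes):.1f} GB")
print(f"RLHF  (2 trainable + 2 frozen):  {gb(2*train_bytes + 2*frozen_bytes):.1f} GB")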
How to fit this on a single GPU?¶
We've got tricks (sketched in code right after this list):
- LoRA (Low-Rank Adaptation)
  - Instead of updating all parameters, we add small trainable adapters
  - Massively reduces memory for gradients and optimizer states
  - Like teaching someone by giving them a cheat sheet instead of rewriting their brain
- Gradient checkpointing
  - Trade computation for memory
  - Recompute activations during the backward pass instead of storing them
  - Slower, but it fits in VRAM
- Mixed precision (fp16/bf16)
  - Use 16-bit floats instead of 32-bit
  - Cuts memory roughly in half
  - Modern GPUs are built for this
- Gradient accumulation
  - Simulate larger batches by accumulating gradients over multiple steps
  - Doesn't reduce peak memory, but improves training stability
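Here's a minimal sketch of how these four tricks are typically wired up with Hugging Face transformers and peft. The LoRA settings, learning rate, and dummy batch are illustrative, not the exact configuration we'll use later.
import torch
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

# Mixed precision: load the base model in 16-bit (bf16) instead of fp32
model = AutoModelForCausalLM.from_pretrained("gpt2", torch_dtype=torch.bfloat16)

# LoRA: freeze the base weights and add small trainable adapter matrices
lora_config = LoraConfig(r=8, lora_alpha=16, lora_dropout=0.05, target_modules=["c_attn"])
model = get_peft_model(model, lora_config)

# Gradient checkpointing: recompute activations during backward instead of storing them
model.enable_input_require_grads()     # lets checkpointed blocks receive gradients with LoRA
model.gradient_checkpointing_enable()

# Gradient accumulation: take one optimizer step every N micro-batches
accumulation_steps = 8
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-4)
dummy_batch = {"input_ids": torch.randint(0, 50257, (2, 16))}
dummy_batch["labels"] = dummy_batch["input_ids"].clone()   # stand-in for real SFT data

for step in range(accumulation_steps):
    loss = model(**dummy_batch).loss / accumulation_steps
    loss.backward()
optimizer.step()
optimizer.zero_grad()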
Without these tricks? You'd need a data center. With them? You can run the whole pipeline at GPT-2 scale on a single GPU.
Next Steps¶
You’ve got the bird’s eye view of the entire pipeline.
We’re going from base model → instruction-following → preference-aligned. Three stages (or four, if you count the RLHF/DPO fork).
Next up: Supervised Fine-Tuning (SFT). We’ll teach our model to follow instructions, format responses properly, and actually be useful.
It’s the foundation everything else builds on.
