
When Everything Goes Wrong

You’re going to break things. We all do.

The question is: can you fix them quickly, or will you spend three days hunting a bug that turns out to be a single misplaced -100?

(I’ve done the latter. Multiple times. Let’s save you from that.)

The Pattern

Here’s what always happens:

You start training. Everything looks fine. Loss is going down. You grab coffee.

You come back. Loss is nan. Or infinity. Or stuck at 2.45 for 800 steps. Or worse — it’s going down beautifully, but your model now thinks Paris is the capital of diabetes.

This notebook is your debugging playbook. Each pitfall follows a pattern:

The Story: What went wrong (because context matters)
What Happened: Why it actually happens
How to Spot It: The symptoms that give it away
How to Fix It: What to do about it

Ready? Let’s break some models.

(And then fix them.)

Pitfall 1: The NaN Death Spiral

The Story:

It’s 2am. You’ve been training for three hours. Loss started at 2.5, dropped to 1.2, everything’s beautiful.

Then:

Step 1840: Loss = 1.18
Step 1841: Loss = 1.15
Step 1842: Loss = 3.47   <- uh oh
Step 1843: Loss = inf    <- UH OH
Step 1844: Loss = nan    <- dead
Step 1845: Loss = nan    <- still dead

Your model is toast. Can’t recover. Have to restart from the last checkpoint.

(If you saved checkpoints. You did save checkpoints, right?)

What Happened:

Something caused a gradient to explode. Maybe one batch had some weird tokens. Maybe the learning rate was too aggressive. Maybe you’re using FP16 and hit numerical limits.

Doesn’t matter. Once you get a NaN gradient, it infects everything it touches. Like a zombie virus for tensors.

How to Spot It:

The pattern is always the same: loss starts normal, maybe even improving, then suddenly jumps to infinity, then NaN. Sometimes you get warning signs (loss spiking but recovering), sometimes it just dies.

import torch
import torch.nn as nn

# Here's how to check for NaN gradients before they kill your training
def check_for_nan_gradients(model):
    """Find which parameters have NaN gradients (if any)."""
    has_nan = False
    nan_params = []
    
    for name, param in model.named_parameters():
        if param.grad is not None:
            if torch.isnan(param.grad).any():
                nan_params.append(name)
                has_nan = True
    
    return has_nan, nan_params

# Let's demonstrate this with a simple model
class SimpleModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear1 = nn.Linear(10, 5)
        self.linear2 = nn.Linear(5, 1)
    
    def forward(self, x):
        return self.linear2(torch.relu(self.linear1(x)))

model = SimpleModel()

print("Testing NaN Detection")
print("=" * 60)

# Test 1: Normal healthy gradients
x = torch.randn(4, 10)
y = model(x)
y.sum().backward()

has_nan, nan_params = check_for_nan_gradients(model)
print(f"\nHealthy gradients:")
print(f"  Has NaN? {has_nan}")
print(f"  Which params? {nan_params if nan_params else 'None - all good!'}")

# Test 2: Now let's inject a NaN and see it get caught
model.zero_grad()
y = model(x)
y.sum().backward()
model.linear1.weight.grad[0, 0] = float('nan')  # Simulate NaN

has_nan, nan_params = check_for_nan_gradients(model)
print(f"\nAfter injecting NaN:")
print(f"  Has NaN? {has_nan}")
print(f"  Which params? {nan_params}")
print(f"  ^ This is what you'd see right before your training dies")
Testing NaN Detection
============================================================

Healthy gradients:
  Has NaN? False
  Which params? None - all good!

After injecting NaN:
  Has NaN? True
  Which params? ['linear1.weight']
  ^ This is what you'd see right before your training dies

How to Fix NaN Loss:

  1. Reduce learning rate (try 10x smaller)

  2. Add gradient clipping: max_grad_norm=1.0

  3. Switch from FP16 to BF16 (more stable)

  4. Add warmup (gradual LR increase)

Important: Once you hit NaN, you MUST restart from the last checkpoint. NaN is terminal. No recovery.

(This is why you checkpoint frequently.)
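
To make those fixes concrete, here's a minimal sketch (mine, not from any particular trainer) that wires a smaller learning rate, linear warmup, gradient clipping, and the check_for_nan_gradients helper from above into a bare PyTorch loop. If you use the Hugging Face Trainer, the equivalent knobs are max_grad_norm, warmup_steps, and bf16=True; the specific values below are illustrative.

model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)  # 10x smaller than 1e-4

warmup_steps = 100
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    lr_lambda=lambda step: min(1.0, (step + 1) / warmup_steps)  # linear warmup
)

for step in range(5):  # tiny demo loop
    x = torch.randn(4, 10)
    loss = model(x).pow(2).mean()   # stand-in for a real loss

    optimizer.zero_grad()
    loss.backward()

    # Gradient clipping: one bad batch can't blow up the update
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    # Bail out before a NaN gradient poisons the weights
    has_nan, nan_params = check_for_nan_gradients(model)
    if has_nan:
        print(f"NaN gradients in {nan_params} at step {step} - stopping")
        break

    optimizer.step()
    scheduler.step()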

Pitfall 2: The Frozen Model Mystery

The Story:

Your training loop runs. No errors. Loss is being logged. Everything looks fine.

Except... the loss isn’t moving. At all.

Step 100: Loss = 2.4532
Step 200: Loss = 2.4531
Step 300: Loss = 2.4529
Step 400: Loss = 2.4528

That’s not learning. That’s rounding error.

You check your learning rate: 1e-4. Seems fine.
You check your data: looks good.
You check your sanity: questionable, but unrelated.

Then you finally check: sum(p.numel() for p in model.parameters() if p.requires_grad)

Returns: 0

Oh.

What Happened:

Somewhere in your setup, you froze the model. Maybe you loaded a pretrained model and forgot to unfreeze it. Maybe you disabled gradients for inference and never re-enabled them. Maybe you applied LoRA but something went wrong.

Doesn’t matter. If requires_grad=False for all parameters, you’re not training anything. You’re just... running a very expensive random number generator.

How to Spot It:

Loss that barely moves (or moves identically every epoch). Model outputs that never change. That sinking feeling when you realize you’ve been “training” for six hours.

def verify_training_setup(model, optimizer):
    """Check if your model is actually set up to train."""
    issues = []
    
    # Count trainable vs frozen parameters
    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    total = sum(p.numel() for p in model.parameters())
    
    if trainable == 0:
        issues.append("CRITICAL: No trainable parameters! Model is completely frozen.")
    
    # Check optimizer configuration
    if len(optimizer.param_groups) == 0:
        issues.append("CRITICAL: Optimizer has no parameter groups!")
    else:
        lr = optimizer.param_groups[0]['lr']
        if lr < 1e-6:
            issues.append(f"WARNING: Learning rate very low: {lr}")
        if lr > 1e-2:
            issues.append(f"WARNING: Learning rate very high: {lr} (may cause NaN)")
    
    return {
        'trainable_params': trainable,
        'total_params': total,
        'trainable_pct': 100 * trainable / total if total > 0 else 0,
        'learning_rate': optimizer.param_groups[0]['lr'] if optimizer.param_groups else None,
        'issues': issues,
        'ok': len(issues) == 0
    }

print("Training Setup Verification")
print("=" * 60)

# Good setup: model is trainable
model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

result = verify_training_setup(model, optimizer)
print(f"\nSetup 1: Normal configuration")
print(f"  Trainable params: {result['trainable_params']:,} ({result['trainable_pct']:.1f}%)")
print(f"  Learning rate: {result['learning_rate']}")
print(f"  Status: {'✓ Good to go!' if result['ok'] else 'Problems detected'}")
if result['issues']:
    for issue in result['issues']:
        print(f"    - {issue}")

# Bad setup: accidentally froze everything
frozen_model = SimpleModel()
for param in frozen_model.parameters():
    param.requires_grad = False

result = verify_training_setup(frozen_model, optimizer)
print(f"\nSetup 2: Frozen model (common mistake)")
print(f"  Trainable params: {result['trainable_params']:,} ({result['trainable_pct']:.1f}%)")
print(f"  Status: {'✓ Good to go!' if result['ok'] else '✗ Problems detected'}")
if result['issues']:
    for issue in result['issues']:
        print(f"    - {issue}")
Training Setup Verification
============================================================

Setup 1: Normal configuration
  Trainable params: 61 (100.0%)
  Learning rate: 0.0001
  Status: ✓ Good to go!

Setup 2: Frozen model (common mistake)
  Trainable params: 0 (0.0%)
  Status: ✗ Problems detected
    - CRITICAL: No trainable parameters! Model is completely frozen.

How to Fix Frozen Model:

  1. Check: model.parameters() should have requires_grad=True

  2. For LoRA: verify LoRA adapter was applied correctly

  3. For full fine-tuning: don’t freeze anything

  4. If using PEFT: call prepare_model_for_kbit_training()

Always run this check before training starts.
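
For completeness, here's a small sketch of the fix itself, assuming plain PyTorch full fine-tuning (for LoRA/PEFT you'd instead verify the adapter, e.g. with PEFT's print_trainable_parameters()): unfreeze the parameters, rebuild the optimizer so it actually sees them, and re-run the check above.

frozen_model = SimpleModel()
for param in frozen_model.parameters():
    param.requires_grad = False          # the bug

# The fix (full fine-tuning): turn gradients back on
for param in frozen_model.parameters():
    param.requires_grad = True

# Rebuild the optimizer AFTER unfreezing so it sees the trainable parameters
optimizer = torch.optim.AdamW(
    [p for p in frozen_model.parameters() if p.requires_grad], lr=1e-4
)

result = verify_training_setup(frozen_model, optimizer)
print(f"Trainable params after unfreezing: {result['trainable_params']:,}")
print(f"Status: {'✓ Good to go!' if result['ok'] else result['issues']}")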

Pitfall 3: Overfitting

def check_overfitting(train_losses, val_losses, threshold=0.5):
    """
    Analyze train/val loss to detect overfitting.

    Think of this as your "stop training now" alarm.
    """
    gaps = [val - train for train, val in zip(train_losses, val_losses)]

    # Is the gap growing beyond healthy range?
    is_overfitting = len(gaps) > 1 and gaps[-1] > gaps[0] + threshold

    # Classic overfitting pattern: val up, train down
    val_increasing = len(val_losses) > 1 and val_losses[-1] > val_losses[-2]
    train_decreasing = len(train_losses) > 1 and train_losses[-1] < train_losses[-2]
    classic_overfit = val_increasing and train_decreasing

    return {
        'gaps': gaps,
        'final_gap': gaps[-1] if gaps else 0,
        'is_overfitting': is_overfitting,
        'classic_pattern': classic_overfit,
        'recommendation': 'STOP TRAINING!' if is_overfitting else 'Keep going'
    }

print("Overfitting Detection")
print("=" * 60)

# Scenario 1: Healthy training
print("\nScenario 1: Healthy training")
print("(Both train and val improving together)")
healthy_train = [2.5, 2.0, 1.6, 1.3, 1.1]
healthy_val = [2.6, 2.1, 1.7, 1.4, 1.2]

result = check_overfitting(healthy_train, healthy_val)
print(f"  Train: {healthy_train}")
print(f"  Val:   {healthy_val}")
print(f"  Gaps:  {[f'{g:.1f}' for g in result['gaps']]}")
print(f"  Overfitting? {result['is_overfitting']}")
print(f"  → {result['recommendation']}")

# Scenario 2: Overfitting disaster
print("\nScenario 2: Overfitting (train improving, val getting worse)")
overfit_train = [2.5, 1.8, 1.2, 0.8, 0.5]
overfit_val = [2.6, 2.0, 2.0, 2.2, 2.5]

result = check_overfitting(overfit_train, overfit_val)
print(f"  Train: {overfit_train}")
print(f"  Val:   {overfit_val}")
print(f"  Gaps:  {[f'{g:.1f}' for g in result['gaps']]}")
print(f"  Overfitting? {result['is_overfitting']}")
print(f"  Classic pattern? {result['classic_pattern']}")
print(f"  → {result['recommendation']}")
print()
print("  ^ See how the gap keeps growing? Model is memorizing,")
print("    not learning. Should have stopped at epoch 2.")

How to Fix Overfitting:

Prevention (do these first):

  • Add regularization (weight_decay=0.1)

  • Use dropout (lora_dropout=0.1)

  • Lower LoRA rank

  • Get more training data

Reaction (when it happens):

  • Stop training immediately

  • Use checkpoint from before overfitting started

  • Reduce number of epochs for next run

Always monitor both train AND val loss.
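
As a concrete version of "stop training immediately", here's a minimal early-stopping sketch built on the check_overfitting helper above. The patience value and the simulated losses are made up for illustration; real code would save and reload an actual checkpoint.

best_val, best_epoch, patience, bad_epochs = float('inf'), 0, 2, 0
train_losses, val_losses = [], []

# Simulated per-epoch losses: train keeps dropping, val turns around at epoch 2
simulated = [(2.5, 2.6), (1.8, 2.0), (1.2, 2.0), (0.8, 2.2), (0.5, 2.5)]

for epoch, (train_loss, val_loss) in enumerate(simulated):
    train_losses.append(train_loss)
    val_losses.append(val_loss)

    if val_loss < best_val:
        best_val, best_epoch, bad_epochs = val_loss, epoch, 0
        # torch.save(model.state_dict(), "best.pt")  # keep the best checkpoint
    else:
        bad_epochs += 1

    status = check_overfitting(train_losses, val_losses)
    print(f"Epoch {epoch}: train={train_loss:.2f} val={val_loss:.2f} -> {status['recommendation']}")

    if bad_epochs >= patience:
        print(f"Stopping early - best was epoch {best_epoch} (val={best_val:.2f})")
        break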

Pitfall 4: Catastrophic Forgetting

def evaluate_general_knowledge(model, tokenizer, test_cases):
    """
    Check if model still has basic general knowledge.
    
    You'd run this before and after fine-tuning to detect forgetting.
    (Here we just demonstrate the evaluation logic.)
    """
    results = []
    
    for prompt, expected_keywords in test_cases:
        # In real life: response = generate_from_model(model, tokenizer, prompt)
        # Here we simulate to show the concept
        response = f"[Would generate response for: {prompt}]"
        
        # Check if response contains expected answer keywords
        passed = any(kw.lower() in response.lower() for kw in expected_keywords)
        results.append({
            'prompt': prompt,
            'expected': expected_keywords,
            'passed': passed
        })
    
    accuracy = sum(r['passed'] for r in results) / len(results) if results else 0
    return accuracy, results

print("Catastrophic Forgetting Detection")
print("=" * 60)

# These are questions any language model should be able to answer
general_knowledge_tests = [
    ("What is 2 + 2?", ["4", "four"]),
    ("Who wrote Romeo and Juliet?", ["Shakespeare", "William"]),
    ("What is the capital of France?", ["Paris"]),
    ("What is water made of?", ["H2O", "hydrogen", "oxygen"]),
    ("What year did World War 2 end?", ["1945"]),
]

print("\nGeneral knowledge sanity checks:")
for i, (prompt, expected) in enumerate(general_knowledge_tests, 1):
    print(f"  {i}. {prompt}")
    print(f"     Expected: {', '.join(expected)}")

print(f"\n" + "-" * 60)
print("Example: Medical model that forgot everything else")
print()
print("Before fine-tuning:")
print("  Q: What is the capital of France?")
print("  A: The capital of France is Paris.")
print("  ✓ Correct")
print()
print("After aggressive fine-tuning on medical data:")
print("  Q: What is the capital of France?")
print("  A: The capital of France is a common symptom associated")
print("     with acute respiratory distress syndrome...")
print("  ✗ Model only speaks medical now")
print()
print("After fine-tuning with LoRA (less aggressive):")
print("  Q: What is the capital of France?")
print("  A: The capital of France is Paris.")
print("  ✓ Preserved general knowledge!")
Catastrophic Forgetting Detection
============================================================

General knowledge sanity checks:
  1. What is 2 + 2?
     Expected: 4, four
  2. Who wrote Romeo and Juliet?
     Expected: Shakespeare, William
  3. What is the capital of France?
     Expected: Paris
  4. What is water made of?
     Expected: H2O, hydrogen, oxygen
  5. What year did World War 2 end?
     Expected: 1945

------------------------------------------------------------
Example: Medical model that forgot everything else

Before fine-tuning:
  Q: What is the capital of France?
  A: The capital of France is Paris.
  ✓ Correct

After aggressive fine-tuning on medical data:
  Q: What is the capital of France?
  A: The capital of France is a common symptom associated
     with acute respiratory distress syndrome...
  ✗ Model only speaks medical now

After fine-tuning with LoRA (less aggressive):
  Q: What is the capital of France?
  A: The capital of France is Paris.
  ✓ Preserved general knowledge!

How to Prevent Catastrophic Forgetting:

  1. Use LoRA instead of full fine-tuning (only modifies small adapters, not whole model)

  2. Use lower learning rates (5e-5 instead of 1e-4 for full fine-tuning)

  3. Mix general data with specialized data (10-20% general examples in training set)

  4. Train for fewer epochs (stop when specialized performance plateaus)

  5. For DPO/RLHF: Use KL penalty (keeps model close to reference)

Always test: Run these checks before AND after training!
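
To illustrate point 3 (mixing general data back into the specialized set), here's a small sketch; the 15% ratio, the toy example counts, and the field names are illustrative, not prescriptive.

import random

specialized_data = [{"prompt": f"Medical question {i}", "response": "..."} for i in range(850)]
general_data = [{"prompt": f"General question {i}", "response": "..."} for i in range(5000)]

mix_ratio = 0.15  # aim for ~15% general examples in the final mix
n_general = int(len(specialized_data) * mix_ratio / (1 - mix_ratio))

random.seed(0)
mixed_dataset = specialized_data + random.sample(general_data, n_general)
random.shuffle(mixed_dataset)

print(f"Specialized: {len(specialized_data)}, General: {n_general}, "
      f"Total: {len(mixed_dataset)} ({n_general / len(mixed_dataset):.0%} general)")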

Pitfall 5: Runaway KL Divergence

import torch.nn.functional as F

def compute_kl_divergence(policy_logits, ref_logits):
    """
    Compute KL(policy || reference).

    This measures how different the policy model's predictions are
    from the reference model. High KL = big difference.
    """
    policy_probs = F.softmax(policy_logits, dim=-1)
    ref_log_probs = F.log_softmax(ref_logits, dim=-1)
    policy_log_probs = F.log_softmax(policy_logits, dim=-1)

    # KL divergence: sum of p * (log p - log q)
    kl = (policy_probs * (policy_log_probs - ref_log_probs)).sum(-1).mean()

    return kl.item()

def verify_reference_frozen(ref_model):
    """Check that reference model is actually frozen."""
    trainable = sum(1 for p in ref_model.parameters() if p.requires_grad)
    total = sum(1 for _ in ref_model.parameters())

    return {
        'is_frozen': trainable == 0,
        'trainable_params': trainable,
        'total_params': total
    }

print("KL Divergence Monitoring")
print("=" * 60)

# Simulate some model outputs
vocab_size = 1000
batch_size = 4
seq_len = 10

torch.manual_seed(42)  # For reproducibility
ref_logits = torch.randn(batch_size, seq_len, vocab_size)

print("\nKL Divergence Examples:")
print("(Lower KL = models are similar, Higher KL = models diverged)")

# Case 1: Models are identical
policy_identical = ref_logits.clone()
kl = compute_kl_divergence(policy_identical, ref_logits)
print(f"\n  1. Policy = Reference: KL = {kl:.6f}")
print(f"     ^ This is what you'd see at the very start of training")

# Case 2: Small difference (healthy)
policy_small_diff = ref_logits + 0.1 * torch.randn_like(ref_logits)
kl = compute_kl_divergence(policy_small_diff, ref_logits)
print(f"\n  2. Small divergence: KL = {kl:.6f}")
print(f"     ^ This is healthy - model is learning but staying close")

# Case 3: Large difference (problem!)
policy_large_diff = ref_logits + 2.0 * torch.randn_like(ref_logits)
kl = compute_kl_divergence(policy_large_diff, ref_logits)
print(f"\n  3. Large divergence: KL = {kl:.6f}")
print(f"     ^ This is bad - model has drifted too far")

print(f"\n" + "-" * 60)
print("Checking if Reference is Frozen:")

# Test 1: Properly frozen
frozen_model = SimpleModel()
for param in frozen_model.parameters():
    param.requires_grad = False

result = verify_reference_frozen(frozen_model)
print(f"\n  Correctly frozen reference:")
print(f"    Trainable: {result['trainable_params']}/{result['total_params']}")
print(f"    Status: {'✓ Good!' if result['is_frozen'] else '✗ Bug!'}")

# Test 2: Accidentally not frozen (common bug!)
unfrozen_model = SimpleModel()  # Oops, forgot to freeze

result = verify_reference_frozen(unfrozen_model)
print(f"\n  Accidentally unfrozen reference:")
print(f"    Trainable: {result['trainable_params']}/{result['total_params']}")
print(f"    Status: {'✓ Good!' if result['is_frozen'] else '✗ BUG - reference is being updated!'}")

How to Fix KL Divergence Problems:

  1. Freeze the reference model:

    for param in ref_model.parameters():
        param.requires_grad = False

  2. Increase beta (KL penalty strength) in DPO (try 0.1 → 0.5)

  3. Lower learning rate (try 1e-6 for DPO instead of 1e-5)

  4. Use gradient clipping

Remember: Some KL divergence is good (means learning). But too much means the policy has gone rogue.
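
For intuition on why increasing beta helps, here's a toy sketch of how a KL penalty enters the objective. This is not the actual DPO loss (which works on log-probability ratios of chosen/rejected pairs); it only shows how a larger beta makes drifting away from the reference more expensive.

policy_logits = ref_logits + 0.5 * torch.randn_like(ref_logits)

task_loss = 1.0   # stand-in for whatever loss you're optimizing
kl = compute_kl_divergence(policy_logits, ref_logits)

for beta in (0.1, 0.5):
    total_loss = task_loss + beta * kl
    print(f"beta={beta}: task={task_loss:.2f}, KL={kl:.3f}, total={total_loss:.2f}")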


Pitfall 6: The Loss Masking Bug

The Story:

Your model isn’t learning. At all. Loss is doing something weird.

You debug everything. Learning rate? Fine. Gradients? Fine. Data? Fine.

Then you print out your labels:

print(labels)
# Output: tensor([-100, -100, -100, -100, ...])  # All -100!

Oh.

See, in causal language modeling, we use -100 as the label for tokens we want to ignore in the loss. Typically the prompt tokens. We only compute loss on the response tokens.

But if ALL your labels are -100, you’re not computing loss on anything. The model has no training signal.

Or worse: maybe NONE of your labels are -100. So you’re training the model to predict the prompt tokens too. Which means it learns to generate prompts, not responses.

What Happened:

Your data processing pipeline messed up the loss masking. Maybe you:

  • Used the wrong tokenizer function

  • Forgot to set labels at all (defaults to -100)

  • Set labels incorrectly (no -100 where there should be)

  • Had an off-by-one error in where to start masking

This bug is silent and deadly. No error messages. Training runs fine. Model just doesn’t learn anything useful.

How to Spot It:

Model not learning? First thing to check: print out a few examples from your dataloader and verify the labels are partially -100 (for prompt) and partially real token IDs (for response).

def test_loss_masking(labels_list):
    """
    Verify that loss masking is set up correctly.
    
    Correct: Some -100 (prompt), some real IDs (response)
    Wrong: All -100 (no training signal) or no -100 (learns prompts)
    """
    results = []
    
    for i, labels in enumerate(labels_list):
        masked = sum(1 for l in labels if l == -100)
        unmasked = sum(1 for l in labels if l != -100)
        total = len(labels)
        
        # Diagnose issues
        issue = None
        if unmasked == 0:
            issue = "All masked - no training signal!"
        elif masked == 0:
            issue = "Nothing masked - will learn to repeat prompts"
        elif unmasked < 5:
            issue = "Very few response tokens - weak signal"
        elif masked < 3:
            issue = "Very few prompt tokens - might learn wrong pattern"
        
        results.append({
            'example': i,
            'masked': masked,
            'unmasked': unmasked,
            'total': total,
            'issue': issue
        })
    
    return results

print("Loss Masking Verification")
print("=" * 60)

# Remember: -100 = ignore in loss, other values = compute loss
print("\nWhat labels should look like:")
print("  [-100, -100, -100, 42, 17, 89, ...]")
print("   ^^^^^^^^^^^^^      ^^^^^^^^^^^^")
print("   prompt (masked)    response (unmasked)")

# Test different scenarios
scenarios = {
    "Correct": [-100, -100, -100, -100, -100, 42, 17, 89, 33, 55],
    "All masked (bug!)": [-100] * 10,
    "Nothing masked (bug!)": [42, 17, 89, 33, 55, 12, 78, 34, 91, 23],
    "Too few response tokens": [-100] * 8 + [42, 17],
}

print(f"\n" + "-" * 60)
print("Testing different masking patterns:")

for name, labels in scenarios.items():
    results = test_loss_masking([labels])
    r = results[0]
    
    print(f"\n  {name}:")
    print(f"    Labels: {labels}")
    print(f"    Masked: {r['masked']}, Unmasked: {r['unmasked']}")
    
    if r['issue']:
        print(f"    ✗ ISSUE: {r['issue']}")
    else:
        print(f"    ✓ Looks good")

print(f"\n" + "=" * 60)
print("How to Fix Loss Masking:")
print()
print("  Correct pattern:")
print("    1. Tokenize prompt → set labels to -100")
print("    2. Tokenize response → set labels to token IDs")
print("    3. Concatenate both")
print()
print("  Example:")
print("    prompt_tokens = [1, 2, 3, 4]")
print("    response_tokens = [5, 6, 7, 8]")
print("    ")
print("    input_ids = [1, 2, 3, 4, 5, 6, 7, 8]")
print("    labels = [-100, -100, -100, -100, 5, 6, 7, 8]")
print("              ^^^ prompt ^^^  ^^^ response ^^^")
print()
print("Always print a few examples from your dataloader")
print("to verify masking is correct before training!")

Pitfall 7: Reward Hacking

The Story:

You’re doing RLHF. Your reward model prefers longer, more detailed responses.

You train your policy model. The rewards are going up! Success!

You check the outputs:

Prompt: "What is the capital of France?"

Response: "The capital of France is Paris Paris Paris Paris Paris 
Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris
Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris..."

Reward: 9.8/10  # High reward!

Your model discovered that the reward model likes long responses. So it just... repeats things. Forever. Gets great rewards. Completely useless.

This is reward hacking. The model found a loophole in your reward function and exploited it.

It’s like when you tell a kid to clean their room, and they shove everything under the bed. Technically clean! Reward achieved! Completely missing the point.

What Happened:

Reward models are imperfect. They capture some aspects of what makes a good response, but not all. And RL algorithms are very good at finding and exploiting edge cases.

If your reward model gives high rewards for length, the policy will maximize length (regardless of quality).
If it rewards confidence, you get overconfident nonsense.
If it rewards using specific words, you get word salad containing those words.

The policy is just optimizing for reward. It doesn’t “know” what you actually wanted.

How to Spot It:

High rewards, terrible outputs. Or outputs that are obviously exploiting some pattern (all the same length, same structure, repetitive, etc.).

import numpy as np

def apply_reward_constraints(response, base_reward):
    """
    Add rule-based penalties to catch reward hacking.
    
    Think of this as guardrails that prevent obvious exploits.
    """
    words = response.split()
    penalties = []
    reward = base_reward
    
    # Penalize repetition
    if words:
        unique_words = len(set(words))
        total_words = len(words)
        unique_ratio = unique_words / total_words
        
        if unique_ratio < 0.5:  # More than half are repeats
            penalty = 5.0
            reward -= penalty
            penalties.append(f"Repetition penalty: -{penalty:.1f} (only {unique_ratio:.0%} unique)")
    
    # Penalize extreme lengths
    if len(words) > 300:
        penalty = 2.0
        reward -= penalty
        penalties.append(f"Too verbose: -{penalty:.1f} ({len(words)} words)")
    
    if len(words) < 5:
        penalty = 3.0
        reward -= penalty
        penalties.append(f"Too short: -{penalty:.1f} ({len(words)} words)")
    
    return reward, penalties

def check_reward_hacking(responses, rewards):
    """Detect if the policy is gaming the reward model."""
    warnings = []
    
    # Check for suspiciously uniform rewards
    if len(rewards) > 1 and np.std(rewards) < 0.1:
        warnings.append("All rewards very similar - possible exploitation")
    
    # Check high-reward responses for obvious hacking
    if rewards:
        high_reward_idx = np.argsort(rewards)[-min(3, len(rewards)):]
        
        for idx in high_reward_idx:
            words = responses[idx].split()
            if words:
                unique_ratio = len(set(words)) / len(words)
                if unique_ratio < 0.5:
                    warnings.append(
                        f"Response {idx} (reward={rewards[idx]:.1f}) is {unique_ratio:.0%} repetitive"
                    )
    
    return warnings

print("Reward Hacking Detection")
print("=" * 60)

# Simulate different types of responses
responses = [
    "Here is a helpful and informative response to your question.",
    "Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris Paris.",  # Repetitive hack!
    "Yes",  # Too short
    "The capital of France is Paris, a beautiful city known for its culture and history.",
]

base_rewards = [7.5, 9.0, 2.0, 8.0]  # Note: repetitive one got high reward!

print("\nApplying Reward Constraints:")
print("(Catching exploits with rule-based penalties)")

for i, (response, base_reward) in enumerate(zip(responses, base_rewards)):
    print(f"\n  Response {i}: \"{response[:60]}{'...' if len(response) > 60 else ''}\"")
    
    adjusted, penalties = apply_reward_constraints(response, base_reward)
    
    print(f"    Base reward: {base_reward:.1f}")
    print(f"    Adjusted reward: {adjusted:.1f}")
    
    if penalties:
        print(f"    Penalties applied:")
        for p in penalties:
            print(f"      • {p}")

print(f"\n" + "-" * 60)
print("Checking for Systematic Hacking:")

warnings = check_reward_hacking(responses, base_rewards)
if warnings:
    print("  ⚠ Warning signs detected:")
    for w in warnings:
        print(f"    • {w}")
else:
    print("  ✓ No obvious hacking detected")

print(f"\n" + "=" * 60)
print("How to Prevent Reward Hacking:")
print()
print("  1. Increase KL penalty (beta parameter)")
print("     → Keeps model close to reference, prevents exploitation")
print()
print("  2. Add rule-based constraints (as shown above)")
print("     → Catches obvious patterns like repetition")
print()
print("  3. Use ensemble of reward models")
print("     → Harder to hack multiple models at once")
print()
print("  4. Train reward model on diverse, adversarial examples")
print("     → Include examples of hacking in training data")
print()
print("  5. Manual review of high-reward outputs")
print("     → Human-in-the-loop catches what automated checks miss")
print()
print("Remember: If rewards are going up but outputs are getting")
print("worse, you're being hacked!")

Debugging Strategies

When something breaks (and it will), here’s how to find the problem:

Think of debugging like a doctor diagnosing a patient. You don’t just guess. You run tests, narrow down possibilities, find the root cause.

Here are two debugging patterns I use constantly.

def bisect_debug(model, sample_batch, optimizer):
    """
    Find which component is broken by testing each step.
    
    This is like checking each domino in a chain to find which one
    is broken. Start at the beginning, test each piece.
    """
    results = {}
    
    # Step 1: Can we access the model?
    try:
        _ = sum(1 for _ in model.parameters())
        results['model_accessible'] = {'passed': True, 'error': None}
    except Exception as e:
        results['model_accessible'] = {'passed': False, 'error': str(e)}
        return results  # Can't continue without model
    
    # Step 2: Can we run a forward pass?
    try:
        outputs = model(sample_batch)
        results['forward_pass'] = {'passed': True, 'error': None}
    except Exception as e:
        results['forward_pass'] = {'passed': False, 'error': str(e)}
        return results  # Can't continue without forward pass
    
    # Step 3: Can we compute gradients?
    try:
        loss = outputs.sum()  # Simple loss for testing
        loss.backward()
        results['backward_pass'] = {'passed': True, 'error': None}
    except Exception as e:
        results['backward_pass'] = {'passed': False, 'error': str(e)}
        return results  # Can't continue without gradients
    
    # Step 4: Can we update weights?
    try:
        optimizer.step()
        results['optimizer_step'] = {'passed': True, 'error': None}
    except Exception as e:
        results['optimizer_step'] = {'passed': False, 'error': str(e)}
    
    return results

def check_gradients(model):
    """
    Check gradient health across all parameters.
    
    Gradients should be: not zero, not NaN, not too large.
    """
    grad_stats = {
        'zero_grads': [],      # Parameters with zero gradient
        'large_grads': [],     # Parameters with suspiciously large gradients
        'nan_grads': [],       # Parameters with NaN gradients
        'normal_grads': 0      # Parameters with normal gradients
    }
    
    grad_norms = []
    
    for name, param in model.named_parameters():
        if param.requires_grad and param.grad is not None:
            grad_norm = param.grad.norm().item()
            grad_norms.append(grad_norm)
            
            if torch.isnan(param.grad).any():
                grad_stats['nan_grads'].append(name)
            elif grad_norm == 0:
                grad_stats['zero_grads'].append(name)
            elif grad_norm > 100:
                grad_stats['large_grads'].append((name, f"{grad_norm:.2f}"))
            else:
                grad_stats['normal_grads'] += 1
    
    grad_stats['avg_norm'] = np.mean(grad_norms) if grad_norms else 0
    grad_stats['max_norm'] = max(grad_norms) if grad_norms else 0
    
    return grad_stats

print("Debugging Utilities")
print("=" * 60)

# Demo 1: Bisect debugging
print("\n1. Bisect Debugging")
print("   (Find which component is failing)")

model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
sample_input = torch.randn(4, 10)

results = bisect_debug(model, sample_input, optimizer)

print()
for test_name, result in results.items():
    status = "✓ PASS" if result['passed'] else f"✗ FAIL"
    print(f"  {test_name:20s} {status}")
    if result['error']:
        print(f"    Error: {result['error']}")

print()
print("  → All steps passed! Model and optimizer working correctly.")

# Demo 2: Gradient checking
print(f"\n" + "-" * 60)
print("2. Gradient Health Check")
print("   (Make sure gradients are reasonable)")

# Reset and compute gradients
model = SimpleModel()
x = torch.randn(4, 10)
y = model(x)
y.sum().backward()

grad_stats = check_gradients(model)

print()
print(f"  Normal gradients:  {grad_stats['normal_grads']}")
print(f"  Zero gradients:    {len(grad_stats['zero_grads'])}")
print(f"  NaN gradients:     {len(grad_stats['nan_grads'])}")
print(f"  Large gradients:   {len(grad_stats['large_grads'])}")
print()
print(f"  Average magnitude: {grad_stats['avg_norm']:.4f}")
print(f"  Max magnitude:     {grad_stats['max_norm']:.4f}")
print()
print("  → Gradients look healthy!")

# Demo 3: Detecting a problem
print(f"\n" + "-" * 60)
print("3. Detecting Gradient Problems")

# Inject an issue
model.linear1.weight.grad = torch.zeros_like(model.linear1.weight.grad)
grad_stats = check_gradients(model)

print()
print(f"  After zeroing linear1.weight gradient:")
print(f"    Zero gradients detected: {grad_stats['zero_grads']}")
print()
print("  ^ This would indicate linear1.weight isn't being trained!")

print(f"\n" + "=" * 60)
print("When to Use These Tools:")
print()
print("  Bisect debugging:")
print("    • Training crashes with cryptic error")
print("    • Not sure which component is broken")
print("    • Want to isolate the problem")
print()
print("  Gradient checking:")
print("    • Loss not decreasing")
print("    • Suspicious training behavior")
print("    • After making architecture changes")
print()
print("Add these checks to your training loop during")
print("development. Remove them once everything is working.")

The Pre-Flight Checklist

Before you start training, check these:

Think of this like a pilot’s pre-flight checklist. Takes two minutes. Catches 90% of problems before they waste hours of training time.

Environment

  • PyTorch installed and importable

  • GPU accessible (torch.cuda.is_available() returns True)

  • Enough GPU memory for your batch size

  • Correct CUDA/ROCm version

Data

  • Dataset loads without errors

  • Loss masking is correct (some -100, some token IDs)

  • No empty examples in your data

  • Tokenization produces reasonable-looking tensors

  • Batch shapes are what you expect

Model

  • Model loads successfully

  • Has trainable parameters (> 0)

  • LoRA adapters applied if you intended to use them

  • Model moved to GPU

  • Forward pass works on sample batch

Optimizer

  • Learning rate in reasonable range (1e-6 to 1e-4)

  • Optimizer has the parameters you think it does

  • Gradient clipping enabled (max_grad_norm=1.0)

  • Warmup configured if needed

Training Loop

  • Loss is computed correctly

  • Gradients are being calculated

  • Weights are being updated

  • Logging is working

Method-Specific Checks

For DPO:

  • Reference model is frozen

  • Beta (KL penalty) is set (typical: 0.1)

  • Both policy and reference on same device

For RLHF:

  • Reward model is frozen during policy training

  • KL coefficient set appropriately

  • Value network separate from policy

Run through this list. Find bugs before they waste hours.

(I’ve wasted the hours so you don’t have to.)
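
If you want to automate part of this list, here's a rough sketch (my own, covering only the items that are easy to test in code: GPU availability, trainable parameters, learning-rate range, and a forward pass on a sample batch).

def preflight_check(model, optimizer, sample_batch):
    """Automate a handful of the checklist items above."""
    report = {}
    report['cuda_available'] = torch.cuda.is_available()

    trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
    report['trainable_params'] = trainable

    lr = optimizer.param_groups[0]['lr'] if optimizer.param_groups else None
    report['lr_in_range'] = lr is not None and 1e-6 <= lr <= 1e-4

    try:
        model(sample_batch)
        report['forward_pass'] = True
    except Exception as e:
        report['forward_pass'] = False
        report['forward_error'] = str(e)

    report['ok'] = trainable > 0 and report['lr_in_range'] and report['forward_pass']
    return report

model = SimpleModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
print(preflight_check(model, optimizer, torch.randn(4, 10)))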

The Hall of Shame

Most common mistakes, ranked by how much time they waste:

1. Learning Rate Too High

Symptom: Loss becomes NaN
Time wasted: 3+ hours before you notice
Fix: Reduce LR by 10x, add gradient clipping
Prevention: Start conservative (1e-5), increase if needed

2. Wrong Loss Masking

Symptom: Model doesn’t learn anything useful
Time wasted: Could be days before you realize
Fix: Print your labels, verify -100 placement
Prevention: Always inspect first batch before training

3. Frozen Model

Symptom: Loss barely moves
Time wasted: However long you wait before checking
Fix: Check requires_grad, enable if needed
Prevention: Print trainable parameter count at startup

4. Overfitting

Symptom: Train loss goes down, val loss goes up
Time wasted: All epochs past the sweet spot
Fix: Use earlier checkpoint, reduce epochs
Prevention: Monitor both train and val loss

5. Reference Not Frozen (DPO/RLHF)

Symptom: KL divergence explodes
Time wasted: Full training run before you notice
Fix: Freeze reference model, restart
Prevention: Check requires_grad on reference

6. No Gradient Clipping

Symptom: Training unstable, occasional NaN
Time wasted: Multiple failed runs
Fix: Add max_grad_norm=1.0
Prevention: Always enable gradient clipping

7. Catastrophic Forgetting

Symptom: Model only speaks your domain language
Time wasted: Only noticed during final evaluation
Fix: Start over with LoRA or lower LR
Prevention: Test general knowledge before and after

8. Reward Hacking

Symptom: High rewards, terrible outputs
Time wasted: Full RLHF training run
Fix: Increase KL penalty, add constraints
Prevention: Manually check high-reward samples

9. Bad Data Quality

Symptom: Model learns nonsense patterns
Time wasted: Could be forever if you don’t realize
Fix: Clean your data
Prevention: Manually inspect training examples

10. Batch Size Too Large

Symptom: CUDA out of memory
Time wasted: 5 minutes per crash
Fix: Reduce batch size, enable gradient checkpointing
Prevention: Start small, increase until OOM, then back off

Quick Reference Table

Problem                    Symptom                     Quick Fix
Loss = NaN                 Sudden infinity → NaN       LR ÷ 10, add grad clipping
Loss stuck                 Barely changing             Check trainable params
Train << Val               Growing gap                 Stop early, add regularization
Model speaks only domain   Failed general knowledge    Use LoRA, lower LR
KL too high                Divergence > 1.0            Increase beta, lower LR
OOM                        CUDA memory error           Reduce batch size

Print this table. Tape it to your monitor. Thank me later.

You Made It!

Congratulations. You now know how to break and fix transformer training.

More importantly, you know how to debug it. Because that’s the real skill.

Anyone can copy a training script and run it. The question is: what do you do when it breaks?

Now you know:

  • How to recognize the seven deadliest pitfalls

  • How to diagnose what’s actually wrong

  • How to fix it quickly instead of wasting days

  • How to prevent the problem next time

What You’ve Learned (The Whole Series)

Looking back at this entire fine-tuning section:

SFT: You learned how to teach a model new behaviors through examples, with proper instruction formatting and loss masking.

Reward Models: You learned how to capture human preferences in a model that scores responses.

RLHF: You learned how to use reinforcement learning (PPO) to optimize for those preferences, with all its complexity.

DPO: You learned a simpler approach that skips RL entirely and optimizes preferences directly.

Advanced Topics: You learned about memory optimization, hyperparameter tuning, and evaluation metrics.

Debugging: (This notebook) You learned what goes wrong and how to fix it.

That’s the full pipeline. From raw model to fine-tuned, preference-aligned, debugged system.

What’s Next?

Go try it. Pick a model. Pick a task. Fine-tune something.

You’ll break things. That’s fine. You now know how to fix them.

And when you inevitably spend three hours debugging, only to discover you forgot to set requires_grad=True?

You’ll laugh. Print out this notebook. Tape it to your wall.

Welcome to the club.


Check out the Try It notebook if you want hands-on practice with these debugging techniques!