DPO Training: Alignment Without the Reward Model

We’ve covered reward models. We’ve seen how they judge which responses are better. And maybe you’re thinking: okay, so we train a reward model, then use it to fine-tune our language model.

But what if I told you there’s a shortcut?

What if you could skip the reward model entirely and train your language model directly on preference data?

That’s DPO. Direct Preference Optimization.

And it’s kind of brilliant.

The Traditional Pipeline (And Why It’s Complicated)

The classic RLHF approach goes like this:

1. Start with a base language model
2. Train a reward model on preference data (that's what we just did!)
3. Use that reward model to score responses
4. Use reinforcement learning (PPO) to optimize the language model
5. Deal with all the instability and complexity that PPO brings

It works. Companies use it. But it’s a pain.

You’re juggling multiple models, dealing with RL instability, tuning a bunch of hyperparameters, and basically hoping everything converges nicely.

The DPO Insight

Here’s the key realization that led to DPO:

If you know the optimal policy (the perfect language model), you can derive what the optimal reward function must be.

Read that again. It’s wild.

We usually think: reward function → optimal policy. But it also works backward: optimal policy → reward function.
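
Here's the piece of math that makes this work (it's the central result of the DPO paper, stated here without the derivation). The standard RLHF objective asks for a policy that maximizes reward while staying close, in KL terms, to a frozen reference model π_ref. That objective has a closed-form solution, and if you invert it, the reward can be written purely in terms of the policy:

r(x, y) = β · log(π(y|x) / π_ref(y|x)) + β · log Z(x)

Here β controls how tightly the policy is tied to the reference (both come back in a moment), and Z(x) is a normalizer that depends only on the prompt. That last part is the trick: when you compare two responses to the same prompt, the Z(x) terms cancel, so you never have to compute them. Plug this expression into the standard preference-modeling loss and the reward model drops out of the pipeline entirely.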

So instead of:

  1. Train reward model

  2. Use reward model to train policy

We do:

  1. Train policy directly to match what the optimal policy would be

No reward model. No RL. Just a clean loss function that optimizes your language model directly on preferences.

It’s like... instead of learning how to judge food quality and then using those judgments to become a better chef, you just study what makes great food great and cook accordingly.

Does it work? Hell yes. In the original paper's evaluations it matched or beat PPO-based RLHF, and it has since become a default choice for open-source alignment work.

import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
import copy
from tqdm.auto import tqdm
from dataclasses import dataclass

@dataclass
class DPOConfig:
    """Configuration for DPO training.
    
    These hyperparameters control how we train. Let's break down what each one does:
    
    - model_name: Which pretrained model to start from
    - beta: How much we penalize diverging from the reference model (more on this later!)
    - learning_rate: How big our update steps are (very small, like RLHF)
    - batch_size: How many preference pairs per training step
    - num_epochs: How many times we go through the dataset
    - max_length: Maximum sequence length in tokens
    - warmup_steps: Gradually increase learning rate at the start
    - max_grad_norm: Clip gradients to prevent exploding updates
    - label_smoothing: Optional regularization (we'll leave it at 0)
    """
    model_name: str = "gpt2"
    beta: float = 0.1
    learning_rate: float = 1e-6
    batch_size: int = 4
    num_epochs: int = 1
    max_length: int = 512
    warmup_steps: int = 100
    max_grad_norm: float = 1.0
    label_smoothing: float = 0.0

config = DPOConfig()
print("DPO Configuration:")
for k, v in vars(config).items():
    print(f"  {k}: {v}")
DPO Configuration:
  model_name: gpt2
  beta: 0.1
  learning_rate: 1e-06
  batch_size: 4
  num_epochs: 1
  max_length: 512
  warmup_steps: 100
  max_grad_norm: 1.0
  label_smoothing: 0.0

The DPO Dataset

Remember our preference data format? Prompts with chosen and rejected responses?

That’s exactly what we need here too. Same format, different use case.

The dataset needs to:

  1. Take each preference pair

  2. Tokenize both the chosen and rejected responses

  3. Return them in a format ready for training

Nothing fancy. We’re just preparing the data for our loss function to work with.

class DPODataset(Dataset):
    """
    Dataset for DPO training.
    
    Each item contains:
    - chosen: The preferred response
    - rejected: The dispreferred response
    
    We tokenize both and return them ready for the training loop.
    """
    
    def __init__(self, data, tokenizer, max_length=512):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def __getitem__(self, idx):
        item = self.data[idx]
        
        # Tokenize chosen response
        chosen_tokens = self.tokenizer(
            item['chosen'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        
        # Tokenize rejected response
        rejected_tokens = self.tokenizer(
            item['rejected'],
            max_length=self.max_length,
            truncation=True,
            padding='max_length',
            return_tensors='pt'
        )
        
        return {
            'chosen_input_ids': chosen_tokens['input_ids'].squeeze(0),
            'chosen_attention_mask': chosen_tokens['attention_mask'].squeeze(0),
            'rejected_input_ids': rejected_tokens['input_ids'].squeeze(0),
            'rejected_attention_mask': rejected_tokens['attention_mask'].squeeze(0),
        }

# Let's see it in action with a simple example
print("DPODataset Demo")
print("=" * 60)

# Create a tiny example dataset
example_data = [
    {
        'chosen': "The capital of France is Paris, a beautiful city known for the Eiffel Tower.",
        'rejected': "I don't know what the capital of France is."
    }
]

# We need a tokenizer
from transformers import AutoTokenizer
demo_tokenizer = AutoTokenizer.from_pretrained("gpt2")
demo_tokenizer.pad_token = demo_tokenizer.eos_token

# Create the dataset
demo_dataset = DPODataset(example_data, demo_tokenizer, max_length=64)
sample = demo_dataset[0]

print(f"\nWhat we get from the dataset:")
print(f"  chosen_input_ids shape: {sample['chosen_input_ids'].shape}")
print(f"  chosen_attention_mask shape: {sample['chosen_attention_mask'].shape}")
print(f"  rejected_input_ids shape: {sample['rejected_input_ids'].shape}")
print(f"  rejected_attention_mask shape: {sample['rejected_attention_mask'].shape}")

# Show what the tokens look like when decoded
print(f"\nChosen response (decoded):")
print(f'  "{demo_tokenizer.decode(sample["chosen_input_ids"][:20])}..."')
print(f"\nRejected response (decoded):")
print(f'  "{demo_tokenizer.decode(sample["rejected_input_ids"][:15])}..."')

print("\nPerfect! Both responses tokenized and ready for training.")
DPODataset Demo
============================================================

What we get from the dataset:
  chosen_input_ids shape: torch.Size([64])
  chosen_attention_mask shape: torch.Size([64])
  rejected_input_ids shape: torch.Size([64])
  rejected_attention_mask shape: torch.Size([64])

Chosen response (decoded):
  "The capital of France is Paris, a beautiful city known for the Eiffel Tower.<|endoftext|><|endoftext|>..."

Rejected response (decoded):
  "I don't know what the capital of France is.<|endoftext|><|endoftext|><|endoftext|><|endoftext|>..."

Perfect! Both responses tokenized and ready for training.

Computing Log Probabilities: The Foundation

Alright, here’s where we get into the mechanics.

DPO needs to know: how likely is this sequence of tokens under my model?

This is actually the same computation we do during normal language model training. We run the model, get logits for each position, convert to probabilities, and sum up the log probabilities of the actual tokens.

But there’s a catch: we need to do this for entire sequences, not just individual tokens.

Why? Because DPO compares complete responses. We want to know: did the model think response A was more likely than response B?

Let me show you how we compute this.

def get_sequence_log_probs(
    model,
    input_ids: torch.Tensor,
    attention_mask: torch.Tensor
) -> torch.Tensor:
    """
    Compute the total log probability of a sequence under the model.
    
    This is THE key computation in DPO. We need to know:
    "How likely is this entire sequence under my model?"
    
    The process:
    1. Run the model to get logits at each position
    2. Convert logits to log probabilities
    3. Extract the log prob of each actual token
    4. Sum them up (ignoring padding)
    
    Args:
        model: The language model
        input_ids: Token IDs, shape (batch_size, seq_len)
        attention_mask: Which tokens are real vs padding, shape (batch_size, seq_len)
    
    Returns:
        Total log probability for each sequence, shape (batch_size,)
    """
    # Run the model
    outputs = model(input_ids=input_ids, attention_mask=attention_mask)
    logits = outputs.logits
    
    # Shift for next-token prediction
    # Position i in logits predicts position i+1 in input_ids
    # So logits[:, :-1] predicts input_ids[:, 1:]
    shift_logits = logits[:, :-1, :]  # (batch, seq_len-1, vocab_size)
    shift_labels = input_ids[:, 1:]   # (batch, seq_len-1)
    shift_mask = attention_mask[:, 1:]  # (batch, seq_len-1)
    
    # Convert logits to log probabilities
    log_probs = F.log_softmax(shift_logits, dim=-1)  # (batch, seq_len-1, vocab_size)
    
    # Gather the log prob of each actual token
    # For each position, extract log_prob[actual_token_id]
    token_log_probs = torch.gather(
        log_probs,
        dim=-1,
        index=shift_labels.unsqueeze(-1)  # Add vocab dimension
    ).squeeze(-1)  # Remove it: (batch, seq_len-1)
    
    # Mask out padding tokens and sum
    # We only want to sum log probs of real tokens
    masked_log_probs = token_log_probs * shift_mask
    sequence_log_probs = masked_log_probs.sum(dim=-1)  # (batch,)
    
    return sequence_log_probs

# Let's see this in action!
print("Testing get_sequence_log_probs")
print("=" * 60)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}\n")

# Load a model for demonstration
demo_model = AutoModelForCausalLM.from_pretrained("gpt2")
demo_model.to(device)
demo_model.eval()

# Create two test sequences
# One should be much more likely (coherent English)
# The other should be unlikely (gibberish)
test_texts = [
    "The quick brown fox jumps over the lazy dog.",
    "Asdfghjkl qwerty zxcvbnm random gibberish text."
]

inputs = demo_tokenizer(
    test_texts, 
    return_tensors="pt", 
    padding=True, 
    truncation=True, 
    max_length=32
)
inputs = {k: v.to(device) for k, v in inputs.items()}

# Compute log probabilities
with torch.no_grad():
    log_probs = get_sequence_log_probs(
        demo_model, 
        inputs['input_ids'], 
        inputs['attention_mask']
    )

print("Results:")
for text, lp in zip(test_texts, log_probs):
    print(f'  "{text}"')
    print(f'    Log probability: {lp.item():.2f}\n')

print("Key insight:")
print("  - Log probabilities are negative (probabilities are between 0 and 1)")
print("  - HIGHER (less negative) = more likely under the model")
print("  - The coherent sentence should have a higher log prob")
print("  - And it does! The model knows English better than gibberish.")
Testing get_sequence_log_probs
============================================================
Using device: cuda

Results:
  "The quick brown fox jumps over the lazy dog."
    Log probability: -45.81

  "Asdfghjkl qwerty zxcvbnm random gibberish text."
    Log probability: -106.50

Key insight:
  - Log probabilities are negative (probabilities are between 0 and 1)
  - HIGHER (less negative) = more likely under the model
  - The coherent sentence should have a higher log prob
  - And it does! The model knows English better than gibberish.

The DPO Loss Function

Okay, deep breath. This is the heart of DPO.

We have two models:

  1. Policy model (π): The model we’re training

  2. Reference model (π_ref): A frozen copy of the starting model

For each preference pair, we:

  1. Compute how likely the chosen response is under both models

  2. Compute how likely the rejected response is under both models

  3. Compare the ratios

Here’s the formula:

loss = -log(σ(β · (log(π(chosen)) - log(π_ref(chosen)) - log(π(rejected)) + log(π_ref(rejected)))))

Whoa. Let’s break that down in English:

β · (log(π(chosen)) - log(π_ref(chosen)) - log(π(rejected)) + log(π_ref(rejected)))

This is asking: “How much more does the policy prefer the chosen response over the rejected response, compared to the reference model?”

Let me show you the intuition:

  • log(π(chosen)) - log(π_ref(chosen)): How much more likely is chosen under policy vs reference?

  • log(π(rejected)) - log(π_ref(rejected)): How much more likely is rejected under policy vs reference?

  • Subtract them: Policy should favor chosen MORE than it favors rejected

  • Multiply by β: Control how much we penalize diverging from reference

Then we wrap it in a sigmoid and take the negative log. This creates a loss that:

  • Goes down when policy assigns higher probability to chosen

  • Goes down when policy assigns lower probability to rejected

  • Penalizes the policy for diverging too far from the reference

That β parameter? It’s the coefficient on the KL penalty. It keeps your model from going completely off the rails and forgetting everything it learned during pretraining.

Pretty elegant, right?
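
To make that concrete, here's the loss on a single made-up preference pair. The log-probability numbers are invented purely for illustration:

import torch
import torch.nn.functional as F

beta = 0.1

# Invented sequence log probabilities for one preference pair
policy_chosen, policy_rejected = torch.tensor(-40.0), torch.tensor(-55.0)
ref_chosen, ref_rejected = torch.tensor(-45.0), torch.tensor(-50.0)

# How much more the policy likes each response than the reference does
chosen_logratio = policy_chosen - ref_chosen        # +5.0
rejected_logratio = policy_rejected - ref_rejected  # -5.0

logits = beta * (chosen_logratio - rejected_logratio)  # 0.1 * 10.0 = 1.0
loss = -F.logsigmoid(logits)
print(round(loss.item(), 4))  # ~0.3133, below log(2) ≈ 0.693, the loss when policy and reference agree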

def train_dpo(policy_model, reference_model, train_loader, config, device):
    """
    The complete DPO training loop.
    
    This is where everything comes together:
    - We iterate through batches of preference pairs
    - Compute log probs under both policy and reference models
    - Calculate the DPO loss
    - Update the policy model
    
    The reference model stays frozen the entire time.
    """
    
    # Standard optimizer setup
    optimizer = torch.optim.AdamW(
        policy_model.parameters(),
        lr=config.learning_rate
    )
    
    # Learning rate scheduler with warmup
    total_steps = len(train_loader) * config.num_epochs
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config.warmup_steps,
        num_training_steps=total_steps
    )
    
    # Set model modes
    policy_model.train()
    reference_model.eval()  # Never changes!
    
    for epoch in range(config.num_epochs):
        epoch_metrics = {'loss': 0, 'accuracy': 0}
        
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}")
        
        for batch in progress_bar:
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Compute log probs under policy model (trainable)
            policy_chosen_logps = get_sequence_log_probs(
                policy_model,
                batch['chosen_input_ids'],
                batch['chosen_attention_mask']
            )
            policy_rejected_logps = get_sequence_log_probs(
                policy_model,
                batch['rejected_input_ids'],
                batch['rejected_attention_mask']
            )
            
            # Compute log probs under reference model (frozen)
            with torch.no_grad():
                ref_chosen_logps = get_sequence_log_probs(
                    reference_model,
                    batch['chosen_input_ids'],
                    batch['chosen_attention_mask']
                )
                ref_rejected_logps = get_sequence_log_probs(
                    reference_model,
                    batch['rejected_input_ids'],
                    batch['rejected_attention_mask']
                )
            
            # The DPO loss computation
            # Step 1: Compute log ratios (how much more likely under policy vs reference)
            chosen_logratios = policy_chosen_logps - ref_chosen_logps
            rejected_logratios = policy_rejected_logps - ref_rejected_logps
            
            # Step 2: Compute the logits
            # This is β * (chosen advantage - rejected advantage)
            logits = config.beta * (chosen_logratios - rejected_logratios)
            
            # Step 3: Apply sigmoid and negative log
            # We want logits to be positive (chosen preferred over rejected)
            loss = -F.logsigmoid(logits).mean()
            
            # Backward pass and optimization
            optimizer.zero_grad()
            loss.backward()
            
            # Clip gradients to prevent explosions
            torch.nn.utils.clip_grad_norm_(
                policy_model.parameters(), 
                config.max_grad_norm
            )
            
            optimizer.step()
            scheduler.step()
            
            # Track metrics
            # Accuracy: how often does policy prefer chosen over rejected?
            # (If logits > 0, we predicted chosen correctly)
            accuracy = (logits > 0).float().mean()
            
            epoch_metrics['loss'] += loss.item()
            epoch_metrics['accuracy'] += accuracy.item()
            
            # Update progress bar
            progress_bar.set_postfix({
                'loss': f"{loss.item():.4f}",
                'acc': f"{accuracy.item():.2%}"
            })
        
        # End of epoch summary
        avg_loss = epoch_metrics['loss'] / len(train_loader)
        avg_acc = epoch_metrics['accuracy'] / len(train_loader)
        print(f"\nEpoch {epoch+1} Summary:")
        print(f"  Average Loss: {avg_loss:.4f}")
        print(f"  Average Accuracy: {avg_acc:.2%}")
    
    return policy_model
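
One loose end: DPOConfig exposes label_smoothing, but the loop above never uses it (it's left at 0, so nothing is lost). If your preference labels are noisy and you want a nonzero value, a common variant of the loss (often called conservative DPO) hedges against flipped labels. Here's a sketch of what Step 3 would become; this is my adaptation, not part of the loop above:

# Step 3, with label smoothing (eps = 0.0 recovers the standard DPO loss)
eps = config.label_smoothing
loss = (
    -F.logsigmoid(logits) * (1 - eps)   # assume the pair is labeled correctly...
    - F.logsigmoid(-logits) * eps       # ...but hedge against a flipped label
).mean()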

Setting Up For Training

Time to actually run DPO!

We need:

  1. A reference model (frozen - this is our anchor)

  2. A policy model (trainable - this is what we’re improving)

  3. Preference data (chosen vs rejected pairs)

The reference model is crucial. Without it, the policy could drift into nonsense that happens to score well on our loss. The reference keeps us grounded.
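
There's more than one way to get that frozen reference. The cell below simply loads GPT-2 from pretrained twice, which works fine here. An equivalent pattern, common in DPO scripts, is to snapshot the policy before any training happens (this sketch uses the copy module imported at the top):

# Alternative setup: deep-copy the policy before training, then freeze the copy
reference_model = copy.deepcopy(policy_model)
for param in reference_model.parameters():
    param.requires_grad = False
reference_model.eval()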

# Setup
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}\n")

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
print("Tokenizer loaded\n")

# Load reference model (this stays frozen!)
print("Loading reference model...")
reference_model = AutoModelForCausalLM.from_pretrained("gpt2")

# Freeze all parameters
for param in reference_model.parameters():
    param.requires_grad = False

reference_model.to(device)
reference_model.eval()
print("  Reference model: FROZEN (this is our anchor)\n")

# Load policy model (this gets trained!)
print("Loading policy model...")
policy_model = AutoModelForCausalLM.from_pretrained("gpt2")
policy_model.to(device)
print("  Policy model: TRAINABLE (this is what we improve)")

# Count trainable parameters
num_params = sum(p.numel() for p in policy_model.parameters() if p.requires_grad)
print(f"  Trainable parameters: {num_params:,}\n")

print("Models ready! Reference frozen, policy ready to learn.")
Using device: cuda

Tokenizer loaded

Loading reference model...
  Reference model: FROZEN (this is our anchor)

Loading policy model...
  Policy model: TRAINABLE (this is what we improve)
  Trainable parameters: 124,439,808

Models ready! Reference frozen, policy ready to learn.
# Load the preference dataset
print("Loading preference data...")
from datasets import load_dataset

# Anthropic's HH-RLHF dataset (same one we used for reward modeling!)
raw_data = load_dataset("Anthropic/hh-rlhf", split="train")

# We'll use a small subset for this demo
# (In practice, you'd use more data)
raw_data = raw_data.select(range(500))
print(f"  Loaded {len(raw_data)} preference pairs\n")

# Wrap it in our DPO dataset
dpo_dataset = DPODataset(raw_data, tokenizer, max_length=256)

# Create dataloader
train_loader = DataLoader(
    dpo_dataset, 
    batch_size=config.batch_size, 
    shuffle=True
)

print(f"DataLoader ready:")
print(f"  Batch size: {config.batch_size}")
print(f"  Number of batches: {len(train_loader)}")
print(f"  Total training steps: {len(train_loader) * config.num_epochs}")
Loading preference data...
  Loaded 500 preference pairs

DataLoader ready:
  Batch size: 4
  Number of batches: 125
  Total training steps: 125
# Run DPO training!
print("Starting DPO Training")
print("=" * 60)
print("\nWhat's happening:")
print("  - Policy model learns to prefer chosen over rejected")
print("  - Reference model keeps us from drifting too far")
print("  - β controls the tradeoff\n")

policy_model = train_dpo(
    policy_model, 
    reference_model, 
    train_loader, 
    config, 
    device
)

print("\n" + "=" * 60)
print("DPO training complete!")
print("\nThe policy model now:")
print("  ✓ Assigns higher probability to preferred responses")
print("  ✓ Assigns lower probability to dispreferred responses")
print("  ✓ Stays grounded by the reference model")
Starting DPO Training
============================================================

What's happening:
  - Policy model learns to prefer chosen over rejected
  - Reference model keeps us from drifting too far
  - β controls the tradeoff

Epoch 1 Summary:
  Average Loss: 0.9530
  Average Accuracy: 56.80%

============================================================
DPO training complete!

The policy model now:
  ✓ Assigns higher probability to preferred responses
  ✓ Assigns lower probability to dispreferred responses
  ✓ Stays grounded by the reference model

Understanding the Key Hyperparameters

Let’s talk about the knobs you can turn and what they actually do.

β (Beta) - The KL Penalty Coefficient

Default: 0.1

This is the most important hyperparameter in DPO.

Think of it as a leash. Higher β = shorter leash. The policy can’t stray as far from the reference model.

  • Too high (β = 1.0): Policy barely changes, stuck close to reference

  • Too low (β = 0.01): Policy might overfit to preferences, forget general knowledge

  • Just right (β = 0.1-0.5): Sweet spot for most tasks

How do you know if β is right? If your model starts generating nonsense or forgetting basic facts, β is probably too low.
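
A more quantitative check is to track the implicit rewards during training: by the DPO derivation, they're just the β-scaled log-ratios between policy and reference. A sketch of what you could add inside the training loop above, next to the accuracy metric:

# Implicit rewards: how far the policy has moved from the reference, scaled by beta
chosen_rewards = config.beta * (policy_chosen_logps - ref_chosen_logps).detach()
rejected_rewards = config.beta * (policy_rejected_logps - ref_rejected_logps).detach()
reward_margin = (chosen_rewards - rejected_rewards).mean()

# Healthy training: the margin grows steadily.
# Warning sign: chosen_rewards.mean() plunging far below zero means the policy is
# drifting away from the reference even on responses humans preferred; consider
# raising beta or lowering the learning rate.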

Learning Rate

Default: 1e-6

Tiny. Like, really tiny.

This is similar to RLHF - we’re making small, careful updates. Language models are sensitive, and DPO is pushing them in a specific direction. Go too fast and you’ll overshoot.

Start with 1e-6. If training is too slow (loss barely moving), try 5e-6. If it’s unstable (loss jumping around), try 5e-7.

Number of Epochs

Default: 1-3

With preference data, you can overfit fast.

Unlike pretraining where more data is always better, with DPO you’re teaching specific preferences. Too many epochs and the model memorizes the training set instead of learning general principles.

One epoch is often enough. Three is usually the max. If you need more, you probably need more diverse data, not more epochs.

Batch Size

Default: 4-32

Standard deep learning advice applies: bigger batches = more stable gradients = faster training.

But: bigger batches = more memory. If you’re GPU-limited, use smaller batches and accumulate gradients.

The actual batch size matters less than getting enough total training steps.
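
If memory is the constraint, gradient accumulation gets you a large effective batch without the memory cost. A sketch of how the inner loop above would change (accumulation_steps is a new knob, not part of DPOConfig):

accumulation_steps = 8  # effective batch size = config.batch_size * 8

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    # ... compute the DPO loss exactly as in train_dpo ...
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient is an average

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(policy_model.parameters(), config.max_grad_norm)
        optimizer.step()
        scheduler.step()  # if you do this, also build the scheduler with
                          # num_training_steps = total_steps // accumulation_steps
        optimizer.zero_grad()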

What We Just Built

Take a step back and appreciate what just happened.

We trained a language model to align with human preferences without:

  • Training a separate reward model

  • Running reinforcement learning

  • Dealing with PPO’s complexity and instability

Instead, we:

  1. Kept a frozen copy of the starting model (reference)

  2. Trained a new copy (policy) to prefer chosen over rejected responses

  3. Used β to prevent the policy from drifting too far

The math is elegant. The implementation is straightforward. The results are competitive with (and sometimes better than) full RLHF.

This is why DPO has become so popular. It democratizes alignment - you don’t need an RL expert on your team to make it work.

When to Use DPO

DPO shines when:

  • You have good preference data

  • You want alignment without RL complexity

  • You’re fine-tuning a model that’s already been supervised fine-tuned

  • You want something stable and predictable

DPO might not be ideal when:

  • Your preference data is noisy (reward models can be more robust)

  • You need complex reward shaping

  • You want to combine multiple objectives

Next Steps

From here you could:

  • Experiment with different β values to see how they affect the tradeoff

  • Try DPO on your own preference dataset

  • Combine DPO with LoRA for parameter-efficient training (see the sketch just below this list)

  • Evaluate your DPO model on a held-out test set
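
Here's what that LoRA idea could look like in practice. This is a rough sketch using the peft library; the target module name ("c_attn") is specific to GPT-2, and the rank, alpha, and dropout values are just illustrative starting points, not tuned recommendations.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=8,                        # adapter rank
    lora_alpha=16,              # scaling factor applied to the adapter output
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    task_type="CAUSAL_LM",
)

policy_model = get_peft_model(policy_model, lora_config)
policy_model.print_trainable_parameters()  # a tiny fraction of the 124M parameters

# Nice side effect: with adapters, the frozen base weights already behave like the
# reference model, so you can avoid keeping a second full copy in memory.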

The beauty of DPO is that it’s just supervised learning with a clever loss function. Everything you know about training neural networks still applies.