Introduction to DPO (Direct Preference Optimization)

Remember RLHF? That whole pipeline with supervised fine-tuning, then training a reward model, then doing reinforcement learning with PPO?

Yeah. What if I told you there’s a simpler way?

What is DPO?¶

DPO stands for Direct Preference Optimization. Let’s break down what each word actually means:

Direct — We skip the middleman (the reward model) and optimize directly on preferences
Preference — We’re still using the same kind of data: “this response is better than that one”
Optimization — We’re training a model to get better at something (generating preferred responses)

DPO accomplishes the same goal as RLHF (aligning models with human preferences), but it does it with a completely different approach.

Instead of:

Train a reward model to predict which responses are good
Use reinforcement learning to make the language model chase high rewards

We just:

Train the language model directly on preference pairs

That’s it. One step instead of two.

(You might be thinking: “Wait, if it’s that much simpler, why did we do RLHF first?” Great question! RLHF came first historically, and DPO is a more recent mathematical insight that shows we can skip some steps.)

DPO vs RLHF: The Practical Differences¶

Let’s make this concrete. Here’s what you need for each approach:

Aspect	RLHF	DPO
Pipeline	SFT → Reward Model → PPO	SFT → DPO
Models in memory	4 copies (policy, value, reward, reference)	2 copies (policy, reference)
Training complexity	High (RL is tricky)	Low (supervised learning)
Training type	Reinforcement learning	Classification-style loss
Stability	Can be unstable (reward hacking!)	Generally stable
Memory needed	~4x your base model size	~2x your base model size

The memory difference is huge if you’re training large models. With RLHF, training a 7B model means you need roughly 28GB of model weights in VRAM (4 models × 7B parameters). With DPO, that drops to 14GB (2 models × 7B).

Also, no reinforcement learning means no worrying about whether your RL algorithm is converging properly, whether the reward model is being “hacked” by the policy, or any of the other headaches that come with RL.

DPO is just... simpler. And simpler is often better.

The Key Insight (Or: The Math That Makes It All Work)¶

Okay, here comes the clever bit. The DPO paper showed something beautiful: the optimal policy under the RLHF objective has a closed-form solution.

In math terms:

\pi^*(y|x) = \frac{1}{Z(x)} \pi_{\text{ref}}(y|x) \exp\left(\frac{r(x,y)}{\beta}\right)

(1)

Let me translate that into English:

$\pi^*(y|x)$ — The optimal policy: the probability our best possible model assigns to response $y$ given prompt $x$
$\pi_{\text{ref}}(y|x)$ — The reference policy: probability our starting model assigns to the same response
$r(x,y)$ — The reward: how good is response $y$ for prompt $x$ ?
$\beta$ — Temperature parameter: controls how much we trust the reward vs staying close to the reference
$Z(x)$ — Normalization constant: makes probabilities sum to 1 (we can basically ignore this)

The equation says: the optimal policy is the reference policy, adjusted by exponentiating the reward.

We can rearrange this equation to solve for the reward:

r(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{\text{ref}}(y|x)} + \beta \log Z(x)

(2)

Wait. Stop. Look at that.

The reward is just the log-ratio between the optimal policy and the reference policy (plus a constant we can ignore).

This means: if we train a policy to match preferences, we’re implicitly defining a reward model. We don’t need to train a separate reward model at all!

We can directly optimize the policy to prefer better responses over worse ones.

The DPO Loss Function¶

Alright, so how do we actually train with DPO? Here’s the loss function:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma \left( \beta \log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)} \right) \right]

(3)

That looks intimidating, but let’s break it down piece by piece:

$\mathcal{L}_{\text{DPO}}$ — The DPO loss we’re trying to minimize
$\pi_\theta$ — Our policy (the model we’re training), with parameters $\theta$
$\pi_{\text{ref}}$ — The reference model (frozen, not updated)
$x$ — The prompt (input to the model)
$y_w$ — The “winning” response (the one humans preferred)
$y_l$ — The “losing” response (the one humans rejected)
$\beta$ — Temperature parameter (same as before)
$\sigma$ — The sigmoid function: $\sigma(z) = \frac{1}{1 + e^{-z}}$
$\mathbb{E}$ — Expected value (average over all our training examples)

The core idea:

Compute how much more our policy likes the winning response vs the reference model: $\log \frac{\pi_\theta(y_w|x)}{\pi_{\text{ref}}(y_w|x)}$
Compute how much more our policy likes the losing response vs the reference model: $\log \frac{\pi_\theta(y_l|x)}{\pi_{\text{ref}}(y_l|x)}$
Take the difference (we want the first to be bigger than the second)
Pass through sigmoid to get a probability
Take the log and negate it (standard cross-entropy loss trick)

In plain English: we’re training the model so that the log-ratio for the winning response is larger than the log-ratio for the losing response.

The model learns to increase the probability of good responses (relative to the reference) and decrease the probability of bad responses (relative to the reference).

import torch
import torch.nn.functional as F

def compute_dpo_loss(
    policy_chosen_logps: torch.Tensor,
    policy_rejected_logps: torch.Tensor,
    reference_chosen_logps: torch.Tensor,
    reference_rejected_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    """
    Compute DPO loss.
    
    This is the core of DPO training. We take log probabilities from both
    the policy (trainable) and reference (frozen) models, then compute
    a loss that encourages the policy to prefer chosen over rejected responses.
    
    Args:
        policy_chosen_logps: Log probs of chosen responses under policy
        policy_rejected_logps: Log probs of rejected responses under policy
        reference_chosen_logps: Log probs of chosen responses under reference
        reference_rejected_logps: Log probs of rejected responses under reference
        beta: Temperature parameter (controls strength of KL penalty)
    
    Returns:
        DPO loss (scalar)
    """
    # Compute log ratios for chosen and rejected responses
    # These tell us: how much more does the policy like this response vs the reference?
    chosen_logratios = policy_chosen_logps - reference_chosen_logps
    rejected_logratios = policy_rejected_logps - reference_rejected_logps
    
    # The logits for our binary classification:
    # We want chosen_logratios > rejected_logratios
    logits = beta * (chosen_logratios - rejected_logratios)
    
    # Standard binary cross-entropy via log-sigmoid
    # logsigmoid(x) = log(1 / (1 + exp(-x))) = -log(1 + exp(-x))
    loss = -F.logsigmoid(logits).mean()
    
    return loss

# Example: let's create some fake log probabilities
# (In reality these come from actually running the model, but we'll simulate them)
batch_size = 4

# Policy log probs (our trainable model)
# More negative = lower probability (log of a small number)
policy_chosen = torch.tensor([-50.0, -45.0, -55.0, -48.0])
policy_rejected = torch.tensor([-52.0, -48.0, -58.0, -46.0])  # Note: sometimes policy is confused!

# Reference log probs (frozen initial model)
ref_chosen = torch.tensor([-51.0, -46.0, -56.0, -49.0])
ref_rejected = torch.tensor([-51.0, -46.0, -56.0, -49.0])  # Reference assigns similar probs

loss = compute_dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected, beta=0.1)
print(f"DPO Loss: {loss.item():.4f}")
print()

# Let's understand what's happening:
print("Breaking down the loss computation:")
print("=" * 60)
for i in range(batch_size):
    chosen_ratio = policy_chosen[i] - ref_chosen[i]
    rejected_ratio = policy_rejected[i] - ref_rejected[i]
    diff = chosen_ratio - rejected_ratio
    print(f"Example {i+1}:")
    print(f"  Chosen log-ratio:   {chosen_ratio.item():6.2f} (policy vs ref for good response)")
    print(f"  Rejected log-ratio: {rejected_ratio.item():6.2f} (policy vs ref for bad response)")
    print(f"  Difference:         {diff.item():6.2f} ({'✓ good' if diff > 0 else '✗ bad - policy prefers rejected!'})")
    print()

DPO Loss: 0.6262

Breaking down the loss computation:
============================================================
Example 1:
  Chosen log-ratio:     1.00 (policy vs ref for good response)
  Rejected log-ratio:  -1.00 (policy vs ref for bad response)
  Difference:           2.00 (✓ good)

Example 2:
  Chosen log-ratio:     1.00 (policy vs ref for good response)
  Rejected log-ratio:  -2.00 (policy vs ref for bad response)
  Difference:           3.00 (✓ good)

Example 3:
  Chosen log-ratio:     1.00 (policy vs ref for good response)
  Rejected log-ratio:  -2.00 (policy vs ref for bad response)
  Difference:           3.00 (✓ good)

Example 4:
  Chosen log-ratio:     1.00 (policy vs ref for good response)
  Rejected log-ratio:   3.00 (policy vs ref for bad response)
  Difference:          -2.00 (✗ bad - policy prefers rejected!)

How DPO Works: The Big Picture¶

Let me walk you through the training process:

Input: Preference pairs in the format (prompt, chosen_response, rejected_response)

Setup:

Start with your policy model (this is what we’ll train)
Make a frozen copy to use as the reference model (this stays fixed)

Training loop:

For each batch of preference pairs:
- Run the policy model on both chosen and rejected responses → get log probabilities
- Run the reference model on both chosen and rejected responses → get log probabilities
- Compute the DPO loss (encourages policy to prefer chosen over rejected, relative to reference)
- Backpropagate and update the policy model
Repeat until the model learns to prefer better responses

Key insight: The reference model provides an anchor. Without it, the model could just assign probability 1.0 to chosen responses and 0.0 to rejected ones. The reference keeps the policy from drifting too far from the original behavior (this is implicitly a KL divergence penalty).

Here’s a simple diagram:

┌────────────────────────────────────────────────────────────┐
│                      DPO Training                           │
├────────────────────────────────────────────────────────────┤
│                                                            │
│  Input: (prompt, chosen, rejected)                         │
│                                                            │
│  ┌─────────────┐              ┌─────────────┐             │
│  │   Policy    │              │  Reference  │             │
│  │  πθ (train) │              │  πref (frozen)            │
│  └──────┬──────┘              └──────┬──────┘             │
│         │                            │                     │
│    Run on both                  Run on both               │
│  chosen & rejected              chosen & rejected          │
│         │                            │                     │
│    log P(chosen)                log P(chosen)              │
│    log P(rejected)              log P(rejected)            │
│         │                            │                     │
│         └──────────┬─────────────────┘                     │
│                    │                                       │
│         Compute log-ratios                                 │
│    (policy vs reference)                                   │
│                    │                                       │
│         Compare chosen vs rejected                         │
│                    │                                       │
│            DPO Loss                                        │
│         (want chosen > rejected)                           │
│                    │                                       │
│           Backprop & update                                │
│            policy only                                     │
│                                                            │
└────────────────────────────────────────────────────────────┘

The Implicit Reward Model¶

Here’s something cool: even though we don’t train a separate reward model, DPO implicitly defines one.

Remember that mathematical rearrangement from earlier? We can extract an implicit reward at any time:

r(x, y) = \beta \log \frac{\pi_\theta(y|x)}{\pi_{\text{ref}}(y|x)}

(4)

In English: the reward for a response is just the log-ratio between how much the policy likes it vs how much the reference likes it, scaled by $\beta$ .

This means:

If the policy assigns higher probability than the reference → positive reward
If the policy assigns lower probability than the reference → negative reward
If they assign the same probability → zero reward

The reward model is baked into the policy itself. No need for a separate model!

def compute_implicit_reward(
    policy_logps: torch.Tensor,
    reference_logps: torch.Tensor,
    beta: float = 0.1
) -> torch.Tensor:
    """
    Compute the implicit reward under DPO.
    
    This extracts the "reward" that DPO is implicitly optimizing.
    Even though we never train a separate reward model, we can
    compute what the reward would be for any response.
    
    Args:
        policy_logps: Log probabilities under the policy
        reference_logps: Log probabilities under the reference
        beta: Temperature parameter
    
    Returns:
        Implicit rewards
    """
    return beta * (policy_logps - reference_logps)

# Let's compute the implicit rewards for our earlier examples
print("Implicit rewards for chosen responses:")
print("=" * 60)
implicit_reward_chosen = compute_implicit_reward(policy_chosen, ref_chosen, beta=0.1)
for i in range(batch_size):
    print(f"Example {i+1}: {implicit_reward_chosen[i].item():+.3f}")
    if implicit_reward_chosen[i] > 0:
        print(f"  → Policy likes this MORE than reference (good!)")
    elif implicit_reward_chosen[i] < 0:
        print(f"  → Policy likes this LESS than reference (needs more training)")
    else:
        print(f"  → Policy and reference agree")

print()
print("Implicit rewards for rejected responses:")
print("=" * 60)
implicit_reward_rejected = compute_implicit_reward(policy_rejected, ref_rejected, beta=0.1)
for i in range(batch_size):
    print(f"Example {i+1}: {implicit_reward_rejected[i].item():+.3f}")
    if implicit_reward_rejected[i] < 0:
        print(f"  → Policy likes this LESS than reference (good!)")
    elif implicit_reward_rejected[i] > 0:
        print(f"  → Policy likes this MORE than reference (bad - still learning)")
    else:
        print(f"  → Policy and reference agree")

print()
print("Key insight: We want chosen rewards > rejected rewards!")
print(f"Average chosen reward:   {implicit_reward_chosen.mean().item():+.3f}")
print(f"Average rejected reward: {implicit_reward_rejected.mean().item():+.3f}")

Implicit rewards for chosen responses:
============================================================
Example 1: +0.100
  → Policy likes this MORE than reference (good!)
Example 2: +0.100
  → Policy likes this MORE than reference (good!)
Example 3: +0.100
  → Policy likes this MORE than reference (good!)
Example 4: +0.100
  → Policy likes this MORE than reference (good!)

Implicit rewards for rejected responses:
============================================================
Example 1: -0.100
  → Policy likes this LESS than reference (good!)
Example 2: -0.200
  → Policy likes this LESS than reference (good!)
Example 3: -0.200
  → Policy likes this LESS than reference (good!)
Example 4: +0.300
  → Policy likes this MORE than reference (bad - still learning)

Key insight: We want chosen rewards > rejected rewards!
Average chosen reward:   +0.100
Average rejected reward: -0.050

When Should You Use DPO vs RLHF?¶

This is the practical question, right? Here’s my take:

Use DPO when:

You want simplicity — Fewer moving parts, easier to debug, less can go wrong
Memory is tight — You’re training a large model and can’t afford 4 copies in VRAM
You value stability — You don’t want to deal with RL training dynamics
You have good preference data — DPO is only as good as your (prompt, chosen, rejected) pairs
You’re just getting started — DPO is easier to understand and implement

Use RLHF when:

You need to iterate on rewards — Sometimes you want to tweak the reward model without retraining everything
You have a lot of unlabeled prompts — RLHF can generate responses and learn from them (online RL)
Maximum control — You want fine-grained control over the reward function
You’re already using it — If RLHF is working for you, no need to switch!

Honestly? For most people, most of the time, DPO is the better choice. It’s simpler, it’s more stable, and it gets you 90% of the way there with 50% of the complexity.

(That said, the big AI labs still use RLHF variants for their flagship models. They have the resources to handle the complexity and want maximum control. You probably don’t need that.)

What’s Next?¶

We’ve covered the high-level ideas behind DPO. In the following notebooks, we’ll dive deeper:

DPO vs RLHF — A detailed comparison of both approaches, mathematically and practically
DPO Loss — Deep dive into the loss function, its gradients, and what it’s actually optimizing
DPO Training — Complete implementation: loading models, preparing data, training loop, evaluation

By the end, you’ll understand not just what DPO is, but why it works and how to use it.

Let’s go!