Training Reward Models

Alright, we’ve got our preference data. We’ve got our model architecture.

Now comes the fun part: training.

How do you actually teach a neural network to predict which response humans will prefer? Turns out, there’s some beautiful math behind it (and once you understand it, you’ll wonder how it ever seemed complicated).

The Ranking Loss (And Why It Works)

When training reward models, we’re not trying to predict a specific number. We’re trying to predict rankings.

Think about it. When humans judge responses, they don’t say “this response deserves exactly 7.3 points out of 10.” They say “this one is better than that one.” Rankings are natural. Absolute scores? Not so much.

So our loss function needs to reflect that. Enter the Bradley-Terry ranking loss:

$$
\mathcal{L}_{\text{RM}} = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim D}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]
$$

Okay, I know. Math notation can be intimidating. Let’s break this down piece by piece:

  • $x$ = the prompt (the question or instruction)

  • $y_w$ = the winner response (the one humans preferred)

  • $y_l$ = the loser response (the one humans rejected)

  • $r_\theta(x, y)$ = our reward model's score for response $y$ given prompt $x$

  • $\sigma(z) = \frac{1}{1 + e^{-z}}$ = the sigmoid function (squashes any number to between 0 and 1)

What this formula actually means:

We want the probability that $y_w$ ranks higher than $y_l$ to be as close to 1 as possible. That probability is $\sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$ — the sigmoid of the difference in rewards.

If the winner’s reward is much higher than the loser’s, the difference is large and positive, sigmoid returns something close to 1, and our loss is low. Good!

If the loser’s reward is somehow higher (model got it backwards), the difference is negative, sigmoid returns something close to 0, and our loss shoots up. Bad! The model needs to learn.

Why sigmoid? It converts reward differences into probabilities. A difference of 0 → 50% chance. Large positive difference → near 100% chance. Large negative difference → near 0% chance.

Why negative log? Because we’re minimizing loss. We want to maximize the probability, which means minimizing the negative log of the probability. (Classic machine learning trick.)

The beauty is that this loss function doesn’t care about the absolute values of the rewards. Only their relative ordering. Perfect for our task.
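
To make the numbers concrete, here is a quick sketch that applies the formula above to a few hypothetical reward differences, mapping each one to a Bradley-Terry probability and a loss value:

import torch
import torch.nn.functional as F

# A few hypothetical reward differences r(x, y_w) - r(x, y_l)
diffs = torch.tensor([3.0, 1.0, 0.0, -0.5])

probs = torch.sigmoid(diffs)     # P(winner ranked above loser)
losses = F.softplus(-diffs)      # -log(sigmoid(diff)), computed stably

for d, p, l in zip(diffs.tolist(), probs.tolist(), losses.tolist()):
    print(f"diff={d:+.1f} -> P(y_w > y_l)={p:.3f}, loss={l:.4f}")

A difference of 0 gives a probability of exactly 0.5 and a loss of ln 2 ≈ 0.693; the more confidently correct the ranking, the closer the loss gets to zero.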

import torch
import torch.nn.functional as F

def compute_ranking_loss(
    chosen_rewards: torch.Tensor,
    rejected_rewards: torch.Tensor,
    margin: float = 0.0
) -> torch.Tensor:
    """
    Compute ranking loss for reward model training.
    
    This is the Bradley-Terry ranking loss: -log(sigmoid(r_chosen - r_rejected))
    
    The goal: make chosen_rewards > rejected_rewards by a comfortable margin.
    
    Args:
        chosen_rewards: Rewards for chosen responses, shape (batch_size,)
        rejected_rewards: Rewards for rejected responses, shape (batch_size,)
        margin: Optional margin to enforce minimum difference (usually 0)
    
    Returns:
        Ranking loss (scalar)
    """
    # Compute the difference: chosen should be higher than rejected
    # If margin > 0, we require chosen to be higher by at least margin
    logits = chosen_rewards - rejected_rewards - margin
    
    # Apply log-sigmoid for numerical stability
    # -log(sigmoid(x)) = log(1 + exp(-x)) = softplus(-x)
    # (PyTorch's softplus is more numerically stable than manually computing log(sigmoid))
    loss = F.softplus(-logits)
    
    return loss.mean()

# Let's see it in action!
print("Example: Computing ranking loss")
print("=" * 50)

# Create a batch of 4 examples
batch_size = 4
chosen_rewards = torch.tensor([2.0, 1.5, 3.0, 0.5])
rejected_rewards = torch.tensor([1.0, 1.0, 2.0, 1.0])

print("\nChosen rewards:  ", chosen_rewards.tolist())
print("Rejected rewards:", rejected_rewards.tolist())
print("\nDifferences:     ", (chosen_rewards - rejected_rewards).tolist())

loss = compute_ranking_loss(chosen_rewards, rejected_rewards)
print(f"\nRanking loss: {loss.item():.4f}")

# Accuracy: how often is chosen > rejected?
accuracy = (chosen_rewards > rejected_rewards).float().mean()
print(f"Accuracy: {accuracy.item():.2%}")

print("\nNotice:")
print("  - Examples 1-3: chosen > rejected → contributes low loss")
print("  - Example 4: chosen < rejected → contributes high loss (model wrong!)")
print("  - Overall accuracy is 75% (3 out of 4 correct)")

# Let's look at the individual losses
individual_losses = F.softplus(-(chosen_rewards - rejected_rewards))
print("\nIndividual losses per example:")
for i, (loss_val, diff) in enumerate(zip(individual_losses, chosen_rewards - rejected_rewards)):
    print(f"  Example {i+1}: diff={diff:+.1f} → loss={loss_val:.4f}")
Example: Computing ranking loss
==================================================

Chosen rewards:   [2.0, 1.5, 3.0, 0.5]
Rejected rewards: [1.0, 1.0, 2.0, 1.0]

Differences:      [1.0, 0.5, 1.0, -0.5]

Ranking loss: 0.5187
Accuracy: 75.00%

Notice:
  - Examples 1-3: chosen > rejected → contributes low loss
  - Example 4: chosen < rejected → contributes high loss (model wrong!)
  - Overall accuracy is 75% (3 out of 4 correct)

Individual losses per example:
  Example 1: diff=+1.0 → loss=0.3133
  Example 2: diff=+0.5 → loss=0.4741
  Example 3: diff=+1.0 → loss=0.3133
  Example 4: diff=-0.5 → loss=0.9741

Training Metrics: What to Watch

When training a reward model, you need to track a few key metrics. Think of them as your dashboard while driving — they tell you if you’re on the right track or about to drive off a cliff.

| Metric | What It Measures | What You Want to See |
| --- | --- | --- |
| Loss | How wrong the model's rankings are | Decreasing over time |
| Accuracy | % of pairs ranked correctly | > 70% (ideally 80%+) |
| Mean Margin | Average difference between chosen and rejected rewards | Positive and increasing |

Loss is your primary signal. Lower is better. If it’s not decreasing, something’s wrong.

Accuracy is more interpretable. If your model ranks chosen responses higher than rejected responses 80% of the time, that’s pretty good! (Humans don’t even agree 100% of the time.)

Mean Margin tells you how confident the model is. A margin of +0.1 means the model barely prefers the chosen response. A margin of +3.0 means it really prefers it. You want this to grow during training.

def compute_ranking_loss_with_metrics(
    chosen_rewards: torch.Tensor,
    rejected_rewards: torch.Tensor,
    margin: float = 0.0
) -> dict:
    """
    Compute ranking loss AND all the useful metrics you want to track.
    
    This is what you'd call during training to get a full picture of how
    your model is performing.
    """
    loss = compute_ranking_loss(chosen_rewards, rejected_rewards, margin)
    
    # Accuracy: how often does model rank chosen higher?
    accuracy = (chosen_rewards > rejected_rewards).float().mean()
    
    # Mean rewards (useful for debugging)
    mean_chosen = chosen_rewards.mean()
    mean_rejected = rejected_rewards.mean()
    
    # Mean margin: how much higher is chosen vs rejected on average?
    mean_margin = (chosen_rewards - rejected_rewards).mean()
    
    return {
        "loss": loss,
        "accuracy": accuracy,
        "mean_chosen_reward": mean_chosen,
        "mean_rejected_reward": mean_rejected,
        "mean_margin": mean_margin,
    }

# Let's use the same example from before
metrics = compute_ranking_loss_with_metrics(chosen_rewards, rejected_rewards)

print("Training metrics for our example batch:")
print("=" * 50)
for k, v in metrics.items():
    if k == "loss":
        print(f"  {k}: {v.item():.4f} (lower is better)")
    elif k == "accuracy":
        print(f"  {k}: {v.item():.2%} (higher is better)")
    elif k == "mean_margin":
        print(f"  {k}: {v.item():.4f} (positive and growing is good)")
    else:
        print(f"  {k}: {v.item():.4f}")

print("\nInterpretation:")
print("  - Accuracy of 75% is decent (3 out of 4 correct)")
print("  - Mean margin of +0.5 means chosen responses score 0.5 higher on average")
print("  - During training, you'd watch these metrics improve over time")
Training metrics for our example batch:
==================================================
  loss: 0.5187 (lower is better)
  accuracy: 75.00% (higher is better)
  mean_chosen_reward: 1.7500
  mean_rejected_reward: 1.2500
  mean_margin: 0.5000 (positive and growing is good)

Interpretation:
  - Accuracy of 75% is decent (3 out of 4 correct)
  - Mean margin of +0.5 means chosen responses score 0.5 higher on average
  - During training, you'd watch these metrics improve over time

Putting It All Together: The Training Loop

Okay, we’ve got our loss function. We’ve got our metrics. Now let’s talk about the actual training loop.

At a high level, training a reward model looks like this:

  1. Load a batch of (prompt, chosen, rejected) triples from your dataset

  2. Forward pass the chosen responses through the model → get chosen rewards

  3. Forward pass the rejected responses through the model → get rejected rewards

  4. Compute the ranking loss (and metrics for monitoring)

  5. Backpropagate the loss through the model

  6. Update the weights with your optimizer

  7. Repeat until the model learns to rank preferences like a human

It’s the same training loop you’ve seen before (if you’ve trained any neural network). The only special part is step 4 — using ranking loss instead of, say, cross-entropy.

Let’s build it.

import torch.nn as nn
from torch.utils.data import DataLoader
from transformers import AutoModel, AutoTokenizer, get_linear_schedule_with_warmup
from tqdm.auto import tqdm

class RewardModel(nn.Module):
    """
    A reward model for predicting human preferences.
    
    Architecture is simple:
    - Take a base language model (GPT-2, Llama, whatever)
    - Add a "value head" on top that projects to a single scalar
    - That scalar is the reward
    
    The base model processes the text and extracts meaning.
    The value head says "based on this meaning, how good is this response?"
    """
    
    def __init__(self, base_model, hidden_size):
        super().__init__()
        self.base_model = base_model
        
        # The value head: dropout for regularization, then linear layer to scalar
        self.value_head = nn.Sequential(
            nn.Dropout(0.1),  # 10% dropout to prevent overfitting
            nn.Linear(hidden_size, 1)  # hidden_size → 1 scalar reward
        )
    
    def get_rewards(self, input_ids, attention_mask):
        """
        Compute reward scores for input sequences.
        
        Args:
            input_ids: Token IDs, shape (batch_size, seq_len)
            attention_mask: Attention mask, shape (batch_size, seq_len)
        
        Returns:
            Reward scores, shape (batch_size,)
        """
        # Run the base model to get hidden states
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        hidden_states = outputs.last_hidden_state  # (batch_size, seq_len, hidden_size)
        
        # Get the LAST non-padding token's hidden state for each sequence
        # (This is where the model has seen the entire response)
        seq_lengths = attention_mask.sum(dim=1) - 1  # -1 because of 0-indexing
        batch_size = hidden_states.shape[0]
        last_hidden = hidden_states[
            torch.arange(batch_size, device=hidden_states.device),
            seq_lengths.long()
        ]
        
        # Project to scalar and squeeze out the last dimension
        return self.value_head(last_hidden).squeeze(-1)

# Let's create a reward model and see it in action
print("Creating a reward model from GPT-2...")
print("=" * 50)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Load base model and tokenizer
base_model = AutoModel.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

# Create reward model
reward_model = RewardModel(base_model, hidden_size=768)  # GPT-2 has 768 hidden dims
reward_model.to(device)

print(f"\nReward model created!")
print(f"  Device: {device}")
print(f"  Base model: GPT-2 (124M parameters)")
print(f"  Hidden size: 768")
print(f"  Value head: 768 → 1 (just 769 parameters!)")

# Test it with a forward pass
test_texts = [
    "This is a helpful and informative response that answers the question clearly.",
    "I don't know, just Google it yourself."
]

print(f"\nTesting with example responses...")
inputs = tokenizer(test_texts, return_tensors="pt", padding=True, truncation=True)
inputs = {k: v.to(device) for k, v in inputs.items()}

with torch.no_grad():
    rewards = reward_model.get_rewards(inputs['input_ids'], inputs['attention_mask'])

print(f"\nTest rewards (before training, so basically random):")
for text, reward in zip(test_texts, rewards):
    print(f"  [{reward.item():+.4f}] \"{text[:60]}...\"")
    
print("\nNote: Rewards are random right now because the model is untrained.")
print("After training, the first response should get a higher reward!")
Creating a reward model from GPT-2...
==================================================

Reward model created!
  Device: cuda
  Base model: GPT-2 (124M parameters)
  Hidden size: 768
  Value head: 768 → 1 (just 769 parameters!)

Testing with example responses...

Test rewards (before training, so basically random):
  [-2.3981] "This is a helpful and informative response that answers the ..."
  [-2.3173] "I don't know, just Google it yourself...."

Note: Rewards are random right now because the model is untrained.
After training, the first response should get a higher reward!
def train_reward_model(model, train_loader, eval_loader, config, device):
    """
    Complete reward model training loop.
    
    For each (prompt, chosen, rejected) triple in the dataset:
    - Compute rewards for both chosen and rejected
    - Calculate ranking loss: we want chosen > rejected
    - Backprop and update weights
    
    Args:
        model: RewardModel instance
        train_loader: DataLoader for training data
        eval_loader: DataLoader for evaluation (can be None)
        config: Training configuration dict
        device: torch device
    
    Returns:
        Trained model
    """
    
    # AdamW optimizer with weight decay (helps prevent overfitting)
    optimizer = torch.optim.AdamW(
        model.parameters(),
        lr=config['learning_rate'],
        weight_decay=0.01
    )
    
    # Learning rate scheduler: linear warmup then decay
    total_steps = len(train_loader) * config['num_epochs']
    scheduler = get_linear_schedule_with_warmup(
        optimizer,
        num_warmup_steps=config['warmup_steps'],
        num_training_steps=total_steps
    )
    
    model.train()
    # Evaluation on eval_loader is omitted here for brevity; in practice you'd
    # evaluate after each epoch and keep the checkpoint with the best accuracy.
    
    for epoch in range(config['num_epochs']):
        epoch_metrics = {'loss': 0, 'accuracy': 0}
        
        progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}")
        
        for batch in progress_bar:
            # Move batch to device
            batch = {k: v.to(device) for k, v in batch.items()}
            
            # Forward pass for chosen responses
            chosen_rewards = model.get_rewards(
                batch['chosen_input_ids'],
                batch['chosen_attention_mask']
            )
            
            # Forward pass for rejected responses
            rejected_rewards = model.get_rewards(
                batch['rejected_input_ids'],
                batch['rejected_attention_mask']
            )
            
            # Compute loss and metrics
            metrics = compute_ranking_loss_with_metrics(
                chosen_rewards, rejected_rewards
            )
            loss = metrics['loss']
            
            # Backward pass
            optimizer.zero_grad()
            loss.backward()
            
            # Gradient clipping (prevents exploding gradients)
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            
            optimizer.step()
            scheduler.step()
            
            # Update metrics
            epoch_metrics['loss'] += loss.item()
            epoch_metrics['accuracy'] += metrics['accuracy'].item()
            
            progress_bar.set_postfix({
                'loss': f"{loss.item():.4f}",
                'acc': f"{metrics['accuracy'].item():.2%}"
            })
        
        # End of epoch
        avg_loss = epoch_metrics['loss'] / len(train_loader)
        avg_acc = epoch_metrics['accuracy'] / len(train_loader)
        print(f"Epoch {epoch+1} - Loss: {avg_loss:.4f}, Accuracy: {avg_acc:.2%}")
    
    return model

# Let's demonstrate what happens during a training step
print("Demonstrating a single training step")
print("=" * 50)

# Put the model in training mode
reward_model.train()
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-5)

# Create a synthetic training example
# Chosen: a helpful, informative response
# Rejected: a dismissive, unhelpful response
batch_chosen = tokenizer(
    ["The answer is 42. This is the result from Douglas Adams' novel 'The Hitchhiker's Guide to the Galaxy', where a supercomputer calculated it as the answer to the ultimate question of life, the universe, and everything."],
    return_tensors="pt", padding=True, truncation=True, max_length=64
)
batch_rejected = tokenizer(
    ["I don't know. Just look it up yourself."],
    return_tensors="pt", padding=True, truncation=True, max_length=64
)

batch_chosen = {k: v.to(device) for k, v in batch_chosen.items()}
batch_rejected = {k: v.to(device) for k, v in batch_rejected.items()}

# BEFORE training step
with torch.no_grad():
    chosen_r_before = reward_model.get_rewards(batch_chosen['input_ids'], batch_chosen['attention_mask'])
    rejected_r_before = reward_model.get_rewards(batch_rejected['input_ids'], batch_rejected['attention_mask'])

print("\nBefore training step:")
print(f"  Chosen reward:   {chosen_r_before.item():+.4f}")
print(f"  Rejected reward: {rejected_r_before.item():+.4f}")
print(f"  Margin:          {(chosen_r_before - rejected_r_before).item():+.4f}")

# Compute loss and backprop
chosen_r = reward_model.get_rewards(batch_chosen['input_ids'], batch_chosen['attention_mask'])
rejected_r = reward_model.get_rewards(batch_rejected['input_ids'], batch_rejected['attention_mask'])
loss = compute_ranking_loss(chosen_r, rejected_r)

optimizer.zero_grad()
loss.backward()
optimizer.step()

# AFTER training step
with torch.no_grad():
    chosen_r_after = reward_model.get_rewards(batch_chosen['input_ids'], batch_chosen['attention_mask'])
    rejected_r_after = reward_model.get_rewards(batch_rejected['input_ids'], batch_rejected['attention_mask'])

print(f"\nAfter training step:")
print(f"  Chosen reward:   {chosen_r_after.item():+.4f}")
print(f"  Rejected reward: {rejected_r_after.item():+.4f}")
print(f"  Margin:          {(chosen_r_after - rejected_r_after).item():+.4f}")
print(f"  Loss:            {loss.item():.4f}")

print("\nSee what happened?")
print("  - Chosen reward increased (model likes the good response more)")
print("  - Rejected reward decreased (model likes the bad response less)")
print("  - Margin increased (model is more confident in its ranking)")
print("\nThat's learning in action!")
Demonstrating a single training step
==================================================

Before training step:
  Chosen reward:   -3.1820
  Rejected reward: +1.9091
  Margin:          -5.0911

After training step:
  Chosen reward:   -1.3308
  Rejected reward: -4.9745
  Margin:          +3.6437
  Loss:            6.3500

See what happened?
  - Chosen reward increased (model likes the good response more)
  - Rejected reward decreased (model likes the bad response less)
  - Margin increased (model is more confident in its ranking)

That's learning in action!

Hyperparameters: The Goldilocks Problem

Training reward models requires careful tuning. Too aggressive and you overfit. Too conservative and you don’t learn anything. What works in practice:

| Parameter | Typical Value | Why This Matters |
| --- | --- | --- |
| Learning rate | 1e-5 | Much lower than SFT! Reward training is delicate. |
| Batch size | 4 | Small batches (each sample has 2 sequences). Memory is tight. |
| Epochs | 1 | Usually just ONE pass through the data. Overfitting is a real danger. |
| Gradient accumulation | 4 | Effective batch size = 16. Poor man's larger batch. |
| Warmup steps | 100 | Gradually increase learning rate at the start. Prevents early chaos. |
| Gradient clipping | 1.0 | Cap gradients to prevent explosions. Safety first. |

Why such a low learning rate? Because we’re fine-tuning a pre-trained model. The base model already knows language. We just need to nudge it slightly to predict preferences. Big updates would destroy that knowledge.

Why only 1 epoch? Preference data is often small (thousands of examples, not millions). With a small dataset, you’ll overfit if you train too long. The model will memorize the training examples instead of learning general principles.

Why gradient accumulation? Memory constraints. Each training example has TWO full sequences (chosen and rejected). If you try to fit 16 pairs in memory at once, you’ll run out of VRAM. So we accumulate gradients over 4 batches of 4, then update.
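
In code, gradient accumulation is a small change to the inner training loop shown earlier. Here is a minimal sketch, reusing the model, train_loader, optimizer, and compute_ranking_loss names from above and matching the gradient_accumulation_steps value in the configuration below:

accumulation_steps = 4  # effective batch size = 4 small batches x 4 pairs = 16 pairs

optimizer.zero_grad()
for step, batch in enumerate(train_loader):
    batch = {k: v.to(device) for k, v in batch.items()}

    chosen_r = model.get_rewards(batch['chosen_input_ids'], batch['chosen_attention_mask'])
    rejected_r = model.get_rewards(batch['rejected_input_ids'], batch['rejected_attention_mask'])

    # Scale the loss so the accumulated gradient matches one big batch
    loss = compute_ranking_loss(chosen_r, rejected_r) / accumulation_steps
    loss.backward()  # gradients add up across the small batches

    if (step + 1) % accumulation_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()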

# Here's a typical training configuration
config = {
    'learning_rate': 1e-5,
    'batch_size': 4,
    'num_epochs': 1,
    'warmup_steps': 100,
    'gradient_accumulation_steps': 4,
    'max_grad_norm': 1.0,
}

print("Reward Model Training Configuration")
print("=" * 50)
for k, v in config.items():
    print(f"  {k:30s} = {v}")

print("\nEffective batch size:", config['batch_size'] * config['gradient_accumulation_steps'])
print("\nThis configuration is conservative but reliable.")
print("It works for most reward modeling tasks without much tuning.")
Reward Model Training Configuration
==================================================
  learning_rate                  = 1e-05
  batch_size                     = 4
  num_epochs                     = 1
  warmup_steps                   = 100
  gradient_accumulation_steps    = 4
  max_grad_norm                  = 1.0

Effective batch size: 16

This configuration is conservative but reliable.
It works for most reward modeling tasks without much tuning.

Common Training Issues (And How to Fix Them)

Training reward models can be tricky. Here are the issues you’ll run into (and I mean will, not might), and what to do about them:

1. Low Accuracy (< 60%)

Symptoms: Model barely better than random guessing. Training accuracy stuck at 50-60%.

What’s happening: The model isn’t learning the preferences. Could be bad data, too high learning rate, or the task is genuinely hard.

Fixes:

  • Lower the learning rate (try 5e-6 instead of 1e-5)

  • Check your data quality — are the preferences clear? Would you agree with them? (A quick spot-check sketch follows this list.)

  • Train for a bit longer (but watch for overfitting)

  • Try a larger base model (more capacity to learn subtle patterns)
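
For the data-quality check, nothing beats reading a handful of pairs yourself. A minimal sketch, assuming your preference data is a list of dicts with 'prompt', 'chosen', and 'rejected' keys (adjust the field names to whatever your dataset actually uses):

import random

def spot_check_preferences(dataset, n=5, seed=0):
    """Print a few random preference pairs so a human can sanity-check the labels."""
    rng = random.Random(seed)
    for example in rng.sample(list(dataset), k=min(n, len(dataset))):
        print("PROMPT:  ", example['prompt'][:200])
        print("CHOSEN:  ", example['chosen'][:200])
        print("REJECTED:", example['rejected'][:200])
        print("-" * 60)

# spot_check_preferences(train_dataset)  # would you have picked the same winners?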

2. Overfitting

Symptoms: Training accuracy looks great (90%+) but evaluation accuracy is much lower (60-70%). Classic overfitting.

What’s happening: Model is memorizing the training examples instead of learning general principles.

Fixes:

  • Use only 1 epoch (or even less — 50% of one epoch)

  • Increase dropout in the value head (try 0.2 or 0.3 instead of 0.1)

  • Get more training data (if possible)

  • Freeze the base model and only train the value head (see the sketch after this list)

  • Add more regularization (higher weight decay)
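
Freezing the base model is a one-line change with the RewardModel class from earlier. A minimal sketch: only the 769-parameter value head stays trainable, which drastically limits how much the model can memorize:

# Freeze every base-model parameter; only the value head receives gradients
for param in reward_model.base_model.parameters():
    param.requires_grad = False

# Rebuild the optimizer over the trainable parameters only (the value head)
optimizer = torch.optim.AdamW(
    [p for p in reward_model.parameters() if p.requires_grad],
    lr=1e-5,
)

trainable = sum(p.numel() for p in reward_model.parameters() if p.requires_grad)
print(f"Trainable parameters: {trainable}")  # 769 for the GPT-2 setup above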

3. Training Instability

Symptoms: Loss goes down, then suddenly spikes. NaN values. Model crashes.

What’s happening: Gradients are exploding. The model is making updates that are too large.

Fixes:

  • Lower the learning rate (always the first thing to try; a config sketch follows this list)

  • Clip gradients more aggressively (lower the max norm, e.g. 0.5 instead of 1.0)

  • Use more warmup steps (try 200-500)

  • Check for bad data (extremely long sequences, weird characters, etc.)
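
Most of these fixes boil down to config changes. As a hedged example, a more stability-oriented variant of the configuration from earlier might look like this:

stable_config = {
    'learning_rate': 5e-6,               # lower LR is the first thing to try
    'batch_size': 4,
    'num_epochs': 1,
    'warmup_steps': 300,                 # longer warmup to avoid early spikes
    'gradient_accumulation_steps': 4,
    'max_grad_norm': 0.5,                # clip harder than the default 1.0
}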

The most common issue? Overfitting. It’s the reward modeling nemesis. Be conservative with training time.

What We’ve Learned

Let’s recap. Training a reward model means:

  1. Using ranking loss to teach the model that chosen > rejected

  2. Watching key metrics (accuracy, margin) to ensure learning is happening

  3. Being conservative with hyperparameters to avoid overfitting

  4. Troubleshooting when things inevitably go wrong

The math might look fancy, but it’s actually quite elegant. We’re just teaching a neural network to make comparisons. “This response is better than that one.” Over and over, thousands of times, until the model internalizes what humans mean by “better.”

Next up: evaluating reward models. Because training is only half the battle — you need to know if your reward model is actually any good.