Introduction to Reward Models

What is a Reward Model?

Your SFT model can follow instructions. But it doesn’t know which of its responses are actually good.

After Supervised Fine-Tuning, the model learned to mimic the training examples. But what if there are multiple valid responses? What if some are merely helpful while others are exceptionally helpful? The model has no way to distinguish them.

Reward models solve this.

A reward model is a neural network that predicts human preferences. You give it a prompt and a response, and it outputs a number (a “reward score”) that indicates how good that response is.

The math:

$$r_\theta(x, y) \rightarrow \mathbb{R}$$

Where:

  • $x$ is the prompt (the user’s question or instruction)

  • $y$ is the response (what the model generated)

  • $r_\theta$ is the reward model (parameterized by weights $\theta$)

  • $\rightarrow \mathbb{R}$ means it outputs a real number (the reward score)

Higher score = better response.
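To make that signature concrete, here is a minimal sketch. `toy_reward` is a made-up stand-in (not the neural network we build later in this notebook); it only exists to show the shape of the interface: a (prompt, response) pair goes in, one real number comes out.

# Illustrative stand-in only: a real reward model is a neural network
# (built later in this notebook), not a hand-written rule.

def toy_reward(prompt: str, response: str) -> float:
    """r_theta(x, y) -> R: map a (prompt, response) pair to one scalar score."""
    # Fake scoring rule purely for demonstration: pretend longer answers are better.
    return float(len(response.split()))

print(toy_reward("What is 2+2?", "The answer is 4."))  # 4.0 (one real number)
print(toy_reward("What is 2+2?", "I don't know."))     # 3.0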

Why Do We Need Reward Models?

Here’s a thought experiment. Rate this essay on a scale of 1 to 10.

Hard. Is it a 7? Maybe an 8? What’s the difference between a 7 and an 8 anyway?

Now: I give you two essays and ask which one is better.

Much easier. You can just compare them directly.

Humans are better at comparisons than absolute ratings. This insight is the foundation of reward modeling.

After SFT, your model can follow instructions. But it doesn’t know:

  • Which of two valid responses is better

  • How to balance competing objectives (should I be maximally helpful, or play it safe?)

  • What makes a response exceptional vs just acceptable

We teach the model human preferences by showing it comparisons:

  • “This response is better than that one”

  • “This response is better than that one”

  • “This response is better than that one”

The reward model learns from these comparisons. Eventually it can predict, for any prompt and response, how much a human would like it.

Then we use that reward model to further train the base model. But that’s RLHF, which comes later.

First: building the reward model itself.

The Bradley-Terry Model

We need a way to convert reward scores into preference probabilities. Enter the Bradley-Terry model.

This is a classic model from the 1950s developed by Ralph Bradley and Milton Terry for ranking things when you only have pairwise comparisons. Think chess rankings, or comparing sports teams.

The idea: if I show you two responses (call them “winner” and “loser” based on human preference), the probability that a human prefers the winner is:

$$P(y_w \succ y_l | x) = \sigma(r_\theta(x, y_w) - r_\theta(x, y_l))$$

Breaking this down:

  • $P(...)$ means “probability of...”

  • $y_w$ is the winning response (the one humans preferred)

  • $y_l$ is the losing response (the one humans rejected)

  • $y_w \succ y_l$ reads as “$y_w$ is preferred to $y_l$”

  • $| x$ means “given prompt $x$”

  • $\sigma$ is the sigmoid function

  • $r_\theta(x, y_w)$ is the reward score for the winning response

  • $r_\theta(x, y_l)$ is the reward score for the losing response

So: “The probability that the winner is preferred over the loser equals the sigmoid of the difference in their reward scores.”

Why sigmoid? Because we need to convert a difference (which could be any real number) into a probability (which must be between 0 and 1).

How it works:

  • If the reward difference is large and positive (winner scored much higher), sigmoid outputs close to 1.0

  • If the reward difference is zero (both scored the same), sigmoid outputs 0.5 (50-50)

  • If the reward difference is large and negative (the loser scored higher, meaning the model got it wrong), sigmoid outputs close to 0.0

import torch
import torch.nn as nn
import matplotlib.pyplot as plt
import numpy as np

# Visualize the Bradley-Terry model
reward_diff = np.linspace(-5, 5, 100)
prob_prefer_chosen = 1 / (1 + np.exp(-reward_diff))  # This is the sigmoid function

plt.figure(figsize=(10, 6))
plt.plot(reward_diff, prob_prefer_chosen, linewidth=2.5, color='#2E86AB')
plt.axhline(y=0.5, color='gray', linestyle='--', alpha=0.5, label='50-50 preference')
plt.axvline(x=0, color='gray', linestyle='--', alpha=0.5, label='Equal rewards')
plt.xlabel('Reward Difference (r_chosen - r_rejected)', fontsize=12)
plt.ylabel('Probability of Preferring Chosen Response', fontsize=12)
plt.title('The Bradley-Terry Model: How Reward Differences → Preference Probabilities', fontsize=14, pad=20)
plt.grid(True, alpha=0.3)
plt.legend()
plt.tight_layout()
plt.show()

print("Understanding the curve:")
print("=" * 60)
print(f"When reward difference = 0:  P(prefer chosen) = {1/(1+np.exp(0)):.2f}")
print("  → Both responses equally good, 50-50")
print()
print(f"When reward difference = +2: P(prefer chosen) = {1/(1+np.exp(-2)):.2f}")  
print("  → Chosen response scored 2 points higher, 88% confident")
print()
print(f"When reward difference = -2: P(prefer chosen) = {1/(1+np.exp(2)):.2f}")
print("  → Chosen response scored LOWER. Only 12% confident.")
print("  → The reward model is making a mistake here.")
print()
print("The sigmoid squashes any difference into a probability.")
[Figure: sigmoid curve showing how the reward difference (r_chosen − r_rejected) maps to the probability of preferring the chosen response]
Understanding the curve:
============================================================
When reward difference = 0:  P(prefer chosen) = 0.50
  → Both responses equally good, 50-50

When reward difference = +2: P(prefer chosen) = 0.88
  → Chosen response scored 2 points higher, 88% confident

When reward difference = -2: P(prefer chosen) = 0.12
  → Chosen response scored LOWER. Only 12% confident.
  → The reward model is making a mistake here.

The sigmoid squashes any difference into a probability.

Reward Model Architecture

How do we actually build this thing?

A reward model is a language model with one addition: a value head.

Here’s the architecture:

Input: [prompt] [response]  ← Concatenate these together
       ↓
┌─────────────────────────┐
│   Language Model        │  ← Start with a pre-trained model
│   (GPT, LLaMA, etc.)    │     (often the same one you used for SFT)
└───────────┬─────────────┘
            │
    Get the hidden state of the last token
    (this vector "summarizes" the whole sequence)
            │
            ↓
┌─────────────────────────┐
│     Value Head          │  ← A simple linear layer
│  (Linear → Scalar)      │     (this is the only new part)
└───────────┬─────────────┘
            │
            ↓
       Reward Score

The language model reads the prompt + response and builds up a rich understanding of what’s happening. Then the value head (usually a single linear layer) converts that understanding into a single number: the reward.

You can either:

  1. Freeze the base model (only train the value head): faster, but less expressive

  2. Train everything: slower, but the base model can learn to extract features specifically useful for predicting preferences

Most people train everything.

from transformers import AutoModel, AutoTokenizer

class RewardModel(nn.Module):
    """
    A reward model for predicting human preferences.
    
    Takes a prompt + response, outputs a scalar reward score.
    """
    
    def __init__(self, base_model, hidden_size, freeze_base=False):
        super().__init__()
        self.base_model = base_model
        
        # Optionally freeze the base model (only train the value head)
        if freeze_base:
            for param in self.base_model.parameters():
                param.requires_grad = False
        
        # Value head: converts hidden state → scalar reward
        # (Just a linear layer with dropout for regularization)
        self.value_head = nn.Sequential(
            nn.Dropout(0.1),  # Prevent overfitting
            nn.Linear(hidden_size, 1)  # hidden_size → 1 number
        )
    
    def forward(self, input_ids, attention_mask):
        """
        Compute reward for an input sequence.
        
        Args:
            input_ids: Token IDs for [prompt] [response]
            attention_mask: 1 for real tokens, 0 for padding
            
        Returns:
            reward: Scalar score for this prompt-response pair
        """
        # Step 1: Run the base model to get hidden states
        outputs = self.base_model(
            input_ids=input_ids,
            attention_mask=attention_mask,
            output_hidden_states=True
        )
        
        # Step 2: Get the last hidden state
        # Shape: (batch_size, sequence_length, hidden_size)
        hidden_states = outputs.last_hidden_state
        
        # Step 3: Extract the hidden state at the LAST non-padding token
        # Why the last token? It's "seen" the entire prompt + response,
        # so it has all the context needed to judge quality.
        
        # Find the position of the last real token for each sequence
        seq_lengths = attention_mask.sum(dim=1) - 1  # -1 for 0-indexing
        
        # Index into the hidden states to grab that last position
        batch_size = hidden_states.shape[0]
        last_hidden = hidden_states[
            torch.arange(batch_size),
            seq_lengths.long()
        ]
        
        # Step 4: Pass through value head to get scalar reward
        reward = self.value_head(last_hidden).squeeze(-1)
        
        return reward


# Create a reward model
print("Building a reward model from GPT-2...")
print()

model_name = "gpt2"
base_model = AutoModel.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 doesn't have a pad token by default

reward_model = RewardModel(
    base_model,
    hidden_size=base_model.config.hidden_size,
    freeze_base=False  # Train everything
)

print(f"Reward model created.")
print()
print(f"Base model parameters: {sum(p.numel() for p in base_model.parameters()):,}")
print(f"Value head parameters: {sum(p.numel() for p in reward_model.value_head.parameters()):,}")
print()
print("That value head is tiny: just 769 parameters.")
print("(It's literally: 768-dimensional vector → 1 number)")
print("But it's enough to learn human preferences when combined with the base model.")
Building a reward model from GPT-2...

Reward model created.

Base model parameters: 124,439,808
Value head parameters: 769

That value head is tiny: just 769 parameters.
(It's literally: 768-dimensional vector → 1 number)
But it's enough to learn human preferences when combined with the base model.

Testing the Reward Model

Let’s test our (untrained) reward model.

We’ll give it two responses to “What is 2+2?”:

  1. A correct answer

  2. An “I don’t know” response

Before training, the rewards will be random. The model hasn’t learned anything about preferences yet.

# Test forward pass with two responses
test_texts = [
    "What is 2+2? The answer is 4.",
    "What is 2+2? I don't know."
]

# Tokenize both responses
inputs = tokenizer(
    test_texts,
    padding=True,  # Pad to same length
    return_tensors="pt"  # Return PyTorch tensors
)

# Run through the reward model
with torch.no_grad():  # Don't compute gradients (we're just testing)
    rewards = reward_model(
        input_ids=inputs["input_ids"],
        attention_mask=inputs["attention_mask"]
    )

print("Reward scores (before training):")
print("=" * 60)
for text, reward in zip(test_texts, rewards):
    print(f"  '{text}' → {reward.item():.4f}")

print()
print("The rewards are random.")
print("The model has no idea that the first response is better.")
print()
print("After training on preference data, we'd expect:")
print("  - First response (correct answer): HIGH reward")
print("  - Second response (unhelpful): LOW reward")
Reward scores (before training):
============================================================
  'What is 2+2? The answer is 4.' → 2.8958
  'What is 2+2? I don't know.' → 1.1211

The rewards are random.
The model has no idea that the first response is better.

After training on preference data, we'd expect:
  - First response (correct answer): HIGH reward
  - Second response (unhelpful): LOW reward

The Training Objective

We’ve got the architecture. Now: how do we train it?

We have preference data: lots of examples where humans said “Response A is better than Response B” for some prompt.

Our goal: teach the model to assign higher rewards to preferred responses.

We do this with the ranking loss (also called the “preference loss”):

$$\mathcal{L} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma(r_\theta(x, y_w) - r_\theta(x, y_l)) \right]$$

Breaking it down:

  • $\mathcal{L}$ is the loss we’re minimizing

  • $\mathbb{E}_{(...)}$ means “expected value over...” (in practice: average over all training examples)

  • $(x, y_w, y_l)$ is one training example: a prompt $x$, winning response $y_w$, and losing response $y_l$

  • $\log$ is the natural logarithm

  • $\sigma(...)$ is sigmoid

  • $r_\theta(x, y_w) - r_\theta(x, y_l)$ is the difference in rewards

So: “The loss is the negative log probability that we assign the correct preference.”

We negate it because we minimize loss. Maximizing log probability = minimizing negative log probability.

Intuitively:

  • If $r_\theta(x, y_w) > r_\theta(x, y_l)$ (we correctly ranked the winner higher), the loss is LOW

  • If $r_\theta(x, y_w) < r_\theta(x, y_l)$ (we got it backwards), the loss is HIGH. The gradient will push $r(y_w)$ up and $r(y_l)$ down.

  • If $r_\theta(x, y_w) \approx r_\theta(x, y_l)$ (we’re not sure), the loss is medium. We’ll adjust the rewards to be more confident.

A nice property of this loss: it doesn’t care about the absolute values of rewards, only the differences. The model can scale its rewards however it wants, as long as the rankings are correct.
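As a preview of the training notebook, here is a minimal PyTorch sketch of this ranking loss. The tensor names (`chosen_rewards`, `rejected_rewards`) are illustrative; they stand for the reward model’s scores on the winning and losing responses of a batch.

import torch
import torch.nn.functional as F

def ranking_loss(chosen_rewards, rejected_rewards):
    """Pairwise preference loss: -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    # logsigmoid is numerically safer than torch.log(torch.sigmoid(...))
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy batch of three comparisons with reward differences +2, 0, and -2
chosen_rewards = torch.tensor([3.0, 1.0, 0.0])
rejected_rewards = torch.tensor([1.0, 1.0, 2.0])

per_pair = -F.logsigmoid(chosen_rewards - rejected_rewards)
print(per_pair)                                        # ≈ [0.13, 0.69, 2.13]
print(ranking_loss(chosen_rewards, rejected_rewards))  # batch loss ≈ 0.98

Note that only the differences matter here: adding the same constant to both reward tensors leaves the loss unchanged.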

Next Steps

We’ve covered the theory. Now we make it real.

In the following notebooks:

  1. Preference Data: Where does this data come from? What does it look like? How do we format it?

  2. Training: Complete implementation of the training loop. We’ll train a reward model.

  3. Evaluation: How do you know if your reward model is good? (Accuracy alone isn’t enough; we need to watch for reward hacking.)