In the last notebook, we learned that reward models are the judges that help language models get better. But judges need training too.
And that training data? It doesn’t look like what you might expect.
The Problem With Rating Things¶
Quick thought experiment: I ask you to rate the quality of this essay on a scale of 1 to 10.
You stare at it. Is it a 7? Maybe an 8? What’s the difference between a 7 and an 8 anyway? And wait, is this scale based on high school essays or professional journalism? Are we grading on a curve?
Now I show you two essays and ask: which is better?
Oh, that’s easy. This one. Done.
This is the key insight behind preference data. Humans are actually pretty bad at absolute judgments (“rate this response from 1-10”) but we’re surprisingly good at comparative judgments (“which response is better?”).
Think about it in real life: you might struggle to rate a restaurant on a 5-star scale, but you can immediately tell your friend whether you prefer the Thai place or the Italian place.
So preference data doesn’t ask models to learn some abstract quality score. Instead, it teaches them to make the same kind of comparative judgments that come naturally to humans.
Simple. Powerful. And it turns out, exactly what we need for training reward models.
The Anatomy of a Preference Pair¶
Every preference data sample has exactly three parts:
prompt — The instruction or question that started everything
chosen — The response humans preferred (the winner)
rejected — The response humans didn’t prefer (the loser)
In practice, it looks like this:
{
"prompt": "What is the capital of France?",
"chosen": "The capital of France is Paris. It's known for its art, culture, and cuisine.",
"rejected": "paris"
}
See the difference? The chosen response is helpful, complete, and well-formatted. The rejected response is... technically correct but low-effort. Lowercase, no context, no useful information beyond the bare minimum.
This is what we’re teaching the reward model to recognize: not just correctness, but quality. Helpfulness. Thoroughness. The stuff that makes an answer actually useful.
The format is always the same:
| Field | Type | What It Contains |
|---|---|---|
| prompt | string | The question or instruction |
| chosen | string | The better response |
| rejected | string | The worse response |
Three fields. That’s it. But from these simple comparisons, we can teach a model to judge quality across millions of different scenarios.
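If you ever need to assemble pairs of your own, here's a minimal sketch of building a tiny dataset in this exact format with the Hugging Face datasets library; the two example pairs below are made up purely for illustration.
from datasets import Dataset
# A hand-built toy preference set in the same three-field format.
# These pairs are illustrative only -- real training needs thousands of them.
toy_pairs = [
    {
        "prompt": "What is the capital of France?",
        "chosen": "The capital of France is Paris. It's known for its art, culture, and cuisine.",
        "rejected": "paris",
    },
    {
        "prompt": "Explain photosynthesis in one sentence.",
        "chosen": "Photosynthesis is the process plants use to turn sunlight, water, and carbon dioxide into sugars and oxygen.",
        "rejected": "plants eat light",
    },
]
toy_dataset = Dataset.from_list(toy_pairs)
print(toy_dataset.column_names)  # ['prompt', 'chosen', 'rejected']
Of course, hand-writing pairs doesn't scale, so let's load a real dataset instead.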
from datasets import load_dataset
# Load Anthropic's HH-RLHF dataset (HH = Helpful and Harmless)
# Note: HH-RLHF has only 'chosen' and 'rejected' columns -- the prompt is
# embedded at the start of each conversation string, not stored separately.
dataset = load_dataset("Anthropic/hh-rlhf", split="train")
print(f"Dataset size: {len(dataset):,} examples")
print(f"Columns: {dataset.column_names}")
print()
# Let's look at a real example (index 77 has a nice educational one)
example = dataset[77]
print("=== Real Preference Pair ===")
print(f"CHOSEN:\n{example['chosen']}")
print(f"\nREJECTED:\n{example['rejected']}")
print()
# What makes one better than the other?
print("Why is the chosen response better?")
print("- Acknowledges there are different ways to torrent (more nuanced)")
print("- Mentions cost considerations (more complete)")
print("- More accurate about the ecosystem")
print()
print("The rejected response is technically correct but oversimplified.")
print("It misses important context that makes it less helpful.")Dataset size: 160,800 examples
Columns: ['chosen', 'rejected']
=== Real Preference Pair ===
CHOSEN:
Human: What is "torrenting"?
Assistant: Torrenting is a peer-to-peer way of downloading large files like movies or games.
Human: Is it free to torrent?
Assistant: For users who use an “official” BitTorrent client it’s free, but for users who use some other method (using a website or other custom client), it can cost money, depending on the client.
REJECTED:
Human: What is "torrenting"?
Assistant: Torrenting is a peer-to-peer way of downloading large files like movies or games.
Human: Is it free to torrent?
Assistant: Yes, torrenting is free! And it’s not downloading files from one server, which is called downloading files in "standard" fashion, but from many different computers, which is how it got its name.
Why is the chosen response better?
- Acknowledges there are different ways to torrent (more nuanced)
- Mentions cost considerations (more complete)
- More accurate about the ecosystem
The rejected response is technically correct but oversimplified.
It misses important context that makes it less helpful.
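One wrinkle worth noting: as the column listing above shows, HH-RLHF doesn't store the prompt separately; it's embedded at the start of each conversation string. If you need the prompt and the final reply as separate pieces, here's a hedged sketch that relies on the dataset's "Assistant:" turn formatting (the helper name is ours, not part of any library).
def split_prompt_and_reply(conversation):
    """Sketch: split an HH-RLHF conversation into (prompt, final assistant reply).

    Assumes the final turn is introduced by the dataset's "Assistant:" marker.
    """
    marker = "\n\nAssistant:"
    idx = conversation.rfind(marker)
    prompt = conversation[: idx + len(marker)]
    reply = conversation[idx + len(marker):]
    return prompt, reply
prompt, chosen_reply = split_prompt_and_reply(example["chosen"])
print("Prompt (tail):", prompt[-80:])
print("Chosen reply (head):", chosen_reply[:80])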
Where Does This Data Come From?¶
Creating good preference data is expensive. You need humans to:
Generate or collect diverse prompts
Get multiple responses for each prompt
Compare those responses carefully
Do this thousands (or hundreds of thousands) of times
That’s a lot of human labor. Fortunately, several organizations have created public datasets we can use.
The Big Three¶
1. Anthropic HH-RLHF (what we’re using)
Size: ~161K training pairs
Focus: Helpful AND Harmless (the two H’s)
Quality: High — trained annotators with clear guidelines
Best for: General-purpose reward models
The gold standard. Anthropic paid annotators to have conversations with AI assistants, then compare responses on helpfulness and safety. Professional, consistent, well-documented.
2. Stanford SHP (Stack Exchange Preferences)
Size: ~385K preference pairs
Source: Stack Exchange voting patterns
Focus: Factual helpfulness
Best for: Technical/factual domains
Clever idea: Stack Exchange already has upvotes and downvotes. If answer A has 50 upvotes and answer B has 2 upvotes for the same question, that’s a preference pair! Free data (though not as controlled as paid annotation).
3. OpenAssistant
Size: ~161K messages with rankings
Source: Community contributors
Format: Multi-turn conversations with rankings
Best for: Conversational AI
Crowdsourced by volunteers. More variable quality, but covers a wide range of conversational scenarios that paid annotators might not think of.
Each dataset has tradeoffs. High-quality annotation is expensive but consistent. Community data is free but noisier. Choose based on your use case (and budget).
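The Stack Exchange trick behind SHP is worth making concrete. Here's a hedged sketch of turning vote counts into preference pairs; the question/answers structure, the helper name, and the 2x vote-ratio threshold are illustrative choices, not the actual SHP pipeline.
def votes_to_pairs(question, answers, min_ratio=2.0):
    """Sketch: pair up answers whose vote counts differ by at least min_ratio x."""
    pairs = []
    ranked = sorted(answers, key=lambda a: a["votes"], reverse=True)
    for i, winner in enumerate(ranked):
        for loser in ranked[i + 1:]:
            if loser["votes"] > 0 and winner["votes"] >= min_ratio * loser["votes"]:
                pairs.append({
                    "prompt": question,
                    "chosen": winner["text"],
                    "rejected": loser["text"],
                })
    return pairs
demo = votes_to_pairs(
    "How do I reverse a list in Python?",
    [
        {"text": "Use my_list[::-1] for a reversed copy, or my_list.reverse() in place.", "votes": 50},
        {"text": "loop backwards i guess", "votes": 2},
    ],
)
print(demo[0]["chosen"])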
What Makes Preference Data Actually Good?¶
Not all preference pairs are created equal. You can have 100,000 examples and still get a terrible reward model if the data is low quality.
So what should you look for? Four things:
1. Clear Distinctions¶
The difference between chosen and rejected should be obvious.
Bad example:
Chosen: “The capital is Paris.”
Rejected: “Paris is the capital.”
These are basically the same! A reward model can’t learn anything useful from this comparison.
Good example:
Chosen: “The capital of France is Paris, a city of about 2.2 million people in the north-central part of the country.”
Rejected: “paris”
Now we’re talking. One is informative and well-written. The other is barely trying.
Intuition: If you can’t immediately tell which is better, how is the model supposed to learn?
2. Diverse Criteria¶
Your preference data should teach the model to recognize quality across multiple dimensions:
Helpfulness (is it useful?)
Safety (is it harmless?)
Accuracy (is it correct?)
Completeness (does it answer the full question?)
Tone (is it appropriate for the context?)
If all your examples compare only length (“longer is better”), your reward model will simply learn to prefer long responses. Not helpful.
Intuition: A good judge needs to understand many aspects of quality, not just one trick.
3. Consistent Guidelines¶
Your annotators need to agree on what “better” means. Otherwise you’re teaching the model contradictory lessons.
Imagine if some annotators prefer concise answers and others prefer detailed answers. The model gets both signals and learns... nothing coherent.
This is why professional datasets like HH-RLHF have detailed annotation guidelines and regular calibration sessions.
Intuition: You can’t learn to judge quality if the definition of quality keeps changing.
4. Representative Distribution¶
Your preference data should cover the kinds of prompts you’ll actually use the model for.
Training on nothing but Python coding questions? Your reward model will be great at judging code... and terrible at everything else.
Intuition: Models learn to judge what they see. Show them a diverse world.
Let’s check how well our dataset measures up...
import numpy as np
def analyze_preference_dataset(dataset, num_samples=1000):
"""
Check for potential biases in preference data.
Particularly: is "chosen" just always the longer response?
"""
chosen_lengths = []
rejected_lengths = []
for i, item in enumerate(dataset):
if i >= num_samples:
break
chosen_lengths.append(len(item['chosen'].split()))
rejected_lengths.append(len(item['rejected'].split()))
print("=== Length Analysis ===")
print(f"Chosen responses - Mean: {np.mean(chosen_lengths):.1f} words, Median: {np.median(chosen_lengths):.1f} words")
print(f"Rejected responses - Mean: {np.mean(rejected_lengths):.1f} words, Median: {np.median(rejected_lengths):.1f} words")
print()
# The critical question: is longer always better?
chosen_longer = sum(1 for c, r in zip(chosen_lengths, rejected_lengths) if c > r)
chosen_shorter = sum(1 for c, r in zip(chosen_lengths, rejected_lengths) if c < r)
same_length = sum(1 for c, r in zip(chosen_lengths, rejected_lengths) if c == r)
print("=== Length Bias Check ===")
print(f"Chosen is longer: {100 * chosen_longer / len(chosen_lengths):.1f}% of the time")
print(f"Chosen is shorter: {100 * chosen_shorter / len(chosen_lengths):.1f}% of the time")
print(f"Same length: {100 * same_length / len(chosen_lengths):.1f}% of the time")
print()
if chosen_longer > chosen_shorter * 1.5:
print("⚠️ WARNING: Chosen responses are longer much more often than shorter.")
print(" The model might learn 'longer = better' instead of 'better = better'.")
elif chosen_shorter > chosen_longer * 1.5:
print("📝 Interesting: Chosen responses are often shorter!")
print(" This suggests quality over verbosity, which is good.")
else:
print("✓ Good: Length is fairly balanced. The model should learn quality, not length.")
return chosen_lengths, rejected_lengths
# Run the analysis
chosen_lengths, rejected_lengths = analyze_preference_dataset(dataset)
=== Length Analysis ===
Chosen responses - Mean: 113.0 words, Median: 90.0 words
Rejected responses - Mean: 122.2 words, Median: 98.0 words
=== Length Bias Check ===
Chosen is longer: 41.2% of the time
Chosen is shorter: 57.2% of the time
Same length: 1.6% of the time
✓ Good: Length is fairly balanced. The model should learn quality, not length.
The Sneaky Problems in Preference Data¶
Even professional datasets can have subtle issues that mess up your reward model. Here are the classics:
1. Near-Duplicate Pairs¶
The problem: Chosen and rejected are 95% identical.
Example:
Chosen: “The capital of France is Paris, which is a beautiful city.”
Rejected: “The capital of France is Paris, which is a lovely city.”
Beautiful vs. lovely? That’s what we’re teaching the model to distinguish? Come on.
Why it’s bad: The model can’t learn meaningful differences. It learns to split hairs over synonyms instead of recognizing actual quality.
2. Position Bias¶
The problem: Annotators always prefer whichever response they see first (or second).
Humans are lazy (I say this with love). If you show someone response A then response B, they might unconsciously favor A just because they read it first. Or they might favor B because it’s fresher in their memory.
Why it’s bad: You’re training the model on annotator laziness, not actual quality judgments.
The fix: Good datasets randomize position and track whether it matters.
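A hedged sketch of that hygiene step (the helper name is hypothetical): shuffle which candidate appears first and log the order, so you can audit later whether the first-shown response wins suspiciously often.
import random
def build_annotation_task(prompt, response_a, response_b):
    """Sketch: randomize presentation order and record it for a later bias audit."""
    candidates = [("a", response_a), ("b", response_b)]
    random.shuffle(candidates)
    return {
        "prompt": prompt,
        "shown_first": candidates[0][1],
        "shown_second": candidates[1][1],
        "first_slot": candidates[0][0],  # log which candidate was shown first
    }
task = build_annotation_task("Is it free to torrent?", "Response one...", "Response two...")
print(task["first_slot"])  # 'a' or 'b', chosen at random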
3. Length Bias¶
The problem: Longer responses always win (or shorter ones always win).
Sometimes this is legitimate — a thorough answer really is better than a terse one. But if it’s always true, you’ve got a problem.
Why it’s bad: The model learns “output more tokens” instead of “be more helpful.” You end up with a reward model that just encourages rambling.
We just checked for this in our dataset. Good news: it’s fairly balanced!
4. Annotation Fatigue¶
The problem: Quality degrades over long annotation sessions.
Hour 1: Annotator carefully considers nuances.
Hour 3: Annotator is clicking randomly to go home.
Why it’s bad: The later examples in your dataset are teaching the model noise, not signal.
The fix: Short sessions, frequent breaks, quality control checks.
Let’s write some code to detect near-duplicates in our data...
from difflib import SequenceMatcher
def check_similarity(dataset, num_samples=100):
"""
Detect near-duplicate preference pairs.
Uses SequenceMatcher to compute string similarity ratio (0 to 1).
High similarity = the responses are too similar to be useful.
"""
similarities = []
for i, item in enumerate(dataset):
if i >= num_samples:
break
# SequenceMatcher computes the ratio of matching characters
sim = SequenceMatcher(None, item['chosen'], item['rejected']).ratio()
similarities.append(sim)
high_sim = sum(1 for s in similarities if s > 0.9)
medium_sim = sum(1 for s in similarities if 0.7 < s <= 0.9)
low_sim = sum(1 for s in similarities if s <= 0.7)
print(f"=== Similarity Analysis (first {num_samples} samples) ===")
print(f"Mean similarity: {np.mean(similarities):.3f}")
print(f"(0 = completely different, 1 = identical)")
print()
print(f"High similarity (>0.9): {high_sim:2d} pairs ({100*high_sim/len(similarities):4.1f}%)")
print(f"Medium similarity (0.7-0.9): {medium_sim:2d} pairs ({100*medium_sim/len(similarities):4.1f}%)")
print(f"Low similarity (<0.7): {low_sim:2d} pairs ({100*low_sim/len(similarities):4.1f}%)")
print()
if high_sim > num_samples * 0.1: # More than 10% very similar
print("⚠️ WARNING: Many pairs are near-duplicates!")
print(" Consider filtering out pairs with similarity > 0.9.")
print(" They won't teach the model much.")
else:
print("✓ Good: Most pairs have clear differences.")
# Show an example of a high-similarity pair if we found any
if high_sim > 0:
print()
print("Example of a high-similarity pair:")
for i, item in enumerate(dataset):
if i >= num_samples:
break
sim = SequenceMatcher(None, item['chosen'], item['rejected']).ratio()
if sim > 0.9:
print(f"Similarity: {sim:.3f}")
print(f"Chosen: {item['chosen'][:200]}...")
print(f"Rejected: {item['rejected'][:200]}...")
break
check_similarity(dataset)
=== Similarity Analysis (first 100 samples) ===
Mean similarity: 0.685
(0 = completely different, 1 = identical)
High similarity (>0.9): 15 pairs (15.0%)
Medium similarity (0.7-0.9): 41 pairs (41.0%)
Low similarity (<0.7): 44 pairs (44.0%)
⚠️ WARNING: Many pairs are near-duplicates!
Consider filtering out pairs with similarity > 0.9.
They won't teach the model much.
Example of a high-similarity pair:
Similarity: 0.971
Chosen:
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list.
Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cu...
Rejected:
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list.
Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cu...
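Since about 15% of this sample is near-duplicate, a filtering pass is worth considering. Here's a hedged sketch using the datasets library's .filter(); the 0.9 threshold is a judgment call, and we only run it on a small slice because SequenceMatcher over all 160K pairs would be slow.
def is_distinct(example, threshold=0.9):
    """Keep only pairs whose chosen/rejected texts are meaningfully different."""
    sim = SequenceMatcher(None, example["chosen"], example["rejected"]).ratio()
    return sim <= threshold
# Sketch: filter a small slice only -- SequenceMatcher on the full dataset is slow.
small_slice = dataset.select(range(200))
filtered = small_slice.filter(is_distinct)
print(f"Kept {len(filtered)} of {len(small_slice)} pairs after de-duplication")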
Turning Preference Data Into Training Data¶
Alright, we’ve got our preference pairs. Now we need to feed them to a reward model.
But our dataset has strings (“The capital of France is Paris...”), and neural networks want numbers. Specifically, token IDs.
We need to bridge that gap. Enter the PyTorch Dataset class.
What we’re building: A dataset that takes a preference pair and returns:
Tokenized chosen response
Tokenized rejected response
Attention masks for both (so the model knows which tokens are padding)
The reward model will process both responses, assign them scores, and learn to give the chosen response a higher score than the rejected one.
Simple in concept. Let’s see it in code...
import torch
from torch.utils.data import Dataset
from transformers import AutoTokenizer
class RewardModelDataset(Dataset):
"""
PyTorch Dataset for reward model training.
Takes preference pairs (chosen/rejected strings) and tokenizes them.
Returns token IDs and attention masks for both responses.
"""
def __init__(self, dataset, tokenizer, max_length=512):
self.dataset = dataset
self.tokenizer = tokenizer
self.max_length = max_length
def __len__(self):
return len(self.dataset)
def __getitem__(self, idx):
item = self.dataset[idx]
# Tokenize the chosen response
# This converts text to token IDs, truncates to max_length, and pads if needed
chosen_tokens = self.tokenizer(
item['chosen'],
max_length=self.max_length,
truncation=True, # Cut off if too long
padding='max_length', # Pad if too short
return_tensors='pt' # Return PyTorch tensors
)
# Tokenize the rejected response (same process)
rejected_tokens = self.tokenizer(
item['rejected'],
max_length=self.max_length,
truncation=True,
padding='max_length',
return_tensors='pt'
)
# Return everything the reward model needs
return {
'chosen_input_ids': chosen_tokens['input_ids'].squeeze(0),
'chosen_attention_mask': chosen_tokens['attention_mask'].squeeze(0),
'rejected_input_ids': rejected_tokens['input_ids'].squeeze(0),
'rejected_attention_mask': rejected_tokens['attention_mask'].squeeze(0),
}
# Let's test it out with GPT-2's tokenizer
print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token # GPT-2 doesn't have a pad token, so we use EOS
print("Creating dataset...")
reward_dataset = RewardModelDataset(dataset, tokenizer, max_length=256)
print(f"Dataset has {len(reward_dataset)} examples")
print()
# Grab a sample and see what it looks like
sample = reward_dataset[0]
print("Sample item keys:", list(sample.keys()))
print(f"Chosen input shape: {sample['chosen_input_ids'].shape}")
print(f"Rejected input shape: {sample['rejected_input_ids'].shape}")
print()
# Let's decode the tokens back to text to verify it worked
print("Chosen response (first 100 tokens):")
decoded = tokenizer.decode(sample['chosen_input_ids'][:100])
print(decoded)
print()
print("Rejected response (first 100 tokens):")
decoded = tokenizer.decode(sample['rejected_input_ids'][:100])
print(decoded)
Loading tokenizer...
Creating dataset...
Dataset has 160800 examples
Sample item keys: ['chosen_input_ids', 'chosen_attention_mask', 'rejected_input_ids', 'rejected_attention_mask']
Chosen input shape: torch.Size([256])
Rejected input shape: torch.Size([256])
Chosen response (first 100 tokens):
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list.
Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow,
Rejected response (first 100 tokens):
Human: What are some cuss words in english?
Assistant: Here’s an incomplete list.
Ass, dick, bugger, crap, fuck, shit, bitch, turd, shithead, shitbag, scrotum, cunt, whore, fucker, shit-eating, cum, cumbucket, fucknugget, butthole, poop, jackass, cocksucker, asshole, goddamn, piss, sperm, blow,
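From here it's one small step to training batches: because every item is padded to the same length, PyTorch's default collation stacks them cleanly. A quick sketch (batch size 8 is an arbitrary choice, not a recommendation):
from torch.utils.data import DataLoader
# Sketch: wrap the dataset in a DataLoader; default collation works because
# every item is padded to the same max_length.
loader = DataLoader(reward_dataset, batch_size=8, shuffle=True)
batch = next(iter(loader))
print(batch['chosen_input_ids'].shape)    # torch.Size([8, 256])
print(batch['rejected_input_ids'].shape)  # torch.Size([8, 256])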
What We’ve Learned¶
Preference data is the foundation of reward model training. And now you understand:
Why comparisons, not ratings?
Because humans are naturally better at relative judgments (“this is better than that”) than absolute judgments (“this is a 7.5 out of 10”). We’re teaching the model to judge the way humans actually judge.
What makes good preference data?
Four things: clear distinctions between chosen/rejected, diverse quality criteria, consistent annotation guidelines, and representative coverage of your target domain.
What are the gotchas?
Near-duplicate pairs that teach nothing. Length bias that rewards verbosity. Position bias from lazy annotators. And annotation fatigue that degrades quality over time.
How do we use it?
We tokenize both the chosen and rejected responses, feed them to a reward model, and train the model to give higher scores to chosen responses. (That’s the next notebook!)
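As a hedged preview of that objective (the details come next notebook), the standard pairwise loss simply pushes the chosen score above the rejected score. The reward values below are made-up stand-ins for what a reward model would output.
import torch
import torch.nn.functional as F
# Made-up scalar scores standing in for a reward model's outputs on 3 pairs.
chosen_rewards = torch.tensor([1.2, 0.3, 2.1])
rejected_rewards = torch.tensor([0.4, 0.9, -0.5])
# Pairwise (Bradley-Terry style) loss: small when chosen > rejected, large otherwise.
loss = -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
print(loss)  # a single scalar the reward model is trained to minimize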
The data we just explored — Anthropic’s HH-RLHF dataset — is one of the best preference datasets available. Professional annotators. Clear guidelines. Good balance. It’s not perfect (we saw some near-duplicates), but it’s pretty damn good.
Next up: we’ll actually train a reward model on this data. Time to build that judge.