You know how a pre-trained language model can predict the next word in a sentence? That’s cool and all, but it’s not exactly...helpful. If you ask GPT-2 “What’s the capital of France?”, it might just continue with more questions instead of answering you.
That’s where fine-tuning comes in.
In this notebook, we’re going to learn about Supervised Fine-Tuning — the technique that transforms a raw pre-trained model into something that actually responds to instructions like a helpful assistant. It’s what makes ChatGPT feel like ChatGPT instead of just an autocomplete engine.
What is Supervised Fine-Tuning?¶
Let’s break down the acronym first: SFT stands for Supervised Fine-Tuning.
What does “supervised” mean here? Same thing it means in “supervised learning” — we’re showing the model examples of correct input-output pairs. Think of it like teaching with flashcards: “When someone asks this question, give this answer.”
The key insight is beautifully simple: if you want a model to answer questions well, show it examples of good question-answer pairs. If you want it to write code, show it examples of good code solutions. The model learns by example.
Here’s the math behind it:

$$\mathcal{L}_{\text{SFT}} = -\sum_{t=1}^{T} \log P(y_t \mid x, y_{<t})$$

Okay, but what does that actually mean? Let’s break it down symbol by symbol:

$x$ is your instruction or prompt, like “What’s the capital of France?”
$y$ is the complete response, like “The capital of France is Paris.”
$y_t$ is the token at position $t$ in the response; maybe “Paris” is token 7
$y_{<t}$ means all the tokens before position $t$, everything up to but not including “Paris”
$P(y_t \mid x, y_{<t})$ is the probability the model assigns to the correct token, given the instruction and all previous response tokens
So we’re summing up (that’s the $\sum$ symbol) the log probabilities across all tokens in the response. The negative sign is there because we want to maximize probability, which is the same as minimizing negative log probability. (Loss functions are things we minimize, by convention.)
In plain English: we’re teaching the model to predict each word in the response, one at a time, given the instruction and the words it’s already generated.
Pretty straightforward, right?
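To make that concrete, here’s a tiny sketch of the loss computed by hand. It’s a minimal illustration assuming PyTorch, with random logits standing in for a real model’s forward pass:

import torch
import torch.nn.functional as F

# Toy setup: logits for a 5-token response over a 10-token vocabulary.
# In real SFT these would come from the model's forward pass.
vocab_size, response_len = 10, 5
logits = torch.randn(response_len, vocab_size)
targets = torch.tensor([3, 1, 4, 1, 5])  # the "correct" response token IDs

# The formula, written out: sum of negative log probabilities of correct tokens
log_probs = F.log_softmax(logits, dim=-1)
nll_sum = -log_probs[torch.arange(response_len), targets].sum()

# The built-in cross-entropy computes the same quantity, averaged over tokens
ce_mean = F.cross_entropy(logits, targets)

print(f"Manual NLL sum:     {nll_sum.item():.4f}")
print(f"Cross-entropy mean: {ce_mean.item():.4f}  (= sum / {response_len})")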
SFT vs Pre-Training: What’s the Difference?¶
You might be thinking: “Wait, isn’t this just...more training? What makes SFT different from pre-training?”
Great question! Let’s compare them side by side:
| Aspect | Pre-Training | Supervised Fine-Tuning (SFT) |
|---|---|---|
| Data | Raw text from everywhere (books, websites, Wikipedia) | Carefully curated (instruction, response) pairs |
| Objective | Predict the next token in any text | Generate helpful responses to instructions |
| Scale | Trillions of tokens (massive!) | Thousands to millions of examples (much smaller) |
| Duration | Weeks to months on huge GPU clusters | Hours to days on a single GPU |
| Learning rate | Higher (around 1e-4), since the model is learning from scratch | Lower (around 1e-5 to 5e-5), since we’re tweaking, not rebuilding |
The key difference? Pre-training teaches the model language. What words go together, grammar, facts about the world. It’s like teaching someone to read.
SFT teaches the model how to behave. How to respond when asked a question, how to format code, when to be concise vs detailed. It’s like teaching someone to be a good conversationalist.
Pre-training is expensive. SFT is cheap (relatively speaking). That’s why you pre-train once and fine-tune many times for different use cases.
Popular SFT Datasets¶
Where do these (instruction, response) pairs come from? Well, someone has to create them. Here are three famous datasets that kicked off the open-source instruction-following movement:
Alpaca (Stanford, 2023)¶
Size: 52,000 instructions
Source: Generated by GPT-3.5 based on 175 seed examples
Format: Instruction + optional input → output
Vibe: Diverse tasks, but sometimes a bit...formulaic (it’s AI-generated, after all)
Stanford trained a LLaMA model on this dataset for less than $600. That’s when everyone realized you could create capable instruction-following models on a budget.
Dolly (Databricks, 2023)¶
Size: 15,000 instructions
Source: Written by actual humans (Databricks employees!)
Quality: Higher quality than Alpaca, more natural
Limitation: Smaller, less diverse
OpenAssistant (LAION, 2023)¶
Size: 161,000 messages in conversation trees
Source: Community-contributed through a web interface
Format: Multi-turn conversations with human rankings
Cool feature: Includes quality ratings, so you can filter for the good stuff
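For example, here’s a hedged sketch of how you might keep only top-ranked English assistant replies. (The dataset name OpenAssistant/oasst1 and the lang, role, and rank columns are assumptions based on the original release; check dataset.column_names before relying on them.)

from datasets import load_dataset

# Keep only top-ranked English assistant messages (column names assumed)
oasst = load_dataset("OpenAssistant/oasst1", split="train")
good = oasst.filter(
    lambda m: m["lang"] == "en"
    and m["role"] == "assistant"
    and m["rank"] == 0  # rank 0 = best-rated reply among its siblings
)
print(f"Kept {len(good):,} of {len(oasst):,} messages")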
We’ll use Alpaca in our examples because it’s clean, well-formatted, and perfect for learning.
from datasets import load_dataset
# Load the Alpaca dataset (cleaned version)
# The "cleaned" version removes some problematic examples from the original
dataset = load_dataset("yahma/alpaca-cleaned", split="train")
print(f"Dataset size: {len(dataset):,} examples")
print(f"Columns: {dataset.column_names}")
print()
# Let's look at a real example
example = dataset[0]
print("=" * 60)
print("Example 1:")
print("=" * 60)
print(f"Instruction: {example['instruction']}")
if example['input']:
    print(f"Input: {example['input']}")
print(f"Output: {example['output']}")
print()
# And another one with an input field
example = dataset[1]
print("=" * 60)
print("Example 2:")
print("=" * 60)
print(f"Instruction: {example['instruction']}")
if example['input']:
    print(f"Input: {example['input']}")
print(f"Output: {example['output'][:200]}...")  # Truncate if long
Dataset size: 51,760 examples
Columns: ['output', 'input', 'instruction']
============================================================
Example 1:
============================================================
Instruction: Give three tips for staying healthy.
Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and vegetables, lean protein, whole grains, and healthy fats. This helps to provide your body with the essential nutrients to function at its best and can help prevent chronic diseases.
2. Engage in regular physical activity: Exercise is crucial for maintaining strong bones, muscles, and cardiovascular health. Aim for at least 150 minutes of moderate aerobic exercise or 75 minutes of vigorous exercise each week.
3. Get enough sleep: Getting enough quality sleep is crucial for physical and mental well-being. It helps to regulate mood, improve cognitive function, and supports healthy growth and immune function. Aim for 7-9 hours of sleep each night.
============================================================
Example 2:
============================================================
Instruction: What are the three primary colors?
Output: The three primary colors are red, blue, and yellow. These colors are called primary because they cannot be created by mixing other colors and all other colors can be made by combining them in various ...
The SFT Training Loop: Step by Step¶
Alright, so how do we actually do supervised fine-tuning? Here’s the recipe:
1. Load a pre-trained model
Start with something that already knows language — like GPT-2, LLaMA, or Mistral. No need to train from scratch!
2. Format your instruction data
Convert your (instruction, response) pairs into a format the model understands. This usually means using a chat template that adds special tokens like <|user|> and <|assistant|>.
3. Tokenize everything
Turn text into numbers (token IDs). Both the instruction and the response get tokenized together.
4. Apply loss masking
Here’s the key trick: we only compute loss on the response tokens, not the instruction tokens. Why? Because we don’t want the model to learn to predict the instruction; we want it to learn to predict good responses to instructions. (A combined sketch of steps 2 through 4 follows this list.)
5. Train with cross-entropy loss
Standard supervised learning. Show the model an instruction, ask it to predict the response token by token, update weights based on how well it did.
6. Save your fine-tuned model
Congratulations! You now have a model that follows instructions.
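Here’s a minimal sketch of steps 2 through 4 together: format one pair, tokenize it, and mask the instruction out of the labels. The <|user|>/<|assistant|> markers are illustrative (real chat templates vary by model), and -100 is the label index PyTorch’s cross-entropy ignores by convention.

import torch
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Step 2: format the pair with simple role markers (illustrative, not a real template)
prompt = "<|user|>\nWhat's the capital of France?\n<|assistant|>\n"
response = "The capital of France is Paris."

# Step 3: tokenize instruction and response together
# (tokenizing the prompt separately to find the boundary is a common approximation)
prompt_len = len(tokenizer(prompt)["input_ids"])
full_ids = tokenizer(prompt + response)["input_ids"]

# Step 4: labels are a copy of the inputs, with instruction positions set to -100
labels = [-100] * prompt_len + full_ids[prompt_len:]

input_ids = torch.tensor([full_ids])
labels = torch.tensor([labels])
print(f"Total tokens: {input_ids.shape[1]}, masked: {(labels == -100).sum().item()}")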
In the next notebooks, we’ll implement each of these steps in detail. But first, let’s load a model and see what we’re working with.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
# We'll use GPT-2 as our example model
# It's small (only 124M parameters), fast to download, and perfect for learning
model_name = "gpt2"
print("Loading model and tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
# GPT-2 doesn't have a padding token by default, so we'll set it to the EOS token
# (This is a common trick — we need a pad token for batching)
tokenizer.pad_token = tokenizer.eos_token
# Let's see what we're working with
num_params = sum(p.numel() for p in model.parameters())
print()
print(f"Model: {model_name}")
print(f"Total parameters: {num_params:,}")
print(f"Vocabulary size: {tokenizer.vocab_size:,}")
print(f"Max context length: {tokenizer.model_max_length:,} tokens")
print()
print("Note: This is the *base* GPT-2 model, not fine-tuned for instructions.")
print("It'll complete text, but it won't follow instructions. Yet.")Loading model and tokenizer...
Model: gpt2
Total parameters: 124,439,808
Vocabulary size: 50,257
Max context length: 1,024 tokens
Note: This is the *base* GPT-2 model, not fine-tuned for instructions.
It'll complete text, but it won't follow instructions. Yet.
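To see that for yourself, here’s a quick generation sketch using the model and tokenizer we just loaded. (Greedy decoding with a 30-token cap is just one reasonable choice; the exact continuation will vary.)

# Watch the base model *continue* the text instead of answering it
prompt = "What's the capital of France?"
inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_new_tokens=30,
    do_sample=False,                      # greedy decoding, for reproducibility
    pad_token_id=tokenizer.eos_token_id,  # avoids the missing-pad-token warning
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))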
What’s Next?¶
Now that we understand the big picture, here are the details. The following notebooks will walk through everything you need to know:
Instruction Formatting¶
How do you structure prompts so the model knows what’s the instruction and what’s the response? We’ll learn about chat templates and why they matter.
Loss Masking¶
Why do we only compute loss on response tokens? What happens if we don’t? We’ll implement loss masking from scratch and see the difference.
The Complete Training Loop¶
Time to put it all together. We’ll train GPT-2 on Alpaca data and watch it learn to follow instructions. (It’s pretty cool.)
LoRA (Low-Rank Adaptation)¶
Training billions of parameters is expensive. LoRA lets you fine-tune just a tiny fraction of the weights while recovering most of the quality of a full fine-tune. We’ll see how it works and why it’s so widely used.
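As a teaser, here’s roughly what the setup looks like with the peft library. This is a sketch with placeholder hyperparameters, not a recommendation:

from peft import LoraConfig, get_peft_model

# Illustrative LoRA config for GPT-2; r, alpha, and dropout are placeholder choices
lora_config = LoraConfig(
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    target_modules=["c_attn"],  # GPT-2's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
peft_model = get_peft_model(model, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% trainable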
Ready? Let’s go!