Try It Yourself!

This is it. The capstone. All those notebooks you just went through? Time to put them into practice.

We’re going to fine-tune a language model end to end, writing the training loop ourselves. Not a toy example. A real model that actually learns to follow instructions. By the end of this, you’ll have your own fine-tuned GPT-2 sitting on your hard drive, ready to answer questions.

Sound good? Let’s go.

What You’re About to Build

We’re taking a base GPT-2 model—one that’s pretty good at predicting the next word but terrible at following instructions—and teaching it to be helpful.

What you’ll learn:

  • How to actually train a model (not just read about it)

  • Why supervised fine-tuning works

  • How to evaluate if your model is any good

  • What to watch out for when things go wrong

Time investment: 30-60 minutes with a GPU. Running on CPU? Training alone can stretch to a couple of hours, so grab a big coffee or shrink num_samples in Step 4.

Important note: This is a simplified version for learning. Production fine-tuning would use LoRA (which we covered), more data, and better hyperparameter tuning. But the core ideas? Exactly the same.

Step 1: Verify Your Environment

First things first—let’s make sure you’ve got PyTorch installed and that it can see your GPU (if you have one).

Why check this first? Because finding out 20 minutes into training that CUDA isn’t working is... well, let’s just say it’s a learning experience you only need once.

# Check what we're working with
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print("Great! Training will be fast.")
else:
    print("Device: CPU")
    print("No GPU found. Training will work but be slower (10-20x).")
    print("Consider reducing num_samples in the data loading step.")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
GPU: Radeon RX 7900 XTX
Great! Training will be fast.
# Import everything we need
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np

print("All imports successful!")
print("\nIf you got any errors above, install missing packages with:")
print("  pip install transformers datasets torch tqdm")
All imports successful!

If you got any errors above, install missing packages with:
  pip install transformers datasets torch tqdm

Step 2: Load the Base Model

We’re using GPT-2 (the small version, 124M parameters). Why GPT-2 and not something bigger?

  1. It’s fast to train - You can actually finish this notebook today

  2. It’s well-understood - Lots of documentation if things break

  3. It’s big enough to learn - 124M parameters is plenty for instruction following

Think of this as the “before” photo. The model right now is decent at continuing text but hopeless at following instructions. We’re about to fix that.

# Load GPT-2 (small, 124M parameters)
model_name = "gpt2"

print(f"Loading {model_name}...")
print("This downloads ~500MB the first time, then caches locally.\n")

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 doesn't have a padding token by default, so we add one
# We just reuse the EOS token—common practice and works fine
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# How big is this thing?
total_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model loaded!")
print(f"  Parameters: {total_params:,}")
print(f"  Device: {device}")
print(f"  Memory: ~{total_params * 4 / 1e9:.1f} GB (in float32)")
Loading gpt2...
This downloads ~500MB the first time, then caches locally.

✓ Model loaded!
  Parameters: 124,439,808
  Device: cuda
  Memory: ~0.5 GB (in float32)

Step 3: Test the Base Model (Before Training)

Okay, moment of truth. Let’s see what the base model does when we ask it questions.

It’s going to be bad. Really bad. That’s the point.

Base GPT-2 was trained to predict the next token in internet text. It was never taught to answer questions. So when you ask it “What is the capital of France?” it just... continues the pattern of text it sees. Sometimes that works by accident. Usually it doesn’t.

Watch:

def generate_response(model, tokenizer, instruction, max_new_tokens=100):
    """
    Generate a response to an instruction.
    
    We format the prompt in "Alpaca style"—a specific template that works well
    for instruction following. You'll see this same format in the training data.
    """
    prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():  # Don't track gradients for inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,  # Sampling randomness (lower = more focused, higher = more random)
            top_p=0.9,  # Nucleus sampling
            do_sample=True,  # Use sampling instead of greedy
            pad_token_id=tokenizer.eos_token_id
        )
    
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part (after "### Response:")
    response = full_text.split("### Response:\n")[-1].strip()
    
    return response

# Test base model on a few questions
test_instructions = [
    "What is the capital of France?",
    "Write a haiku about programming.",
    "Explain machine learning in one sentence.",
]

print("BASE MODEL (before fine-tuning):")
print("=" * 70)
print("Watch how it fails to actually answer the questions...\n")

for instruction in test_instructions:
    print(f"Q: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    # Truncate long responses
    if len(response) > 200:
        response = response[:200] + "..."
    print(f"A: {response}")
    print("-" * 70)
BASE MODEL (before fine-tuning):
======================================================================
Watch how it fails to actually answer the questions...

Q: What is the capital of France?
A: The response type is
----------------------------------------------------------------------
Q: Write a haiku about programming.
A: Write a response that includes some code.

### Example:

#include <stdio.h> #include <sys/types.h> int main(int argc, char **argv[]) { int i, j; char *data = argc->get_data(); for (i = 0; i < argv[1];...
----------------------------------------------------------------------
Q: Explain machine learning in one sentence.
A: The following example shows how to generate the response with a single line of code.

#!/usr/bin/python # python.py import requests import requests.py import requests.py import requests.py.model impor...
----------------------------------------------------------------------

See what I mean? The model just rambles. It’s not trying to answer the question—it’s trying to continue text that looks like the prompt.

That’s because base GPT-2 was trained on raw internet text with a simple objective: predict the next word. No one ever taught it that text formatted as “Instruction:” and “Response:” means it should actually answer the question.

That’s what we’re about to fix with supervised fine-tuning.
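
To make the fix concrete: supervised fine-tuning keeps the exact same next-token objective; what changes is the data (instruction-response pairs) and which tokens count toward the loss. Here’s a tiny toy sketch (invented logits and labels, not real model output) of the masking trick you’ll meet in Step 4: PyTorch’s cross_entropy simply skips any position labeled -100.

# Toy sketch: SFT loss is ordinary next-token cross-entropy,
# except positions labeled -100 are ignored entirely.
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(6, vocab_size)                  # 6 positions of fake "model outputs"
labels = torch.tensor([-100, -100, -100, 4, 7, 2])   # first 3 = prompt tokens (masked out)

loss = F.cross_entropy(logits, labels, ignore_index=-100)
num_supervised = (labels != -100).sum().item()
print(f"Loss computed over {num_supervised} response tokens: {loss.item():.4f}")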

Step 4: Prepare Training Data

We’re using the Alpaca dataset—52,000 instruction-response pairs created by Stanford. Things like:

  • Instruction: “Give three tips for staying healthy”

  • Response: “1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.”

Perfect for teaching a model to follow instructions.

Key insight: We’ll use a small subset (500 examples) for speed. This is enough to see the model learn! For production use, you’d train on the full dataset.

# Load the Alpaca dataset
print("Loading Alpaca dataset from HuggingFace...")
raw_dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Use a small subset for quick training
# Feel free to adjust this:
#   - 100 samples: Very fast, model learns a bit (2-3 min on GPU)
#   - 500 samples: Good learning, reasonable time (10-15 min on GPU)
#   - 5000+ samples: Better results, longer training (1+ hour)
num_samples = 500

raw_dataset = raw_dataset.select(range(num_samples))

print(f"\n✓ Dataset loaded: {len(raw_dataset)} training examples")
print(f"\nHere's what one example looks like:")
print(f"  Instruction: {raw_dataset[0]['instruction']}")
if raw_dataset[0]['input']:
    print(f"  Input: {raw_dataset[0]['input']}")
print(f"  Output: {raw_dataset[0]['output'][:100]}...")
print(f"\nThe model will learn to generate 'Output' given 'Instruction' (and 'Input' if present).")
Loading Alpaca dataset from HuggingFace...

✓ Dataset loaded: 500 training examples

Here's what one example looks like:
  Instruction: Give three tips for staying healthy.
  Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and...

The model will learn to generate 'Output' given 'Instruction' (and 'Input' if present).
class InstructionDataset(Dataset):
    """
    Dataset for instruction fine-tuning.
    
    The magic here is in the LABEL MASKING. We only compute loss on the response
    tokens, not the instruction tokens. Why? Because we want the model to learn
    to GENERATE responses, not to predict the instruction itself.
    
    This is crucial. Without it, the model would waste capacity learning to 
    predict the instruction template, which is useless.
    """
    
    def __init__(self, data, tokenizer, max_length=256):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def format_example(self, example):
        """Format in Alpaca style (same as our generate function)."""
        if example['input']:
            # Some examples have an additional 'input' field for context
            prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
"""
        else:
            prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Response:
"""
        return prompt, example['output']
    
    def __getitem__(self, idx):
        example = self.data[idx]
        prompt, response = self.format_example(example)
        
        # Tokenize prompt and response separately (important!)
        prompt_tokens = self.tokenizer.encode(prompt, add_special_tokens=True)
        response_tokens = self.tokenizer.encode(response, add_special_tokens=False)
        
        # Combine: [prompt tokens] + [response tokens] + [EOS]
        input_ids = prompt_tokens + response_tokens + [self.tokenizer.eos_token_id]
        
        # Create labels: -100 for prompt (ignored in loss), actual tokens for response
        # This is the key to supervised fine-tuning!
        labels = [-100] * len(prompt_tokens) + response_tokens + [self.tokenizer.eos_token_id]
        
        # Truncate if too long
        if len(input_ids) > self.max_length:
            input_ids = input_ids[:self.max_length]
            labels = labels[:self.max_length]
        
        # Pad to max_length (makes batching easier)
        padding_length = self.max_length - len(input_ids)
        input_ids = input_ids + [self.tokenizer.pad_token_id] * padding_length
        labels = labels + [-100] * padding_length  # Ignore padding in loss
        attention_mask = [1] * (self.max_length - padding_length) + [0] * padding_length
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            'labels': torch.tensor(labels),
        }

# Create dataset and dataloader
train_dataset = InstructionDataset(raw_dataset, tokenizer, max_length=256)
train_loader = DataLoader(
    train_dataset, 
    batch_size=4,  # Small batch size to fit in memory
    shuffle=True   # Randomize order each epoch
)

print(f"✓ Created dataset with {len(train_dataset)} samples")
print(f"  Batches per epoch: {len(train_loader)}")
print(f"  Batch size: 4")
print(f"\nEach batch contains:")
print(f"  - input_ids: The tokenized text (prompt + response)")
print(f"  - attention_mask: Which tokens to pay attention to (1) vs ignore (0)")
print(f"  - labels: What to predict (-100 for prompt, actual tokens for response)")
✓ Created dataset with 500 samples
  Batches per epoch: 125
  Batch size: 4

Each batch contains:
  - input_ids: The tokenized text (prompt + response)
  - attention_mask: Which tokens to pay attention to (1) vs ignore (0)
  - labels: What to predict (-100 for prompt, actual tokens for response)
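
If you want to see that masking in action before training, here’s an optional sanity check (it just reuses train_dataset and tokenizer from above) that counts the masked positions in one example and decodes the part the model is actually trained to produce.

# Optional sanity check: confirm the label masking on one example.
sample = train_dataset[0]
labels = sample['labels']
supervised = labels != -100

print(f"Total positions:           {len(labels)}")
print(f"Masked (prompt + padding): {(~supervised).sum().item()}")
print(f"Supervised (response):     {supervised.sum().item()}")

# Decoding only the supervised tokens should read like the response text (plus EOS).
print("\nSupervised text:")
print(tokenizer.decode(labels[supervised].tolist()))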

Step 5: Set Up Training

Time to configure the training loop. A few key decisions here:

Learning rate (5e-5): Small enough to not destroy the pretrained weights, large enough to actually learn. This is a well-tested default for fine-tuning.

Warmup steps (50): Gradually increase the learning rate for the first 50 steps. Helps with training stability—like stretching before a run.

Gradient clipping (1.0): Prevents any single bad batch from causing chaos. If gradients get too large, we scale them down. Think of it as a safety rail.

One epoch: With 500 examples, one pass through the data is enough to see learning. More epochs would help, but we’re going for speed here.

# Training hyperparameters
learning_rate = 5e-5  # Standard for fine-tuning (0.00005)
num_epochs = 1        # One pass through the data
warmup_steps = 50     # Gradually increase LR for first 50 steps
max_grad_norm = 1.0   # Clip gradients to prevent instability

# Set up optimizer (AdamW is standard for transformers)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Learning rate scheduler (warmup then linear decay)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print("Training configuration:")
print(f"  Learning rate: {learning_rate}")
print(f"  Epochs: {num_epochs}")
print(f"  Steps per epoch: {len(train_loader)}")
print(f"  Total steps: {total_steps}")
print(f"  Warmup steps: {warmup_steps}")
print(f"  Gradient clipping: {max_grad_norm}")
print(f"\nEstimated time: ~10-15 minutes on GPU, ~2 hours on CPU")
Training configuration:
  Learning rate: 5e-05
  Epochs: 1
  Steps per epoch: 125
  Total steps: 125
  Warmup steps: 50
  Gradient clipping: 1.0

Estimated time: ~10-15 minutes on GPU, ~2 hours on CPU
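
Curious what “warmup then linear decay” actually looks like? Here’s an optional sketch that builds a throwaway optimizer and scheduler on a dummy parameter (so the real optimizer and scheduler above are untouched) and prints the learning rate at a few checkpoints.

# Optional: trace the warmup + linear decay shape without touching the real optimizer.
dummy = torch.nn.Parameter(torch.zeros(1))
probe_opt = torch.optim.AdamW([dummy], lr=learning_rate)
probe_sched = get_linear_schedule_with_warmup(
    probe_opt, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

for step in range(total_steps):
    if step in (0, 10, 25, 50, 75, 100, total_steps - 1):
        print(f"step {step:3d}: lr = {probe_sched.get_last_lr()[0]:.2e}")
    probe_opt.step()
    probe_sched.step()
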
# The actual training loop
print("\nStarting training...")
print("Watch the loss go down! (Lower = better)")
print("\nMetrics explained:")
print("  - loss: How wrong the model is (lower = better)")
print("  - avg_loss: Running average of loss")
print("  - ppl: Perplexity (e^loss), another way to measure quality")
print("\nGo grab a coffee. This'll take a few minutes...\n")

model.train()  # Put model in training mode

for epoch in range(num_epochs):
    total_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")
    
    for step, batch in enumerate(progress_bar):
        # Move batch to GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass: compute loss
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['labels']
        )
        loss = outputs.loss
        
        # Backward pass: compute gradients
        optimizer.zero_grad()  # Clear old gradients
        loss.backward()        # Compute new gradients
        
        # Clip gradients (prevent explosions)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        
        # Update weights
        optimizer.step()
        scheduler.step()
        
        # Track metrics
        total_loss += loss.item()
        avg_loss = total_loss / (step + 1)
        
        # Update progress bar
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'avg_loss': f'{avg_loss:.4f}',
            'ppl': f'{np.exp(avg_loss):.2f}'
        })
    
    print(f"\n✓ Epoch {epoch+1} complete!")
    print(f"  Final average loss: {avg_loss:.4f}")
    print(f"  Final perplexity: {np.exp(avg_loss):.2f}")

print("\n🎉 Training complete!")
print("\nThe model has now seen 500 examples of how to follow instructions.")
print("Let's see if it actually learned anything...")

Starting training...
Watch the loss go down! (Lower = better)

Metrics explained:
  - loss: How wrong the model is (lower = better)
  - avg_loss: Running average of loss
  - ppl: Perplexity (e^loss), another way to measure quality

Go grab a coffee. This'll take a few minutes...

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.

✓ Epoch 1 complete!
  Final average loss: 2.4032
  Final perplexity: 11.06

🎉 Training complete!

The model has now seen 500 examples of how to follow instructions.
Let's see if it actually learned anything...

Step 6: Test the Fine-Tuned Model

Moment of truth. Same questions as before, but now the model has been fine-tuned.

Will it actually answer the questions this time? Let’s find out.

(If the answers are still gibberish, don’t panic—check the training loss. If it went down, the model learned something. You might just need more training steps or better hyperparameters.)

# Switch to evaluation mode (disables dropout, etc.)
model.eval()

print("FINE-TUNED MODEL (after training):")
print("=" * 70)
print("Same questions as before. Notice the difference?\n")

for instruction in test_instructions:
    print(f"Q: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    print(f"A: {response}")
    print("-" * 70)

print("\nMuch better, right?")
print("\nThe model isn't perfect (it's only seen 500 examples), but it's actually")
print("trying to answer the questions now instead of just rambling.")
print("\nThat's the power of supervised fine-tuning!")
FINE-TUNED MODEL (after training):
======================================================================
Same questions as before. Notice the difference?

Q: What is the capital of France?
A: The capital of France is Paris, the capital of France.

France is a major European city, located in the French Alps, and the largest city in the world, with over 3,000,000 inhabitants. It is also home to the French government, the World Bank, and the Royal Society.

France is also home to the largest and most powerful military in the world, the French Air Force. It is also home to a large number of foreign embassies, as well as numerous
----------------------------------------------------------------------
Q: Write a haiku about programming.
A: Programming is a form of expression, the process by which a program can be executed. It is a means of expressing a specific thought, or feeling, and is an important tool for communication.

In the past, programs were written in many different ways, from simple programs written to complex code that was executed by hand. However, in the past, programming has become much more complex and complex, making it difficult for programmers to express their thoughts and feelings without breaking the flow of code.
----------------------------------------------------------------------
Q: Explain machine learning in one sentence.
A: Machine learning is a powerful, automated, and scalable technology that helps us improve our processes and learn more about our customers. It allows us to better understand their needs and behaviors, and can help us identify and predict patterns in data and make decisions about how to improve our services. Machine learning can be used for business and financial purposes, as well as for training and forecasting.
----------------------------------------------------------------------

Much better, right?

The model isn't perfect (it's only seen 500 examples), but it's actually
trying to answer the questions now instead of just rambling.

That's the power of supervised fine-tuning!
# Let's try some more examples to really see what it can do
additional_tests = [
    "List three benefits of exercise.",
    "What is Python used for?",
    "Explain what a neural network is in simple terms.",
    "Write a short poem about the ocean.",
]

print("\nLet's try some different questions:")
print("=" * 70)

for instruction in additional_tests:
    print(f"\nQ: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    print(f"A: {response}")
    print("-" * 70)

print("\n**Key observation:** The model has learned the *pattern* of instruction-following,")
print("not just memorized specific facts. It generalizes to new questions!")
print("\nThough sometimes it gets a bit... creative. (That's LLMs for you.)")

Let's try some different questions:
======================================================================

Q: List three benefits of exercise.
A: 1. It reduces stress and anxiety. Exercise reduces stress and anxiety.
2. It increases physical activity. Exercise increases physical activity.
3. It reduces stress and anxiety. Exercise reduces stress and anxiety.
----------------------------------------------------------------------

Q: What is Python used for?
A: Python is a powerful and powerful programming language that makes it possible to create, manage, and manage complex systems, including databases, databases, and applications. Python is widely used by businesses, governments, and other organizations to manage and manage their data and processes. Python is also used for creating and managing large databases, applications, and applications that provide web and mobile applications.
----------------------------------------------------------------------

Q: Explain what a neural network is in simple terms.
A: A neural network is a collection of neural networks that encode, process, and process information, typically as part of a process called learning. Neural networks are typically thought of as the "deep-learning" machine learning system that is used to learn from known information. Neural networks are used to perform tasks, such as image recognition, image manipulation, and so on, but they are also used to process information such as data and other information. Neural networks are often referred to as machine learning systems because they have
----------------------------------------------------------------------

Q: Write a short poem about the ocean.
A: This poem is about the ocean, or the beauty of the sea.
----------------------------------------------------------------------

**Key observation:** The model has learned the *pattern* of instruction-following,
not just memorized specific facts. It generalizes to new questions!

Though sometimes it gets a bit... creative. (That's LLMs for you.)

Step 7: Quantitative Evaluation

Okay, so the model seems better based on the examples. But how do we measure that objectively?

Two key metrics:

  1. Perplexity: How “surprised” the model is by the training data. Lower = better. It’s basically e^(loss). Think of it as “confidence”—how well does the model predict what comes next?

  2. Diversity: Do all the responses sound the same, or does the model have variety? We measure this with distinct-1 and distinct-2 (percentage of unique words and word pairs). Too low = mode collapse (model stuck in a rut).

Let’s compute both.

def compute_perplexity(model, dataloader, device):
    """
    Compute perplexity on a dataset.
    
    Perplexity = e^(average loss)
    
    Think of it as: "On average, how many equally-likely tokens could come next?"
    Lower is better. Random guessing on a 50k vocab = perplexity of 50,000.
    A well-trained model on instructions = perplexity of 5-10.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():  # No gradients needed for evaluation
        for batch in tqdm(dataloader, desc="Computing perplexity"):
            batch = {k: v.to(device) for k, v in batch.items()}
            
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels']
            )
            
            # Count non-masked tokens (only response tokens, not prompt)
            num_tokens = (batch['labels'] != -100).sum().item()
            total_loss += outputs.loss.item() * num_tokens
            total_tokens += num_tokens
    
    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    
    return perplexity, avg_loss

# Compute perplexity on the training set
# (In practice, you'd use a held-out validation set, but we're keeping it simple)
print("\nEvaluating model quality...")
perplexity, loss = compute_perplexity(model, train_loader, device)

print(f"\n✓ Evaluation complete!")
print(f"  Loss: {loss:.4f}")
print(f"  Perplexity: {perplexity:.2f}")
print(f"\nInterpretation:")
print(f"  - Perplexity < 10: Excellent")
print(f"  - Perplexity 10-20: Good")
print(f"  - Perplexity 20-50: Okay")
print(f"  - Perplexity > 50: Needs more training")
print(f"\nYour model: ", end="")
if perplexity < 10:
    print("Excellent! 🎉")
elif perplexity < 20:
    print("Good! 👍")
elif perplexity < 50:
    print("Okay. More training would help.")
else:
    print("Needs more training. Try more epochs or more data.")

Evaluating model quality...

✓ Evaluation complete!
  Loss: 1.9571
  Perplexity: 7.08

Interpretation:
  - Perplexity < 10: Excellent
  - Perplexity 10-20: Good
  - Perplexity 20-50: Okay
  - Perplexity > 50: Needs more training

Your model: Excellent! 🎉
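
As a quick sanity check that perplexity really is just e^(loss), you can reproduce the number above by hand (values from this run; yours will differ slightly):

# Perplexity is exp(loss): reproducing the evaluation number above by hand.
import numpy as np

eval_loss = 1.9571                                   # loss printed by the evaluation above
print(f"exp({eval_loss}) = {np.exp(eval_loss):.2f}")  # ≈ 7.08
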
def compute_diversity(responses):
    """
    Compute diversity metrics for generated text.
    
    Distinct-1: Percentage of unique words (unigrams)
    Distinct-2: Percentage of unique word pairs (bigrams)
    
    Why does this matter? If the model always says "the the the the" you'd have
    low diversity even if perplexity looks okay. Diversity catches mode collapse.
    """
    all_unigrams = []
    all_bigrams = []
    
    for response in responses:
        tokens = response.lower().split()
        all_unigrams.extend(tokens)
        # Create pairs of consecutive words
        all_bigrams.extend(zip(tokens[:-1], tokens[1:]))
    
    # What fraction of words/pairs are unique?
    distinct_1 = len(set(all_unigrams)) / len(all_unigrams) if all_unigrams else 0
    distinct_2 = len(set(all_bigrams)) / len(all_bigrams) if all_bigrams else 0
    
    return distinct_1, distinct_2

# Generate a bunch of responses for diversity analysis
diversity_prompts = [
    "Tell me about machine learning.",
    "Explain artificial intelligence.",
    "What is deep learning?",
    "Describe natural language processing.",
    "Explain what data science is.",
]

print("\nGenerating responses for diversity analysis...")
responses = [generate_response(model, tokenizer, p) for p in diversity_prompts]
d1, d2 = compute_diversity(responses)

print(f"\n✓ Diversity analysis complete!")
print(f"  Distinct-1 (unique words): {d1:.2%}")
print(f"  Distinct-2 (unique word pairs): {d2:.2%}")
print(f"\nInterpretation:")
print(f"  - Distinct-1 > 40%: Good variety")
print(f"  - Distinct-1 20-40%: Okay")
print(f"  - Distinct-1 < 20%: Mode collapse (model stuck repeating itself)")
print(f"\nYour model: ", end="")
if d1 > 0.4:
    print("Good variety! 🎉")
elif d1 > 0.2:
    print("Okay diversity.")
else:
    print("Warning: Low diversity. Try different sampling parameters.")

Generating responses for diversity analysis...

✓ Diversity analysis complete!
  Distinct-1 (unique words): 30.77%
  Distinct-2 (unique word pairs): 59.43%

Interpretation:
  - Distinct-1 > 40%: Good variety
  - Distinct-1 20-40%: Okay
  - Distinct-1 < 20%: Mode collapse (model stuck repeating itself)

Your model: Okay diversity.
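
To see what distinct-1 actually catches, here’s an optional toy check that runs compute_diversity on an obviously repetitive response versus a varied one (made-up sentences, just for illustration):

# Toy check: repetition drives distinct-1 down, variety drives it up.
collapsed = ["the model is the model is the model is the model is the model is the model"]
varied = ["gradient descent updates weights to reduce the training loss over time"]

for name, responses in [("collapsed", collapsed), ("varied", varied)]:
    d1, d2 = compute_diversity(responses)
    print(f"{name:9s}: distinct-1 = {d1:.2%}, distinct-2 = {d2:.2%}")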

Step 8: Save Your Model

You just spent 15 minutes training this thing. Let’s not lose it!

Saving is simple—we just dump the model weights and tokenizer config to disk. Then you can reload them later (or share them with others).

# Save model and tokenizer to disk
save_path = "./my_finetuned_model"

print(f"Saving model to {save_path}...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print("✓ Model saved!")

# Show what got saved
import os
print(f"\nSaved files:")
total_size = 0
for f in sorted(os.listdir(save_path)):
    size = os.path.getsize(os.path.join(save_path, f)) / 1e6
    total_size += size
    print(f"  {f}: {size:.1f} MB")

print(f"\nTotal size: {total_size:.1f} MB")
print(f"\nYou can now load this model anytime with:")
print(f"  model = AutoModelForCausalLM.from_pretrained('{save_path}')")
print(f"  tokenizer = AutoTokenizer.from_pretrained('{save_path}')")
Saving model to ./my_finetuned_model...
✓ Model saved!

Saved files:
  config.json: 0.0 MB
  generation_config.json: 0.0 MB
  merges.txt: 0.5 MB
  model.safetensors: 497.8 MB
  special_tokens_map.json: 0.0 MB
  tokenizer.json: 3.6 MB
  tokenizer_config.json: 0.0 MB
  vocab.json: 0.8 MB

Total size: 502.6 MB

You can now load this model anytime with:
  model = AutoModelForCausalLM.from_pretrained('./my_finetuned_model')
  tokenizer = AutoTokenizer.from_pretrained('./my_finetuned_model')
# Let's verify the saved model actually works
print("Testing saved model (to make sure saving worked)...")

loaded_model = AutoModelForCausalLM.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)
loaded_model = loaded_model.to(device)
loaded_model.eval()

test_instruction = "What is the meaning of life?"
response = generate_response(loaded_model, loaded_tokenizer, test_instruction)

print(f"\n✓ Loaded model from disk successfully!")
print(f"\nTest question: {test_instruction}")
print(f"Answer: {response}")
print(f"\nLooks good! Your model is saved and ready to use.")
Testing saved model (to make sure saving worked)...

✓ Loaded model from disk successfully!

Test question: What is the meaning of life?
Answer: Life is a complex and interconnected system of life, and we live by our own choices and desires. We make our own decisions about our lives, and we learn from our experiences to make better choices.

Looks good! Your model is saved and ready to use.

You Did It! 🎉

Seriously. You just fine-tuned a language model, training loop and all.

What you accomplished:

  1. Loaded a base GPT-2 model (terrible at following instructions)

  2. Prepared training data with proper label masking

  3. Trained the model using supervised fine-tuning

  4. Watched it go from gibberish to actual answers

  5. Evaluated it with perplexity and diversity metrics

  6. Saved it for later use

This is the same basic process used to create ChatGPT, Claude, and every other instruction-following LLM. The production versions use more data, bigger models, LoRA for efficiency, and RLHF for alignment—but the core idea is exactly what you just did.

What to Try Next

Now that you’ve got the basics down:

  1. Train longer - Try 3-5 epochs or use the full Alpaca dataset (52k examples)

  2. Use LoRA - Fine-tune only a small number of parameters (way more efficient; see the sketch after this list)

  3. Try DPO - Align the model with human preferences using the reward/preference notebooks

  4. Bigger models - GPT-2 Medium/Large, or even Llama if you’ve got the VRAM

  5. Your own data - Got a specific task? Create a dataset and fine-tune for it!
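
For the LoRA route, here’s a minimal sketch using the peft library (an assumption: it requires pip install peft; target_modules=["c_attn"] targets GPT-2’s fused attention projection in the Hugging Face implementation):

# Minimal LoRA sketch (assumes the peft library: pip install peft).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the 124M total

# From here, the training loop above works unchanged; just optimize
# peft_model.parameters() instead of model.parameters().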

Common Issues & Tips

Loss not going down?

  • Check your learning rate (try 1e-5 to 1e-4)

  • Make sure labels are masked properly (prompt tokens should be -100)

  • Try more epochs or more data

Model output is repetitive?

  • Adjust temperature and top_p during generation

  • Check diversity metrics (distinct-1/distinct-2)

  • Might need more varied training data

Out of memory?

  • Reduce batch_size (try 2 or 1)

  • Reduce max_length (try 128 or 64)

  • Use gradient checkpointing (more compute, less memory; see the snippet after this list)

  • Consider LoRA (way less memory)
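
Gradient checkpointing is a one-line switch on Hugging Face models; it recomputes activations during the backward pass instead of storing them, trading extra compute for a smaller memory footprint:

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()
# (model.gradient_checkpointing_disable() turns it back off.)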

Answers are still bad?

  • Train on more data (500 examples is pretty small)

  • Train for more epochs

  • Check that loss actually decreased during training

Final Thoughts

The model you just trained isn’t perfect. It might hallucinate, give weird answers, or ramble sometimes. That’s normal! You trained it on 500 examples for 10 minutes.

What matters is that you understand the process. You know how to:

  • Load and prepare data

  • Set up a training loop

  • Evaluate results

  • Debug when things go wrong

That’s the hard part. Scaling up to production is just... more of the same, but bigger.

Go build something cool. 🚀