Try It Yourself!

This is it. The capstone. All those notebooks you just went through? Time to put them into practice.

We’re going to fine-tune a language model end to end, writing the training loop ourselves. Not a toy example. A real model that actually learns to follow instructions. By the end of this, you’ll have your own fine-tuned GPT-2 sitting on your hard drive, ready to answer questions.

Sound good? Let’s go.

What You’re About to Build

We’re taking a base GPT-2 model—one that’s pretty good at predicting the next word but terrible at following instructions—and teaching it to be helpful.

What you’ll learn:

  • How to actually train a model (not just read about it)

  • Why supervised fine-tuning works

  • How to evaluate if your model is any good

  • What to watch out for when things go wrong

Time investment: 30-60 minutes with a GPU. Running on CPU? Training alone can stretch to a couple of hours, so grab a big coffee or shrink num_samples in Step 4.

Important note: This is a simplified version for learning. Production fine-tuning would use LoRA (which we covered), more data, and better hyperparameter tuning. But the core ideas? Exactly the same.

Step 1: Verify Your Environment

First things first—let’s make sure you’ve got PyTorch installed and that it can see your GPU (if you have one).

Why check this first? Because finding out 20 minutes into training that CUDA isn’t working is... well, let’s just say it’s a learning experience you only need once.

# Check what we're working with
import torch

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"GPU: {torch.cuda.get_device_name(0)}")
    print("Great! Training will be fast.")
else:
    print("Device: CPU")
    print("No GPU found. Training will work but be slower (10-20x).")
    print("Consider reducing num_samples in the data loading step.")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
GPU: Radeon RX 7900 XTX
Great! Training will be fast.
# Import everything we need
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
from transformers import AutoModelForCausalLM, AutoTokenizer, get_linear_schedule_with_warmup
from datasets import load_dataset
from tqdm.auto import tqdm
import numpy as np

print("All imports successful!")
print("\nIf you got any errors above, install missing packages with:")
print("  pip install transformers datasets torch tqdm")
All imports successful!

If you got any errors above, install missing packages with:
  pip install transformers datasets torch tqdm

Step 2: Load the Base Model

We’re using GPT-2 (the small version, 124M parameters). Why GPT-2 and not something bigger?

  1. It’s fast to train - You can actually finish this notebook today

  2. It’s well-understood - Lots of documentation if things break

  3. It’s big enough to learn - 124M parameters is plenty for instruction following

Think of this as the “before” photo. The model right now is decent at continuing text but hopeless at following instructions. We’re about to fix that.

# Load GPT-2 (small, 124M parameters)
model_name = "gpt2"

print(f"Loading {model_name}...")
print("This downloads ~500MB the first time, then caches locally.\n")

model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# GPT-2 doesn't have a padding token by default, so we add one
# We just reuse the EOS token—common practice and works fine
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Move to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

# How big is this thing?
total_params = sum(p.numel() for p in model.parameters())
print(f"✓ Model loaded!")
print(f"  Parameters: {total_params:,}")
print(f"  Device: {device}")
print(f"  Memory: ~{total_params * 4 / 1e9:.1f} GB (in float32)")
Loading gpt2...
This downloads ~500MB the first time, then caches locally.

✓ Model loaded!
  Parameters: 124,439,808
  Device: cuda
  Memory: ~0.5 GB (in float32)

Step 3: Test the Base Model (Before Training)

Okay, moment of truth. Let’s see what the base model does when we ask it questions.

It’s going to be bad. Really bad. That’s the point.

Base GPT-2 was trained to predict the next token in internet text. It was never taught to answer questions. So when you ask it “What is the capital of France?” it just... continues the pattern of text it sees. Sometimes that works by accident. Usually it doesn’t.

Watch:

def generate_response(model, tokenizer, instruction, max_new_tokens=100):
    """
    Generate a response to an instruction.
    
    We format the prompt in "Alpaca style"—a specific template that works well
    for instruction following. You'll see this same format in the training data.
    """
    prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{instruction}

### Response:
"""
    
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():  # Don't track gradients for inference
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=0.7,  # Sampling randomness (lower = more focused, higher = more random)
            top_p=0.9,  # Nucleus sampling
            do_sample=True,  # Use sampling instead of greedy
            pad_token_id=tokenizer.eos_token_id
        )
    
    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
    # Extract just the response part (after "### Response:")
    response = full_text.split("### Response:\n")[-1].strip()
    
    return response

# Test base model on a few questions
test_instructions = [
    "What is the capital of France?",
    "Write a haiku about programming.",
    "Explain machine learning in one sentence.",
]

print("BASE MODEL (before fine-tuning):")
print("=" * 70)
print("Watch how it fails to actually answer the questions...\n")

for instruction in test_instructions:
    print(f"Q: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    # Truncate long responses
    if len(response) > 200:
        response = response[:200] + "..."
    print(f"A: {response}")
    print("-" * 70)
BASE MODEL (before fine-tuning):
======================================================================
Watch how it fails to actually answer the questions...

Q: What is the capital of France?
A: The response type is
----------------------------------------------------------------------
Q: Write a haiku about programming.
A: Write a response that includes some code.

### Example:

#include <stdio.h> #include <sys/types.h> int main(int argc, char **argv[]) { int i, j; char *data = argc->get_data(); for (i = 0; i < argv[1];...
----------------------------------------------------------------------
Q: Explain machine learning in one sentence.
A: The following example shows how to generate the response with a single line of code.

#!/usr/bin/python # python.py import requests import requests.py import requests.py import requests.py.model impor...
----------------------------------------------------------------------

See what I mean? The model just rambles. It’s not trying to answer the question—it’s trying to continue text that looks like the prompt.

That’s because base GPT-2 was trained on raw internet text with a simple objective: predict the next word. No one ever taught it that text formatted as “Instruction:” and “Response:” means it should actually answer the question.

That’s what we’re about to fix with supervised fine-tuning.
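
To make the fix concrete: supervised fine-tuning keeps the exact same next-token objective; what changes is the data (instruction-response pairs) and which tokens count toward the loss. Here’s a tiny toy sketch (invented logits and labels, not real model output) of the masking trick you’ll meet in Step 4: PyTorch’s cross_entropy simply skips any position labeled -100.

# Toy sketch: SFT loss is ordinary next-token cross-entropy,
# except positions labeled -100 are ignored entirely.
import torch
import torch.nn.functional as F

vocab_size = 10
logits = torch.randn(6, vocab_size)                  # 6 positions of fake "model outputs"
labels = torch.tensor([-100, -100, -100, 4, 7, 2])   # first 3 = prompt tokens (masked out)

loss = F.cross_entropy(logits, labels, ignore_index=-100)
num_supervised = (labels != -100).sum().item()
print(f"Loss computed over {num_supervised} response tokens: {loss.item():.4f}")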

Step 4: Prepare Training Data

We’re using the Alpaca dataset—52,000 instruction-response pairs created by Stanford. Things like:

  • Instruction: “Give three tips for staying healthy”

  • Response: “1. Eat a balanced diet. 2. Exercise regularly. 3. Get enough sleep.”

Perfect for teaching a model to follow instructions.

Key insight: We’ll use a small subset (500 examples) for speed. This is enough to see the model learn! For production use, you’d train on the full dataset.

# Load the Alpaca dataset
print("Loading Alpaca dataset from HuggingFace...")
raw_dataset = load_dataset("yahma/alpaca-cleaned", split="train")

# Use a small subset for quick training
# Feel free to adjust this:
#   - 100 samples: Very fast, model learns a bit (2-3 min on GPU)
#   - 500 samples: Good learning, reasonable time (10-15 min on GPU)
#   - 5000+ samples: Better results, longer training (1+ hour)
num_samples = 500

raw_dataset = raw_dataset.select(range(num_samples))

print(f"\n✓ Dataset loaded: {len(raw_dataset)} training examples")
print(f"\nHere's what one example looks like:")
print(f"  Instruction: {raw_dataset[0]['instruction']}")
if raw_dataset[0]['input']:
    print(f"  Input: {raw_dataset[0]['input']}")
print(f"  Output: {raw_dataset[0]['output'][:100]}...")
print(f"\nThe model will learn to generate 'Output' given 'Instruction' (and 'Input' if present).")
Loading Alpaca dataset from HuggingFace...

✓ Dataset loaded: 500 training examples

Here's what one example looks like:
  Instruction: Give three tips for staying healthy.
  Output: 1. Eat a balanced and nutritious diet: Make sure your meals are inclusive of a variety of fruits and...

The model will learn to generate 'Output' given 'Instruction' (and 'Input' if present).
class InstructionDataset(Dataset):
    """
    Dataset for instruction fine-tuning.
    
    The magic here is in the LABEL MASKING. We only compute loss on the response
    tokens, not the instruction tokens. Why? Because we want the model to learn
    to GENERATE responses, not to predict the instruction itself.
    
    This is crucial. Without it, the model would waste capacity learning to 
    predict the instruction template, which is useless.
    """
    
    def __init__(self, data, tokenizer, max_length=256):
        self.data = data
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        return len(self.data)
    
    def format_example(self, example):
        """Format in Alpaca style (same as our generate function)."""
        if example['input']:
            # Some examples have an additional 'input' field for context
            prompt = f"""Below is an instruction that describes a task, paired with an input that provides further context. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Input:
{example['input']}

### Response:
"""
        else:
            prompt = f"""Below is an instruction that describes a task. Write a response that appropriately completes the request.

### Instruction:
{example['instruction']}

### Response:
"""
        return prompt, example['output']
    
    def __getitem__(self, idx):
        example = self.data[idx]
        prompt, response = self.format_example(example)
        
        # Tokenize prompt and response separately (important!)
        prompt_tokens = self.tokenizer.encode(prompt, add_special_tokens=True)
        response_tokens = self.tokenizer.encode(response, add_special_tokens=False)
        
        # Combine: [prompt tokens] + [response tokens] + [EOS]
        input_ids = prompt_tokens + response_tokens + [self.tokenizer.eos_token_id]
        
        # Create labels: -100 for prompt (ignored in loss), actual tokens for response
        # This is the key to supervised fine-tuning!
        labels = [-100] * len(prompt_tokens) + response_tokens + [self.tokenizer.eos_token_id]
        
        # Truncate if too long
        if len(input_ids) > self.max_length:
            input_ids = input_ids[:self.max_length]
            labels = labels[:self.max_length]
        
        # Pad to max_length (makes batching easier)
        padding_length = self.max_length - len(input_ids)
        input_ids = input_ids + [self.tokenizer.pad_token_id] * padding_length
        labels = labels + [-100] * padding_length  # Ignore padding in loss
        attention_mask = [1] * (self.max_length - padding_length) + [0] * padding_length
        
        return {
            'input_ids': torch.tensor(input_ids),
            'attention_mask': torch.tensor(attention_mask),
            'labels': torch.tensor(labels),
        }

# Create dataset and dataloader
train_dataset = InstructionDataset(raw_dataset, tokenizer, max_length=256)
train_loader = DataLoader(
    train_dataset, 
    batch_size=4,  # Small batch size to fit in memory
    shuffle=True   # Randomize order each epoch
)

print(f"✓ Created dataset with {len(train_dataset)} samples")
print(f"  Batches per epoch: {len(train_loader)}")
print(f"  Batch size: 4")
print(f"\nEach batch contains:")
print(f"  - input_ids: The tokenized text (prompt + response)")
print(f"  - attention_mask: Which tokens to pay attention to (1) vs ignore (0)")
print(f"  - labels: What to predict (-100 for prompt, actual tokens for response)")
✓ Created dataset with 500 samples
  Batches per epoch: 125
  Batch size: 4

Each batch contains:
  - input_ids: The tokenized text (prompt + response)
  - attention_mask: Which tokens to pay attention to (1) vs ignore (0)
  - labels: What to predict (-100 for prompt, actual tokens for response)
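
If you want to see that masking in action before training, here’s an optional sanity check (it just reuses train_dataset and tokenizer from above) that counts the masked positions in one example and decodes the part the model is actually trained to produce.

# Optional sanity check: confirm the label masking on one example.
sample = train_dataset[0]
labels = sample['labels']
supervised = labels != -100

print(f"Total positions:           {len(labels)}")
print(f"Masked (prompt + padding): {(~supervised).sum().item()}")
print(f"Supervised (response):     {supervised.sum().item()}")

# Decoding only the supervised tokens should read like the response text (plus EOS).
print("\nSupervised text:")
print(tokenizer.decode(labels[supervised].tolist()))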

Step 5: Set Up Training

Time to configure the training loop. A few key decisions here:

Learning rate (5e-5): Small enough to not destroy the pretrained weights, large enough to actually learn. This is a well-tested default for fine-tuning.

Warmup steps (50): Gradually increase the learning rate for the first 50 steps. Helps with training stability—like stretching before a run.

Gradient clipping (1.0): Prevents any single bad batch from causing chaos. If gradients get too large, we scale them down. Think of it as a safety rail.

One epoch: With 500 examples, one pass through the data is enough to see learning. More epochs would help, but we’re going for speed here.

# Training hyperparameters
learning_rate = 5e-5  # Standard for fine-tuning (0.00005)
num_epochs = 1        # One pass through the data
warmup_steps = 50     # Gradually increase LR for first 50 steps
max_grad_norm = 1.0   # Clip gradients to prevent instability

# Set up optimizer (AdamW is standard for transformers)
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Learning rate scheduler (warmup then linear decay)
total_steps = len(train_loader) * num_epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=warmup_steps,
    num_training_steps=total_steps
)

print("Training configuration:")
print(f"  Learning rate: {learning_rate}")
print(f"  Epochs: {num_epochs}")
print(f"  Steps per epoch: {len(train_loader)}")
print(f"  Total steps: {total_steps}")
print(f"  Warmup steps: {warmup_steps}")
print(f"  Gradient clipping: {max_grad_norm}")
print(f"\nEstimated time: ~10-15 minutes on GPU, ~2 hours on CPU")
Training configuration:
  Learning rate: 5e-05
  Epochs: 1
  Steps per epoch: 125
  Total steps: 125
  Warmup steps: 50
  Gradient clipping: 1.0

Estimated time: ~10-15 minutes on GPU, ~2 hours on CPU
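
Curious what “warmup then linear decay” actually looks like? Here’s an optional sketch that builds a throwaway optimizer and scheduler on a dummy parameter (so the real optimizer and scheduler above are untouched) and prints the learning rate at a few checkpoints.

# Optional: trace the warmup + linear decay shape without touching the real optimizer.
dummy = torch.nn.Parameter(torch.zeros(1))
probe_opt = torch.optim.AdamW([dummy], lr=learning_rate)
probe_sched = get_linear_schedule_with_warmup(
    probe_opt, num_warmup_steps=warmup_steps, num_training_steps=total_steps
)

for step in range(total_steps):
    if step in (0, 10, 25, 50, 75, 100, total_steps - 1):
        print(f"step {step:3d}: lr = {probe_sched.get_last_lr()[0]:.2e}")
    probe_opt.step()
    probe_sched.step()
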
# The actual training loop
print("\nStarting training...")
print("Watch the loss go down! (Lower = better)")
print("\nMetrics explained:")
print("  - loss: How wrong the model is (lower = better)")
print("  - avg_loss: Running average of loss")
print("  - ppl: Perplexity (e^loss), another way to measure quality")
print("\nGo grab a coffee. This'll take a few minutes...\n")

model.train()  # Put model in training mode

for epoch in range(num_epochs):
    total_loss = 0
    progress_bar = tqdm(train_loader, desc=f"Epoch {epoch+1}/{num_epochs}")
    
    for step, batch in enumerate(progress_bar):
        # Move batch to GPU
        batch = {k: v.to(device) for k, v in batch.items()}
        
        # Forward pass: compute loss
        outputs = model(
            input_ids=batch['input_ids'],
            attention_mask=batch['attention_mask'],
            labels=batch['labels']
        )
        loss = outputs.loss
        
        # Backward pass: compute gradients
        optimizer.zero_grad()  # Clear old gradients
        loss.backward()        # Compute new gradients
        
        # Clip gradients (prevent explosions)
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
        
        # Update weights
        optimizer.step()
        scheduler.step()
        
        # Track metrics
        total_loss += loss.item()
        avg_loss = total_loss / (step + 1)
        
        # Update progress bar
        progress_bar.set_postfix({
            'loss': f'{loss.item():.4f}',
            'avg_loss': f'{avg_loss:.4f}',
            'ppl': f'{np.exp(avg_loss):.2f}'
        })
    
    print(f"\n✓ Epoch {epoch+1} complete!")
    print(f"  Final average loss: {avg_loss:.4f}")
    print(f"  Final perplexity: {np.exp(avg_loss):.2f}")

print("\n🎉 Training complete!")
print("\nThe model has now seen 500 examples of how to follow instructions.")
print("Let's see if it actually learned anything...")

Starting training...
Watch the loss go down! (Lower = better)

Metrics explained:
  - loss: How wrong the model is (lower = better)
  - avg_loss: Running average of loss
  - ppl: Perplexity (e^loss), another way to measure quality

Go grab a coffee. This'll take a few minutes...

`loss_type=None` was set in the config but it is unrecognized. Using the default loss: `ForCausalLMLoss`.

✓ Epoch 1 complete!
  Final average loss: 2.4032
  Final perplexity: 11.06

🎉 Training complete!

The model has now seen 500 examples of how to follow instructions.
Let's see if it actually learned anything...

Step 6: Test the Fine-Tuned Model

Moment of truth. Same questions as before, but now the model has been fine-tuned.

Will it actually answer the questions this time? Let’s find out.

(If the answers are still gibberish, don’t panic—check the training loss. If it went down, the model learned something. You might just need more training steps or better hyperparameters.)

# Switch to evaluation mode (disables dropout, etc.)
model.eval()

print("FINE-TUNED MODEL (after training):")
print("=" * 70)
print("Same questions as before. Notice the difference?\n")

for instruction in test_instructions:
    print(f"Q: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    print(f"A: {response}")
    print("-" * 70)

print("\nMuch better, right?")
print("\nThe model isn't perfect (it's only seen 500 examples), but it's actually")
print("trying to answer the questions now instead of just rambling.")
print("\nThat's the power of supervised fine-tuning!")
FINE-TUNED MODEL (after training):
======================================================================
Same questions as before. Notice the difference?

Q: What is the capital of France?
A: The capital of France is Paris, the capital of France.

France is a major European city, located in the French Alps, and the largest city in the world, with over 3,000,000 inhabitants. It is also home to the French government, the World Bank, and the Royal Society.

France is also home to the largest and most powerful military in the world, the French Air Force. It is also home to a large number of foreign embassies, as well as numerous
----------------------------------------------------------------------
Q: Write a haiku about programming.
A: Programming is a form of expression, the process by which a program can be executed. It is a means of expressing a specific thought, or feeling, and is an important tool for communication.

In the past, programs were written in many different ways, from simple programs written to complex code that was executed by hand. However, in the past, programming has become much more complex and complex, making it difficult for programmers to express their thoughts and feelings without breaking the flow of code.
----------------------------------------------------------------------
Q: Explain machine learning in one sentence.
A: Machine learning is a powerful, automated, and scalable technology that helps us improve our processes and learn more about our customers. It allows us to better understand their needs and behaviors, and can help us identify and predict patterns in data and make decisions about how to improve our services. Machine learning can be used for business and financial purposes, as well as for training and forecasting.
----------------------------------------------------------------------

Much better, right?

The model isn't perfect (it's only seen 500 examples), but it's actually
trying to answer the questions now instead of just rambling.

That's the power of supervised fine-tuning!
# Let's try some more examples to really see what it can do
additional_tests = [
    "List three benefits of exercise.",
    "What is Python used for?",
    "Explain what a neural network is in simple terms.",
    "Write a short poem about the ocean.",
]

print("\nLet's try some different questions:")
print("=" * 70)

for instruction in additional_tests:
    print(f"\nQ: {instruction}")
    response = generate_response(model, tokenizer, instruction)
    print(f"A: {response}")
    print("-" * 70)

print("\n**Key observation:** The model has learned the *pattern* of instruction-following,")
print("not just memorized specific facts. It generalizes to new questions!")
print("\nThough sometimes it gets a bit... creative. (That's LLMs for you.)")

Let's try some different questions:
======================================================================

Q: List three benefits of exercise.
A: 1. It reduces stress and anxiety. Exercise reduces stress and anxiety.
2. It increases physical activity. Exercise increases physical activity.
3. It reduces stress and anxiety. Exercise reduces stress and anxiety.
----------------------------------------------------------------------

Q: What is Python used for?
A: Python is a powerful and powerful programming language that makes it possible to create, manage, and manage complex systems, including databases, databases, and applications. Python is widely used by businesses, governments, and other organizations to manage and manage their data and processes. Python is also used for creating and managing large databases, applications, and applications that provide web and mobile applications.
----------------------------------------------------------------------

Q: Explain what a neural network is in simple terms.
A: A neural network is a collection of neural networks that encode, process, and process information, typically as part of a process called learning. Neural networks are typically thought of as the "deep-learning" machine learning system that is used to learn from known information. Neural networks are used to perform tasks, such as image recognition, image manipulation, and so on, but they are also used to process information such as data and other information. Neural networks are often referred to as machine learning systems because they have
----------------------------------------------------------------------

Q: Write a short poem about the ocean.
A: This poem is about the ocean, or the beauty of the sea.
----------------------------------------------------------------------

**Key observation:** The model has learned the *pattern* of instruction-following,
not just memorized specific facts. It generalizes to new questions!

Though sometimes it gets a bit... creative. (That's LLMs for you.)

Step 7: Quantitative Evaluation

Okay, so the model seems better based on the examples. But how do we measure that objectively?

Two key metrics:

  1. Perplexity: How “surprised” the model is by the training data. Lower = better. It’s basically e^(loss). Think of it as “confidence”—how well does the model predict what comes next?

  2. Diversity: Do all the responses sound the same, or does the model have variety? We measure this with distinct-1 and distinct-2 (percentage of unique words and word pairs). Too low = mode collapse (model stuck in a rut).

Let’s compute both.

def compute_perplexity(model, dataloader, device):
    """
    Compute perplexity on a dataset.
    
    Perplexity = e^(average loss)
    
    Think of it as: "On average, how many equally-likely tokens could come next?"
    Lower is better. Random guessing on a 50k vocab = perplexity of 50,000.
    A well-trained model on instructions = perplexity of 5-10.
    """
    model.eval()
    total_loss = 0
    total_tokens = 0
    
    with torch.no_grad():  # No gradients needed for evaluation
        for batch in tqdm(dataloader, desc="Computing perplexity"):
            batch = {k: v.to(device) for k, v in batch.items()}
            
            outputs = model(
                input_ids=batch['input_ids'],
                attention_mask=batch['attention_mask'],
                labels=batch['labels']
            )
            
            # Count non-masked tokens (only response tokens, not prompt)
            num_tokens = (batch['labels'] != -100).sum().item()
            total_loss += outputs.loss.item() * num_tokens
            total_tokens += num_tokens
    
    avg_loss = total_loss / total_tokens
    perplexity = np.exp(avg_loss)
    
    return perplexity, avg_loss

# Compute perplexity on the training set
# (In practice, you'd use a held-out validation set, but we're keeping it simple)
print("\nEvaluating model quality...")
perplexity, loss = compute_perplexity(model, train_loader, device)

print(f"\n✓ Evaluation complete!")
print(f"  Loss: {loss:.4f}")
print(f"  Perplexity: {perplexity:.2f}")
print(f"\nInterpretation:")
print(f"  - Perplexity < 10: Excellent")
print(f"  - Perplexity 10-20: Good")
print(f"  - Perplexity 20-50: Okay")
print(f"  - Perplexity > 50: Needs more training")
print(f"\nYour model: ", end="")
if perplexity < 10:
    print("Excellent! 🎉")
elif perplexity < 20:
    print("Good! 👍")
elif perplexity < 50:
    print("Okay. More training would help.")
else:
    print("Needs more training. Try more epochs or more data.")

Evaluating model quality...

✓ Evaluation complete!
  Loss: 1.9571
  Perplexity: 7.08

Interpretation:
  - Perplexity < 10: Excellent
  - Perplexity 10-20: Good
  - Perplexity 20-50: Okay
  - Perplexity > 50: Needs more training

Your model: Excellent! 🎉
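
As a quick sanity check that perplexity really is just e^(loss), you can reproduce the number above by hand (values from this run; yours will differ slightly):

# Perplexity is exp(loss): reproducing the evaluation number above by hand.
import numpy as np

eval_loss = 1.9571                                   # loss printed by the evaluation above
print(f"exp({eval_loss}) = {np.exp(eval_loss):.2f}")  # ≈ 7.08
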
def compute_diversity(responses):
    """
    Compute diversity metrics for generated text.
    
    Distinct-1: Percentage of unique words (unigrams)
    Distinct-2: Percentage of unique word pairs (bigrams)
    
    Why does this matter? If the model always says "the the the the" you'd have
    low diversity even if perplexity looks okay. Diversity catches mode collapse.
    """
    all_unigrams = []
    all_bigrams = []
    
    for response in responses:
        tokens = response.lower().split()
        all_unigrams.extend(tokens)
        # Create pairs of consecutive words
        all_bigrams.extend(zip(tokens[:-1], tokens[1:]))
    
    # What fraction of words/pairs are unique?
    distinct_1 = len(set(all_unigrams)) / len(all_unigrams) if all_unigrams else 0
    distinct_2 = len(set(all_bigrams)) / len(all_bigrams) if all_bigrams else 0
    
    return distinct_1, distinct_2

# Generate a bunch of responses for diversity analysis
diversity_prompts = [
    "Tell me about machine learning.",
    "Explain artificial intelligence.",
    "What is deep learning?",
    "Describe natural language processing.",
    "Explain what data science is.",
]

print("\nGenerating responses for diversity analysis...")
responses = [generate_response(model, tokenizer, p) for p in diversity_prompts]
d1, d2 = compute_diversity(responses)

print(f"\n✓ Diversity analysis complete!")
print(f"  Distinct-1 (unique words): {d1:.2%}")
print(f"  Distinct-2 (unique word pairs): {d2:.2%}")
print(f"\nInterpretation:")
print(f"  - Distinct-1 > 40%: Good variety")
print(f"  - Distinct-1 20-40%: Okay")
print(f"  - Distinct-1 < 20%: Mode collapse (model stuck repeating itself)")
print(f"\nYour model: ", end="")
if d1 > 0.4:
    print("Good variety! 🎉")
elif d1 > 0.2:
    print("Okay diversity.")
else:
    print("Warning: Low diversity. Try different sampling parameters.")

Generating responses for diversity analysis...

✓ Diversity analysis complete!
  Distinct-1 (unique words): 30.77%
  Distinct-2 (unique word pairs): 59.43%

Interpretation:
  - Distinct-1 > 40%: Good variety
  - Distinct-1 20-40%: Okay
  - Distinct-1 < 20%: Mode collapse (model stuck repeating itself)

Your model: Okay diversity.
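
To see what distinct-1 actually catches, here’s an optional toy check that runs compute_diversity on an obviously repetitive response versus a varied one (made-up sentences, just for illustration):

# Toy check: repetition drives distinct-1 down, variety drives it up.
collapsed = ["the model is the model is the model is the model is the model is the model"]
varied = ["gradient descent updates weights to reduce the training loss over time"]

for name, responses in [("collapsed", collapsed), ("varied", varied)]:
    d1, d2 = compute_diversity(responses)
    print(f"{name:9s}: distinct-1 = {d1:.2%}, distinct-2 = {d2:.2%}")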

Step 8: Save Your Model

You just spent 15 minutes training this thing. Let’s not lose it!

Saving is simple—we just dump the model weights and tokenizer config to disk. Then you can reload them later (or share them with others).

# Save model and tokenizer to disk
save_path = "./my_finetuned_model"

print(f"Saving model to {save_path}...")
model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
print("✓ Model saved!")

# Show what got saved
import os
print(f"\nSaved files:")
total_size = 0
for f in sorted(os.listdir(save_path)):
    size = os.path.getsize(os.path.join(save_path, f)) / 1e6
    total_size += size
    print(f"  {f}: {size:.1f} MB")

print(f"\nTotal size: {total_size:.1f} MB")
print(f"\nYou can now load this model anytime with:")
print(f"  model = AutoModelForCausalLM.from_pretrained('{save_path}')")
print(f"  tokenizer = AutoTokenizer.from_pretrained('{save_path}')")
Saving model to ./my_finetuned_model...
✓ Model saved!

Saved files:
  config.json: 0.0 MB
  generation_config.json: 0.0 MB
  merges.txt: 0.5 MB
  model.safetensors: 497.8 MB
  special_tokens_map.json: 0.0 MB
  tokenizer.json: 3.6 MB
  tokenizer_config.json: 0.0 MB
  vocab.json: 0.8 MB

Total size: 502.6 MB

You can now load this model anytime with:
  model = AutoModelForCausalLM.from_pretrained('./my_finetuned_model')
  tokenizer = AutoTokenizer.from_pretrained('./my_finetuned_model')
# Let's verify the saved model actually works
print("Testing saved model (to make sure saving worked)...")

loaded_model = AutoModelForCausalLM.from_pretrained(save_path)
loaded_tokenizer = AutoTokenizer.from_pretrained(save_path)
loaded_model = loaded_model.to(device)
loaded_model.eval()

test_instruction = "What is the meaning of life?"
response = generate_response(loaded_model, loaded_tokenizer, test_instruction)

print(f"\n✓ Loaded model from disk successfully!")
print(f"\nTest question: {test_instruction}")
print(f"Answer: {response}")
print(f"\nLooks good! Your model is saved and ready to use.")
Testing saved model (to make sure saving worked)...

✓ Loaded model from disk successfully!

Test question: What is the meaning of life?
Answer: Life is a complex and interconnected system of life, and we live by our own choices and desires. We make our own decisions about our lives, and we learn from our experiences to make better choices.

Looks good! Your model is saved and ready to use.

You Did It! 🎉

Seriously. You just fine-tuned a language model, training loop and all.

What you accomplished:

  1. Loaded a base GPT-2 model (terrible at following instructions)

  2. Prepared training data with proper label masking

  3. Trained the model using supervised fine-tuning

  4. Watched it go from gibberish to actual answers

  5. Evaluated it with perplexity and diversity metrics

  6. Saved it for later use

This is the same basic process used to create ChatGPT, Claude, and every other instruction-following LLM. The production versions use more data, bigger models, LoRA for efficiency, and RLHF for alignment—but the core idea is exactly what you just did.

What to Try Next

Now that you’ve got the basics down:

  1. Train longer - Try 3-5 epochs or use the full Alpaca dataset (52k examples)

  2. Use LoRA - Fine-tune only a small number of parameters (way more efficient; see the sketch after this list)

  3. Try DPO - Align the model with human preferences using the reward/preference notebooks

  4. Bigger models - GPT-2 Medium/Large, or even Llama if you’ve got the VRAM

  5. Your own data - Got a specific task? Create a dataset and fine-tune for it!
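
For the LoRA route, here’s a minimal sketch using the peft library (an assumption: it requires pip install peft; target_modules=["c_attn"] targets GPT-2’s fused attention projection in the Hugging Face implementation):

# Minimal LoRA sketch (assumes the peft library: pip install peft).
from peft import LoraConfig, TaskType, get_peft_model
from transformers import AutoModelForCausalLM

base = AutoModelForCausalLM.from_pretrained("gpt2")

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                        # rank of the low-rank update matrices
    lora_alpha=16,              # scaling factor for the update
    lora_dropout=0.05,
    target_modules=["c_attn"],  # GPT-2's fused QKV projection
)

peft_model = get_peft_model(base, lora_config)
peft_model.print_trainable_parameters()  # typically well under 1% of the 124M total

# From here, the training loop above works unchanged; just optimize
# peft_model.parameters() instead of model.parameters().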

Common Issues & Tips

Loss not going down?

  • Check your learning rate (try 1e-5 to 1e-4)

  • Make sure labels are masked properly (prompt tokens should be -100)

  • Try more epochs or more data

Model output is repetitive?

  • Adjust temperature and top_p during generation

  • Check diversity metrics (distinct-1/distinct-2)

  • Might need more varied training data

Out of memory?

  • Reduce batch_size (try 2 or 1)

  • Reduce max_length (try 128 or 64)

  • Use gradient checkpointing (more compute, less memory; see the snippet after this list)

  • Consider LoRA (way less memory)
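
Gradient checkpointing is a one-line switch on Hugging Face models; it recomputes activations during the backward pass instead of storing them, trading extra compute for a smaller memory footprint:

# Trade compute for memory: recompute activations during the backward pass.
model.gradient_checkpointing_enable()
# (model.gradient_checkpointing_disable() turns it back off.)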

Answers are still bad?

  • Train on more data (500 examples is pretty small)

  • Train for more epochs

  • Check that loss actually decreased during training

Final Thoughts

The model you just trained isn’t perfect. It might hallucinate, give weird answers, or ramble sometimes. That’s normal! You trained it on 500 examples for 10 minutes.

What matters is that you understand the process. You know how to:

  • Load and prepare data

  • Set up a training loop

  • Evaluate results

  • Debug when things go wrong

That’s the hard part. Scaling up to production is just... more of the same, but bigger.

Go build something cool. 🚀