
Model Interpretability

Now that we’ve built and trained a transformer, how do we understand what it has learned? Mechanistic interpretability provides tools to peek inside the “black box” and discover the circuits and patterns the model uses.

This section covers four powerful techniques: Logit Lens, Attention Analysis, Induction Heads, and Activation Patching.

Instead of just asking “does the model work?”, we ask:

  • When does the model “know” the answer? (which layer?)
  • How does information flow through the network?
  • Which components are responsible for specific behaviors?
  • What patterns or circuits has the model learned?

This connects to cutting-edge research from Anthropic, OpenAI, and academic labs exploring how LLMs actually work under the hood.

Logit Lens: What Does Each Layer Predict?

The logit lens technique lets us visualize what the model would predict if we stopped at each layer.

Normally, we only see the final output:

Input → Layer 1 → Layer 2 → ... → Layer N → Unembed → Logits

With the logit lens, we apply the unembedding at every layer:

Input → Layer 1 → [Unembed] → "What now?"
→ Layer 2 → [Unembed] → "What now?"
→ Layer 3 → [Unembed] → "What now?"
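
Here is a minimal sketch of that loop in PyTorch. It assumes the model exposes its pieces as embed, blocks, ln_final, and unembed, and that the tokenizer has encode/decode methods; those names are illustrative, so adapt them to your implementation.

import torch

@torch.no_grad()
def logit_lens(model, tokenizer, text, top_k=3):
    # Attribute names (embed, blocks, ln_final, unembed) are assumptions about
    # the model's structure; rename them to match your transformer.
    tokens = torch.tensor([tokenizer.encode(text)])
    resid = model.embed(tokens)                   # residual stream after embedding
                                                  # (plus positional embeddings, if separate)
    for i, block in enumerate(model.blocks):
        resid = block(resid)                      # run one transformer block
        # Read out "what would the model say if it stopped here": final layer norm
        # + unembedding applied to the intermediate residual stream.
        logits = model.unembed(model.ln_final(resid))
        probs = logits[0, -1].softmax(dim=-1)     # next-token distribution
        top = probs.topk(top_k)
        preds = [(tokenizer.decode([t.item()]), f"{p.item():.0%}")
                 for t, p in zip(top.indices, top.values)]
        print(f"Layer {i}: {preds}")

Its per-layer printout is exactly the kind of readout shown in the example below.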

Input: "The capital of France is"

  • Layer 0: “the” (15%), “a” (12%) → Generic, common words
  • Layer 2: “located” (18%), “Paris” (15%) → Starting to understand context
  • Layer 4: “Paris” (65%), “French” (10%) → Confident, correct answer!
  • Layer 6: “Paris” (72%), “France” (8%) → Final refinement

Key Insight: The model “knows” Paris by Layer 4. Later layers just refine the distribution.

Terminal window
# Demo mode - educational examples
python main.py interpret logit-lens checkpoints/model.pt --demo
# Analyze specific text
python main.py interpret logit-lens checkpoints/model.pt \
--text "The Eiffel Tower is in"
# Interactive mode
python main.py interpret logit-lens checkpoints/model.pt --interactive

Attention Analysis: What Do Heads Focus On?


The attention analysis tool reveals what each attention head is looking at when processing text.

Attention weights show which tokens each position “attends to”. By analyzing these patterns across heads, we discover specialized behaviors:

  • Previous token heads: Always look at position i-1
  • Uniform heads: Spread attention evenly (averaging information)
  • Start token heads: Focus on the beginning of the sequence
  • Sparse heads: Concentrate on very few key tokens

Input: "The cat sat on the mat"

Head 2.3 (Previous Token):

  • “cat” attends to “The” (100%)
  • “sat” attends to “cat” (100%)
  • → Implements a previous-token circuit!

Head 4.1 (Uniform):

  • Each token: 16.7% to all positions
  • → Averages information uniformly
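
These categories can be detected automatically from the attention weights themselves. Below is a rough classifier sketch; it assumes you already have a single head's (seq_len, seq_len) attention matrix (rows = query positions, causal masking applied), and the thresholds are heuristics rather than canonical values.

import torch

def classify_head(attn):
    # attn: (seq_len, seq_len) attention weights for one head,
    # row i = distribution over the positions that query position i attends to.
    seq_len = attn.shape[0]
    prev_token = attn.diagonal(offset=-1).mean().item()   # mass on position i-1
    start_token = attn[1:, 0].mean().item()                # mass on the first token
    # How close each row is to uniform attention over its visible (causal) positions.
    row_entropy = -(attn[1:].clamp_min(1e-9).log() * attn[1:]).sum(dim=-1)
    max_entropy = torch.arange(2, seq_len + 1, dtype=attn.dtype).log()
    uniformity = (row_entropy / max_entropy).mean().item()
    # The cutoffs below are heuristics, not canonical values.
    if prev_token > 0.8:
        return "previous-token head"
    if start_token > 0.8:
        return "start-token head"
    if uniformity > 0.9:
        return "roughly uniform head"
    return "sparse / other"

Feeding it Head 2.3's weights from the example above should return "previous-token head"; Head 4.1 would come out as roughly uniform.
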
Terminal window
# Analyze a specific head
python main.py interpret attention checkpoints/model.pt \
--text "Hello world" --layer 2 --head 3
# Find all previous-token heads
python main.py interpret attention checkpoints/model.pt \
--text "Hello world" # Shows pattern summary

Induction Heads: Pattern Matching Circuits


The induction head detector finds circuits that implement in-context learning - the ability to copy from earlier patterns.

Given a repeated pattern like:

Input: "A B C ... A B [?]"
Prediction: "C"

Induction heads learn to predict C by recognizing the repeated “A B” pattern and copying what came after the first occurrence.

Induction typically involves two heads working together:

  1. Previous Token Head (Layer L):

    • At position i, attends to i-1
    • Creates representation of “what came before”
  2. Induction Head (Layer L+1):

    • Queries for matches to previous token
    • Attends to what came AFTER those matches
    • Predicts the next token
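
A standard way to find these heads is to feed the model a random token sequence repeated twice and measure, for every head, how much attention each position in the second half pays to the token that followed its previous occurrence. The sketch below assumes a helper (here called run_with_attention, a placeholder) that returns per-head attention weights from a forward pass; how you collect those depends on your model.

import torch

@torch.no_grad()
def induction_scores(model, vocab_size, seq_len=25, num_sequences=10):
    # Build sequences of the form [x1 ... xn x1 ... xn] from random tokens.
    half = torch.randint(0, vocab_size, (num_sequences, seq_len))
    tokens = torch.cat([half, half], dim=1)
    # Placeholder: collect attention weights per head, e.g. via forward hooks,
    # as a dict {(layer, head): tensor of shape (batch, 2*seq_len, 2*seq_len)}.
    attn_by_head = run_with_attention(model, tokens)   # hypothetical helper
    q = torch.arange(seq_len, 2 * seq_len)        # query positions in the second half
    k = q - (seq_len - 1)                         # token after the previous occurrence
    scores = {}
    for (layer, head), attn in attn_by_head.items():
        # An induction head sends most of its attention from position q to q - (seq_len - 1).
        scores[(layer, head)] = attn[:, q, k].mean().item()
    return scores   # higher score ≈ stronger induction behavior

This is roughly the kind of scan the induction-heads command below performs across all layers and heads.
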
Terminal window
# Detect induction heads across all layers
python main.py interpret induction-heads checkpoints/model.pt
# Custom parameters: fewer tests, longer sequences
python main.py interpret induction-heads checkpoints/model.pt \
  --num-sequences 50 --seq-length 40 --top-k 10

Activation Patching: Which Components Cause the Behavior?

The activation patching tool performs causal experiments to identify which components are truly responsible for specific behaviors.

We can observe what the model does, but which parts are actually causing the behavior?

Activation patching answers this through intervention experiments:

  1. Run model on “clean” input (correct behavior)
  2. Run model on “corrupted” input (incorrect behavior)
  3. For each component, swap clean activations into corrupted run
  4. Measure how much this restores correct behavior

High recovery = that component is causally important!
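
"How much this restores correct behavior" is usually summarized as a single recovery score. One common formulation looks like the sketch below; the exact metric varies between papers, with the target token's logit or probability being typical choices.

def recovery_score(clean, corrupted, patched):
    # clean / corrupted / patched: the same metric (e.g. the target token's logit
    # or probability) measured on the clean run, the corrupted run, and the
    # corrupted run with one component's clean activation patched in.
    # 0.0 = patching changed nothing; 1.0 = clean behavior fully restored.
    return (patched - corrupted) / (clean - corrupted)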

Clean: "The Eiffel Tower is in"
→ Predicts: "Paris" (85%)
Corrupted: "The Empire State is in"
→ Predicts: "New York" (78%)
Test: Patch Layer 4 activations
from clean → corrupted
→ Predicts: "Paris" (82%)
Result: Layer 4 recovery = 90%
Layer 4 is CRITICAL!
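
Mechanically, a patch is just a forward hook that overwrites one component's output with a cached activation from the clean run. Here is a rough sketch for patching an entire block's output, assuming the blocks are ordinary torch.nn.Module objects reachable as model.blocks[layer] and return a plain tensor (adapt to your architecture):

import torch

@torch.no_grad()
def patch_block_output(model, clean_tokens, corrupted_tokens, layer):
    # clean_tokens and corrupted_tokens should have the same length so the
    # cached activation lines up position-by-position.
    cached = {}

    def cache_hook(module, inputs, output):
        cached["act"] = output               # remember the clean activation

    def patch_hook(module, inputs, output):
        return cached["act"]                 # returning a value replaces the output

    block = model.blocks[layer]              # assumed attribute; adapt as needed

    # 1. Clean run: cache this block's output.
    handle = block.register_forward_hook(cache_hook)
    model(clean_tokens)
    handle.remove()

    # 2. Corrupted run, with the clean activation patched in.
    handle = block.register_forward_hook(patch_hook)
    patched_logits = model(corrupted_tokens)
    handle.remove()

    # Compare patched_logits against the clean and corrupted runs (e.g. with the
    # recovery score above) to see how much behavior was restored.
    return patched_logits
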
Terminal window
# Test which layers are causally important
python main.py interpret patch checkpoints/model.pt \
--clean "The Eiffel Tower is in" \
--corrupted "The Empire State Building is in" \
--target "Paris"

Explore the implementation files; each includes comprehensive documentation explaining the theory and methods.