Model Interpretability
Now that we’ve built and trained a transformer, how do we understand what it has learned? Mechanistic interpretability provides tools to peek inside the “black box” and discover the circuits and patterns the model uses.
This section covers four powerful techniques: Logit Lens, Attention Analysis, Induction Heads, and Activation Patching.
What is Mechanistic Interpretability?
Instead of just asking “does the model work?”, we ask:
- When does the model “know” the answer? (which layer?)
- How does information flow through the network?
- Which components are responsible for specific behaviors?
- What patterns or circuits has the model learned?
This connects to cutting-edge research from Anthropic, OpenAI, and academic labs exploring how LLMs actually work under the hood.
Logit Lens: Seeing Predictions Evolve
The logit lens technique lets us visualize what the model would predict if we stopped at each layer.
How It Works
Normally, we only see the final output:
```
Input → Layer 1 → Layer 2 → ... → Layer N → Unembed → Logits
```

With logit lens, we apply unembedding at each layer:

```
Input → Layer 1 → [Unembed] → "What now?"
      → Layer 2 → [Unembed] → "What now?"
      → Layer 3 → [Unembed] → "What now?"
```

Example Insight
Input: "The capital of France is"
- Layer 0: “the” (15%), “a” (12%) → Generic, common words
- Layer 2: “located” (18%), “Paris” (15%) → Starting to understand context
- Layer 4: “Paris” (65%), “French” (10%) → Confident, correct answer!
- Layer 6: “Paris” (72%), “France” (8%) → Final refinement
Key Insight: The model “knows” Paris by Layer 4. Later layers just refine the distribution.
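Conceptually, the logit lens is only a few lines: after each block, take the residual stream, pass it through the final layer norm and the unembedding, and read off the top predictions. Below is a minimal sketch, assuming a GPT-style model that exposes `embed`, `blocks`, `ln_final`, and `unembed` sub-modules and a tokenizer with `encode`/`decode` methods — illustrative names, not necessarily this repo's exact API (see `src/transformer/interpretability/logit_lens.py` for the real implementation).

```python
import torch

@torch.no_grad()
def logit_lens(model, tokenizer, text, top_k=3):
    """Print the top next-token guesses after every transformer block.

    Sketch only: assumes `model.embed`, `model.blocks`, `model.ln_final`,
    and `model.unembed` sub-modules plus a tokenizer with encode/decode.
    Adapt the names to your implementation.
    """
    tokens = torch.tensor([tokenizer.encode(text)])   # shape (1, seq_len)
    resid = model.embed(tokens)                       # token (+ positional) embeddings
    for i, block in enumerate(model.blocks):
        resid = block(resid)                          # run one transformer block
        # Pretend the model stopped here: apply the final norm and unembedding
        logits = model.unembed(model.ln_final(resid))
        probs = torch.softmax(logits[0, -1], dim=-1)  # distribution at the last position
        top_probs, top_ids = probs.topk(top_k)
        guesses = ", ".join(f"{tokenizer.decode([t.item()])!r} ({p.item():.0%})"
                            for t, p in zip(top_ids, top_probs))
        print(f"Layer {i}: {guesses}")
```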
Try It Yourself
```bash
# Demo mode - educational examples
python main.py interpret logit-lens checkpoints/model.pt --demo

# Analyze specific text
python main.py interpret logit-lens checkpoints/model.pt \
  --text "The Eiffel Tower is in"

# Interactive mode
python main.py interpret logit-lens checkpoints/model.pt --interactive
```

Attention Analysis: What Do Heads Focus On?
The attention analysis tool reveals what each attention head is looking at when processing text.
What We Discover
Attention weights show which tokens each position “attends to”. By analyzing these patterns across heads, we discover specialized behaviors:
- Previous token heads: Always look at position i-1
- Uniform heads: Spread attention evenly (averaging information)
- Start token heads: Focus on the beginning of the sequence
- Sparse heads: Concentrate on very few key tokens
Example Discovery
Input: "The cat sat on the mat"
Head 2.3 (Previous Token):
- “cat” attends to “The” (100%)
- “sat” attends to “cat” (100%)
- → Implements a previous-token circuit!
Head 4.1 (Uniform):
- Each token: 16.7% to all positions
- → Averages information uniformly
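To make these categories concrete, here is a rough sketch of how a single head's pattern could be auto-classified from its attention matrix, assuming you can extract per-head attention weights (e.g. via a forward hook). The thresholds are illustrative guesses, not the criteria used by the repo's `attention_analysis.py`:

```python
import math
import torch

def classify_head(attn: torch.Tensor) -> str:
    """Heuristically label one head's pattern from its attention matrix.

    `attn` is a (seq_len, seq_len) tensor for a single head on a single
    input (rows = query positions, columns = key positions, rows sum to 1).
    The thresholds below are illustrative, not the repo's exact criteria.
    """
    seq_len = attn.shape[0]
    prev_token = attn.diagonal(-1).mean().item()   # avg weight on position i-1
    start_token = attn[:, 0].mean().item()         # avg weight on the first token
    # Row entropy, compared against the uniform (maximum) entropy log(seq_len)
    entropy = -(attn * (attn + 1e-9).log()).sum(-1).mean().item()
    max_entropy = math.log(seq_len)

    if prev_token > 0.7:
        return "previous-token head"
    if start_token > 0.7:
        return "start-token head"
    if entropy > 0.9 * max_entropy:
        return "uniform head"
    if entropy < 0.3 * max_entropy:
        return "sparse head"
    return "unclassified"
```

On the example above, Head 2.3 would score near 1.0 on the previous-token check, while Head 4.1 would sit close to the maximum entropy.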
Try It Yourself
```bash
# Analyze a specific head
python main.py interpret attention checkpoints/model.pt \
  --text "Hello world" --layer 2 --head 3

# Find all previous-token heads
python main.py interpret attention checkpoints/model.pt \
  --text "Hello world"   # Shows pattern summary
```

Induction Heads: Pattern Matching Circuits
The induction head detector finds circuits that implement in-context learning - the ability to copy from earlier patterns.
What Are Induction Heads?
Given a repeated pattern like:
```
Input:      "A B C ... A B [?]"
Prediction: "C"
```

Induction heads learn to predict C by recognizing the repeated “A B” pattern and copying what came after the first occurrence.
The Circuit
Induction typically involves two heads working together:
1. Previous Token Head (Layer L):
   - At position i, attends to i-1
   - Creates representation of “what came before”
2. Induction Head (Layer L+1):
   - Queries for matches to previous token
   - Attends to what came AFTER those matches
   - Predicts the next token
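Here is a rough sketch of how such heads can be detected automatically: feed random token sequences repeated twice and measure how much attention each head places on the position right after the earlier occurrence of the current token. The `model(tokens, return_attn=True)` interface is an assumption for illustration; the repo's detector in `induction_heads.py` has its own API.

```python
import torch

@torch.no_grad()
def induction_scores(model, vocab_size, seq_len=20, n_seqs=10):
    """Score every head by how much it attends 'one past the previous copy'.

    Feeds random sequences repeated twice; a perfect induction head at query
    position i (in the second half) puts all its attention on key position
    i - (seq_len - 1): the token AFTER the earlier occurrence of the current
    token. Assumes `model(tokens, return_attn=True)` returns attention
    stacked as (n_layers, n_heads, 2*seq_len, 2*seq_len) for a batch of one
    (an illustrative interface, not the repo's exact API).
    """
    total = 0.0
    for _ in range(n_seqs):
        first_half = torch.randint(0, vocab_size, (1, seq_len))
        tokens = torch.cat([first_half, first_half], dim=1)  # "A B C ... A B C ..."
        _, attn = model(tokens, return_attn=True)
        # Attention from each query to the key (seq_len - 1) positions back
        diag = attn.diagonal(-(seq_len - 1), dim1=-2, dim2=-1)
        total = total + diag[..., 1:].mean(-1)  # average over second-half queries only
    return total / n_seqs  # (n_layers, n_heads); values near 1.0 = strong induction head
```

The `induction-heads` command below runs a test along these lines across all layers and reports the top-scoring heads.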
Try It Yourself
```bash
# Detect induction heads across all layers
python main.py interpret induction-heads checkpoints/model.pt

# Custom parameters: fewer tests, longer sequences
python main.py interpret induction-heads checkpoints/model.pt \
  --num-sequences 50 --seq-length 40 --top-k 10
```

Activation Patching: Causal Interventions
The activation patching tool performs causal experiments to identify which components are truly responsible for specific behaviors.
The Question
We can observe what the model does, but which parts are actually causing the behavior?
Activation patching answers this through intervention experiments:
1. Run model on “clean” input (correct behavior)
2. Run model on “corrupted” input (incorrect behavior)
3. For each component, swap clean activations into the corrupted run
4. Measure how much this restores correct behavior
High recovery = that component is causally important!
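A minimal sketch of one such intervention using PyTorch forward hooks, assuming the transformer blocks are collected in a `model.blocks` list and that each block's output is the residual stream at that layer (illustrative names; the repo's `activation_patching.py` implements the full version):

```python
import torch

@torch.no_grad()
def patch_layer(model, clean_tokens, corrupt_tokens, layer):
    """Run the corrupted prompt, but splice in the clean activations at one layer.

    Sketch only: assumes the blocks live in a `model.blocks` list, that
    `model(tokens)` returns logits, and that both prompts tokenize to the
    same length (illustrative names; adjust to your implementation).
    """
    # 1. Run the clean prompt and cache the chosen block's output.
    cache = {}
    def save_hook(module, inputs, output):
        cache["clean"] = output.detach()
    handle = model.blocks[layer].register_forward_hook(save_hook)
    model(clean_tokens)
    handle.remove()

    # 2. Re-run on the corrupted prompt, overwriting that block's output.
    def patch_hook(module, inputs, output):
        return cache["clean"]          # returning a value replaces the output
    handle = model.blocks[layer].register_forward_hook(patch_hook)
    patched_logits = model(corrupt_tokens)
    handle.remove()

    return patched_logits              # compare against the clean and corrupted runs
```

Running this for every layer and comparing each patched prediction against the clean and corrupted ones shows which layers matter.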
Example Experiment
```
Clean:     "The Eiffel Tower is in"   → Predicts: "Paris" (85%)
Corrupted: "The Empire State is in"   → Predicts: "New York" (78%)

Test:   Patch Layer 4 activations from clean → corrupted
        → Predicts: "Paris" (82%)

Result: Layer 4 recovery = 90%  →  Layer 4 is CRITICAL!
```
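The recovery figure summarizes how much of the clean behavior the patch restores. One common convention (the repo's exact metric may differ, e.g. it may be computed over logit differences rather than raw probabilities) is:

```python
def recovery(clean, corrupted, patched):
    """Fraction of the clean behavior restored by the patch.

    Each argument is the same scalar metric (e.g. probability or logit of the
    target token) measured on the clean, corrupted, and patched runs.
    0.0 = the patch changed nothing, 1.0 = it fully restored the clean behavior.
    This is one common convention; the repo's exact formula may differ.
    """
    return (patched - corrupted) / (clean - corrupted)
```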
Try It Yourself

```bash
# Test which layers are causally important
python main.py interpret patch checkpoints/model.pt \
  --clean "The Eiffel Tower is in" \
  --corrupted "The Empire State Building is in" \
  --target "Paris"
```

Learn More
Explore the implementation files - each includes comprehensive documentation explaining the theory and methods:
- src/transformer/interpretability/logit_lens.py
- src/transformer/interpretability/attention_analysis.py
- src/transformer/interpretability/induction_heads.py
- src/transformer/interpretability/activation_patching.py
- commands/interpret.py