The Gap Between Using and Understanding¶
There’s a gap between using something and understanding it. You can drive a car without knowing how combustion engines work. You can use PyTorch without knowing what loss.backward() actually computes.
But if you want to design cars, you need thermodynamics. And if you want to design new architectures, debug training instabilities, or just satisfy your own curiosity about how these models actually learn... you need to understand the math.
This chapter closes that gap.
What We’re Going to Do¶
We’re going to calculate—by hand, step by step—a complete training iteration through a transformer.
Not a simplified version. Not pseudocode. The actual math, with actual numbers, showing every matrix multiplication, every activation function, every gradient.
We’ll take the sentence “I like transformers” through a tiny language model:
Forward pass: Convert text to numbers, flow through attention and feed-forward layers, compute how wrong our predictions are
Backward pass: Calculate gradients—how much each parameter contributed to the error
Optimization: Update all ~2,600 parameters to make the model slightly less wrong
By the end, you’ll understand exactly what happens inside a transformer. Not abstractly. Concretely, with numbers you can verify yourself.
A Tiny Model (Same Math, Smaller Numbers)¶
Real transformers are huge. GPT-3 has 175 billion parameters. You can’t write out a 12,288-dimensional vector by hand.
So we’re building a tiny transformer with the exact same architecture, just scaled down to human-tractable dimensions:
| What | Our Model | GPT-3 | Why This Matters |
|---|---|---|---|
| Embedding dimension | 16 | 12,288 | Small enough to print full matrices |
| Attention heads | 2 | 96 | Enough to show multi-head mechanics |
| Feed-forward hidden size | 64 | 49,152 | Standard 4× expansion ratio |
| Vocabulary | 6 tokens | 50,257 tokens | Just: PAD, BOS, EOS, I, like, transformers |
| Layers | 1 | 96 | One block shows everything; more layers just repeat |
| Total parameters | ~2,600 | 175 billion | We can track every single one |
The math is identical. When you multiply a 5×16 matrix by a 16×8 matrix, the operation works the same way whether that shared dimension is 16 or 12,288. We’re just keeping the numbers small enough that you can see what’s happening.
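To make that concrete, here is what a matrix multiply looks like in the list-and-loop style the notebooks use. The `matmul` helper and the toy values are our own illustration, not code from the notebooks:

```python
def matmul(A, B):
    """Multiply an (n x k) matrix by a (k x m) matrix, both stored as lists of lists."""
    n, k, m = len(A), len(B), len(B[0])
    return [[sum(A[i][p] * B[p][j] for p in range(k)) for j in range(m)]
            for i in range(n)]

# Toy values: a 5x16 "activation" matrix times a 16x8 "weight" matrix.
A = [[0.01 * (i + j) for j in range(16)] for i in range(5)]
W = [[0.01 * (i - j) for j in range(8)] for i in range(16)]
out = matmul(A, W)
print(len(out), len(out[0]))  # 5 8 -- the same operation at any inner dimension
```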
The Architecture: GPT-Style Decoder-Only Transformer¶
We’re using the same architecture as GPT, Claude, and LLaMA—a “decoder-only” transformer. (The “decoder-only” part means it generates text left-to-right, predicting one token at a time. BERT uses both directions; GPT-style models only look backward.)
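To see what “only look backward” means in numbers, here is a 4-position causal mask sketched in pure Python, where row i marks the positions token i is allowed to attend to. This is our own illustration; the real mask appears when we compute attention scores:

```python
# Causal mask: position i may attend to positions 0..i and nothing after.
n = 4
mask = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]
for row in mask:
    print(row)
# [1, 0, 0, 0]   <- the first token sees only itself
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]   <- the last token sees everything up to and including itself
```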
Here’s the high-level flow:
Text: "I like transformers"
↓
[Tokenization] → Convert words to token IDs
↓
[Embeddings] → Look up vectors for each token + position
↓
[Self-Attention] → Each token looks at previous tokens
↓
[Feed-Forward] → Process each position independently
↓
[Layer Norm] → Normalize activations (with residual connections)
↓
[Output Projection] → Convert back to vocabulary-sized predictions
↓
[Loss Calculation] → How wrong are we?

We’ll spend one notebook on each major step, showing every calculation.
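As a taste of that first step, here is roughly what tokenization amounts to with the six-token vocabulary from the table above. The ID assignments and the BOS marker below are our own choices for illustration; Notebook 01 defines the real scheme:

```python
# Map each token to an integer ID, then turn the sentence into a list of IDs.
vocab = ["PAD", "BOS", "EOS", "I", "like", "transformers"]
token_to_id = {tok: i for i, tok in enumerate(vocab)}

text = "I like transformers"
token_ids = [token_to_id["BOS"]] + [token_to_id[word] for word in text.split()]
print(token_ids)  # [1, 3, 4, 5] with this ID assignment
```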
The Training Loop: Forward, Backward, Update¶
Training a neural network is conceptually simple. You repeat three steps:
1. Forward pass: Run the input through the model, get predictions, measure error (the “loss”)
2. Backward pass: For every parameter in the model, calculate: “if I nudge this parameter slightly, how much does the loss change?” This is the gradient.
3. Update: Nudge every parameter in the direction that reduces the loss.
Repeat a few million times. The loss gets smaller. The model gets smarter.
That’s the entire algorithm. The complexity is in the details—and we’re going to see every detail.
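Here is that three-step loop in miniature, on something far smaller than a transformer: one parameter w fitting y = 3x, with plain gradient descent standing in for AdamW (which arrives in Notebook 10). Everything in this sketch is a made-up toy, but the forward/backward/update rhythm is exactly the one we will follow:

```python
# A complete toy training loop: forward, backward, update.
w = 0.0                                   # one parameter instead of ~2,600
data = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]
lr = 0.01                                 # learning rate

for step in range(100):
    for x, y in data:
        pred = w * x                      # 1. forward pass: prediction...
        loss = (pred - y) ** 2            #    ...and loss
        grad = 2 * (pred - y) * x         # 2. backward pass: d(loss)/dw via the chain rule
        w -= lr * grad                    # 3. update: nudge w to reduce the loss

print(round(w, 3))  # close to 3.0
```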
What You’ll Need to Follow Along¶
Math background: Basic calculus (derivatives, chain rule, partial derivatives) and linear algebra (matrix multiplication, vectors). If you remember what a dot product is and can take a derivative, you’re good.
Programming: We use pure Python—no NumPy, no PyTorch. Everything is explicit lists and loops so you can see exactly what’s happening. (This is intentionally inefficient. We’re optimizing for clarity, not speed.)
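For instance, a dot product written with no libraries at all, which is how every calculation in these notebooks looks (the numbers here are arbitrary):

```python
a = [1.0, 2.0, 3.0]
b = [0.5, -1.0, 2.0]

# Multiply element by element and add up the results.
dot = 0.0
for a_i, b_i in zip(a, b):
    dot += a_i * b_i

print(dot)  # 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 = 4.5
```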
Patience: Some notebooks have a lot of numbers. That’s the point. You don’t have to verify every calculation, but knowing you could is what makes this different from a high-level explanation.
You don’t need a PhD. You don’t need to have trained models before. You just need to be willing to follow the math step by step.
Chapter Overview¶
Forward Pass (Notebooks 01-07)¶
| Notebook | What We Calculate |
|---|---|
| 01 - Tokenization & Embeddings | Convert “I like transformers” to vectors |
| 02 - QKV Projections | Create Query, Key, Value representations for attention |
| 03 - Attention Scores | Compute how much each token attends to others |
| 04 - Multi-Head Attention | Combine multiple attention “perspectives” |
| 05 - Feed-Forward Network | Apply non-linear transformations |
| 06 - Layer Normalization | Stabilize activations with residual connections |
| 07 - Cross-Entropy Loss | Measure prediction error |
Backward Pass (Notebooks 08-09)¶
| Notebook | What We Calculate |
|---|---|
| 08 - Loss Gradients | Gradient of loss with respect to output logits |
| 09 - Backpropagation | Gradients for every layer via chain rule |
Optimization (Notebook 10)¶
| Notebook | What We Calculate |
|---|---|
| 10 - AdamW Optimizer | Update all parameters using adaptive learning rates |
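As a preview of that final step, here is one AdamW update for a single parameter, written in plain Python with commonly used default hyperparameters. The parameter, gradient, and state values below are made up; Notebook 10 runs this update for every one of the model’s parameters with its real gradients:

```python
import math

theta, grad = 0.5, 0.2                   # a parameter and its gradient (made-up values)
m, v, t = 0.0, 0.0, 1                    # first moment, second moment, step count
lr, beta1, beta2 = 0.001, 0.9, 0.999     # common AdamW defaults
eps, weight_decay = 1e-8, 0.01

m = beta1 * m + (1 - beta1) * grad          # running average of gradients
v = beta2 * v + (1 - beta2) * grad ** 2     # running average of squared gradients
m_hat = m / (1 - beta1 ** t)                # bias correction for early steps
v_hat = v / (1 - beta2 ** t)

# Adaptive step plus decoupled weight decay -- the "W" in AdamW.
theta -= lr * (m_hat / (math.sqrt(v_hat) + eps) + weight_decay * theta)
print(theta)
```

The division by math.sqrt(v_hat) is what makes the learning rate adaptive: each parameter’s step is normalized by the running magnitude of its own gradients rather than scaling directly with the raw gradient.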
Let’s Begin¶
Each notebook builds on the previous one, so going in order is recommended for your first read. All calculations are executable—run the cells, change the numbers, see what happens.
Ready? Let’s start by converting text into numbers.