Understanding Gradients

The Gap Between Using and Understanding

There’s a gap between using something and understanding it. You can drive a car without knowing how combustion engines work. You can use PyTorch without knowing what loss.backward() actually computes.

But if you want to design cars, you need thermodynamics. And if you want to design new architectures, debug training instabilities, or just satisfy your own curiosity about how these models actually learn... you need to understand the math.

This chapter closes that gap.

What We’re Going to Do

We’re going to calculate—by hand, step by step—a complete training iteration through a transformer.

Not a simplified version. Not pseudocode. The actual math, with actual numbers, showing every matrix multiplication, every activation function, every gradient.

We’ll take the sentence “I like transformers” through a tiny language model:

  1. Forward pass: Convert text to numbers, flow through attention and feed-forward layers, compute how wrong our predictions are

  2. Backward pass: Calculate gradients—how much each parameter contributed to the error

  3. Optimization: Update all ~2,600 parameters to make the model slightly less wrong

By the end, you’ll understand exactly what happens inside a transformer. Not abstractly. Concretely, with numbers you can verify yourself.

A Tiny Model (Same Math, Smaller Numbers)

Real transformers are huge. GPT-3 has 175 billion parameters. You can’t write out a 12,288-dimensional vector by hand.

So we’re building a tiny transformer with the exact same architecture, just scaled down to human-tractable dimensions:

| What | Our Model | GPT-3 | Why This Matters |
|---|---|---|---|
| Embedding dimension | 16 | 12,288 | Small enough to print full matrices |
| Attention heads | 2 | 96 | Enough to show multi-head mechanics |
| Feed-forward hidden size | 64 | 49,152 | Standard 4× expansion ratio |
| Vocabulary | 6 tokens | 50,257 tokens | Just: PAD, BOS, EOS, I, like, transformers |
| Layers | 1 | 96 | One block shows everything; more layers just repeat |
| Total parameters | ~2,600 | 175 billion | We can track every single one |
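
If it helps to see those choices as code, here is a rough sketch of the configuration as plain Python constants. The names are illustrative, not the variable names the notebooks actually use:

```python
# Illustrative configuration mirroring the table above (names are ours, not the notebooks').
D_MODEL = 16                   # embedding dimension
N_HEADS = 2                    # attention heads
D_HEAD = D_MODEL // N_HEADS    # 8 dimensions per head
D_FF = 4 * D_MODEL             # feed-forward hidden size: the standard 4x expansion -> 64
N_LAYERS = 1                   # a single transformer block

# The entire vocabulary: three special tokens plus our three words.
VOCAB = ["PAD", "BOS", "EOS", "I", "like", "transformers"]
VOCAB_SIZE = len(VOCAB)        # 6
```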

The math is identical. When you multiply a 5×16 matrix by a 16×8 matrix, the operation is exactly the same whether that shared dimension is 16 or 12,288. We’re just keeping the numbers small enough that you can see what’s happening.
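
To see that in code: below is a plain-Python matrix multiply in the explicit lists-and-loops style used throughout these notebooks. Nothing about it changes when the dimensions grow; only the loop bounds do.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0.0] * p for _ in range(n)]
    for i in range(n):              # each row of a
        for j in range(p):          # each column of b
            for k in range(m):      # accumulate the dot product along the shared dimension
                result[i][j] += a[i][k] * b[k][j]
    return result

# A 2x3 times 3x2 example -- the identical code handles 5x16 times 16x8.
print(matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]]))
# [[4.0, 5.0], [10.0, 11.0]]
```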

The Architecture: GPT-Style Decoder-Only Transformer

We’re using the same architecture as GPT, Claude, and LLaMA—a “decoder-only” transformer. (The “decoder-only” part means it generates text left-to-right, predicting one token at a time. BERT uses both directions; GPT-style models only look backward.)
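
In practice, “only look backward” is enforced with a causal mask. As a small sketch that anticipates the attention notebooks: position i is allowed to attend to position j only when j ≤ i.

```python
# Causal mask for a 4-token sequence: mask[i][j] is True when position i
# is allowed to attend to position j, i.e. when j <= i.
n = 4
mask = [[j <= i for j in range(n)] for i in range(n)]
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
# x . . .
# x x . .
# x x x .
# x x x x
```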

Here’s the high-level flow:

Text: "I like transformers"
         ↓
    [Tokenization]     → Convert words to token IDs
         ↓
    [Embeddings]       → Look up vectors for each token + position
         ↓
    [Self-Attention]   → Each token looks at previous tokens
         ↓
    [Feed-Forward]     → Process each position independently
         ↓
    [Layer Norm]       → Normalize activations (with residual connections)
         ↓
    [Output Projection] → Convert back to vocabulary-sized predictions
         ↓
    [Loss Calculation]  → How wrong are we?

We’ll spend one notebook on each major step, showing every calculation.
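
One ingredient shows up twice in that flow: inside self-attention, where it turns scores into attention weights, and at the output, where it turns vocabulary-sized predictions into probabilities. That ingredient is the softmax, and in our pure-Python style it is just a few lines:

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))
# approximately [0.659, 0.242, 0.099]
```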

The Training Loop: Forward, Backward, Update

Training a neural network is conceptually simple. You repeat three steps:

1. Forward pass: Run the input through the model, get predictions, measure error (the “loss”)

2. Backward pass: For every parameter in the model, calculate: “if I nudge this parameter slightly, how much does the loss change?” This is the gradient.

3. Update: Nudge every parameter in the direction that reduces the loss.

Repeat a few million times. The loss gets smaller. The model gets smarter.

That’s the entire algorithm. The complexity is in the details—and we’re going to see every detail.
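
To make those three steps tangible before we scale them up, here is the same loop on a deliberately tiny one-parameter model, y = w * x, fit to a single data point. The gradient here is estimated by literally nudging the parameter; the notebooks compute gradients exactly via the chain rule instead, but the idea is the same.

```python
# Toy training loop: fit y = w * x to one data point (x = 2, target = 6).
x, target = 2.0, 6.0
w = 0.5            # our single parameter, badly initialized
lr = 0.05          # learning rate
eps = 1e-6         # size of the nudge used to estimate the gradient

def loss(w):
    prediction = w * x
    return (prediction - target) ** 2              # squared error

for _ in range(50):
    current = loss(w)                              # 1. forward pass: how wrong are we?
    gradient = (loss(w + eps) - current) / eps     # 2. backward pass (here: a numeric nudge)
    w -= lr * gradient                             # 3. update: move against the gradient

print(round(w, 3))   # ~3.0, since 3.0 * 2 = 6
```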

What You’ll Need to Follow Along

Math background: Basic calculus (derivatives, chain rule, partial derivatives) and linear algebra (matrix multiplication, vectors). If you remember what a dot product is and can take a derivative, you’re good.

Programming: We use pure Python—no NumPy, no PyTorch. Everything is explicit lists and loops so you can see exactly what’s happening. (This is intentionally inefficient. We’re optimizing for clarity, not speed.)
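
As a taste of that style, here is one way the humble dot product looks when written with an explicit loop rather than a library call; the larger operations in the notebooks are written in the same explicit way.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        total += a_i * b_i
    return total

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0
```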

Patience: Some notebooks have a lot of numbers. That’s the point. You don’t have to verify every calculation, but knowing you could is what makes this different from a high-level explanation.

You don’t need a PhD. You don’t need to have trained models before. You just need to be willing to follow the math step by step.

Chapter Overview

Forward Pass (Notebooks 01-07)

| Notebook | What We Calculate |
|---|---|
| 01 - Tokenization & Embeddings | Convert “I like transformers” to vectors |
| 02 - QKV Projections | Create Query, Key, Value representations for attention |
| 03 - Attention Scores | Compute how much each token attends to others |
| 04 - Multi-Head Attention | Combine multiple attention “perspectives” |
| 05 - Feed-Forward Network | Apply non-linear transformations |
| 06 - Layer Normalization | Stabilize activations with residual connections |
| 07 - Cross-Entropy Loss | Measure prediction error |

Backward Pass (Notebooks 08-09)

| Notebook | What We Calculate |
|---|---|
| 08 - Loss Gradients | Gradient of loss with respect to output logits |
| 09 - Backpropagation | Gradients for every layer via chain rule |

Optimization (Notebook 10)

| Notebook | What We Calculate |
|---|---|
| 10 - AdamW Optimizer | Update all parameters using adaptive learning rates |

Let’s Begin

Each notebook builds on the previous one, so going in order is recommended for your first read. All calculations are executable—run the cells, change the numbers, see what happens.

Ready? Let’s start by converting text into numbers.