Understanding Gradients

The Gap Between Using and Understanding

There’s a gap between using something and understanding it. You can drive a car without knowing how combustion engines work. You can use PyTorch without knowing what loss.backward() actually computes.

But if you want to design cars, you need thermodynamics. And if you want to design new architectures, debug training instabilities, or just satisfy your own curiosity about how these models actually learn... you need to understand the math.

This chapter closes that gap.

What We’re Going to Do

We’re going to calculate—by hand, step by step—a complete training iteration through a transformer.

Not a simplified version. Not pseudocode. The actual math, with actual numbers, showing every matrix multiplication, every activation function, every gradient.

We’ll take the sentence “I like transformers” through a tiny language model:

  1. Forward pass: Convert text to numbers, flow through attention and feed-forward layers, compute how wrong our predictions are

  2. Backward pass: Calculate gradients—how much each parameter contributed to the error

  3. Optimization: Update all ~2,600 parameters to make the model slightly less wrong

By the end, you’ll understand exactly what happens inside a transformer. Not abstractly. Concretely, with numbers you can verify yourself.

A Tiny Model (Same Math, Smaller Numbers)

Real transformers are huge. GPT-3 has 175 billion parameters. You can’t write out a 12,288-dimensional vector by hand.

So we’re building a tiny transformer with the exact same architecture, just scaled down to human-tractable dimensions:

| What | Our Model | GPT-3 | Why This Matters |
|---|---|---|---|
| Embedding dimension | 16 | 12,288 | Small enough to print full matrices |
| Attention heads | 2 | 96 | Enough to show multi-head mechanics |
| Feed-forward hidden size | 64 | 49,152 | Standard 4× expansion ratio |
| Vocabulary | 6 tokens | 50,257 tokens | Just: PAD, BOS, EOS, I, like, transformers |
| Layers | 1 | 96 | One block shows everything; more layers just repeat |
| Total parameters | ~2,600 | 175 billion | We can track every single one |
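
If it helps to see those choices as code, here is a rough sketch of the configuration as plain Python constants. The names are illustrative, not the variable names the notebooks actually use:

```python
# Illustrative configuration mirroring the table above (names are ours, not the notebooks').
D_MODEL = 16                   # embedding dimension
N_HEADS = 2                    # attention heads
D_HEAD = D_MODEL // N_HEADS    # 8 dimensions per head
D_FF = 4 * D_MODEL             # feed-forward hidden size: the standard 4x expansion -> 64
N_LAYERS = 1                   # a single transformer block

# The entire vocabulary: three special tokens plus our three words.
VOCAB = ["PAD", "BOS", "EOS", "I", "like", "transformers"]
VOCAB_SIZE = len(VOCAB)        # 6
```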

The math is identical. When you multiply a 5×16 matrix by a 16×8 matrix, the operation is exactly the same whether that shared dimension is 16 or 12,288. We’re just keeping the numbers small enough that you can see what’s happening.
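
To see that in code: below is a plain-Python matrix multiply in the explicit lists-and-loops style used throughout these notebooks. Nothing about it changes when the dimensions grow; only the loop bounds do.

```python
def matmul(a, b):
    """Multiply matrix a (n x m) by matrix b (m x p), both given as lists of lists."""
    n, m, p = len(a), len(b), len(b[0])
    result = [[0.0] * p for _ in range(n)]
    for i in range(n):              # each row of a
        for j in range(p):          # each column of b
            for k in range(m):      # accumulate the dot product along the shared dimension
                result[i][j] += a[i][k] * b[k][j]
    return result

# A 2x3 times 3x2 example -- the identical code handles 5x16 times 16x8.
print(matmul([[1, 2, 3], [4, 5, 6]], [[1, 0], [0, 1], [1, 1]]))
# [[4.0, 5.0], [10.0, 11.0]]
```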

The Architecture: GPT-Style Decoder-Only Transformer

We’re using the same architecture as GPT, Claude, and LLaMA—a “decoder-only” transformer. (The “decoder-only” part means it generates text left-to-right, predicting one token at a time. BERT uses both directions; GPT-style models only look backward.)
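
In practice, “only look backward” is enforced with a causal mask. As a small sketch that anticipates the attention notebooks: position i is allowed to attend to position j only when j ≤ i.

```python
# Causal mask for a 4-token sequence: mask[i][j] is True when position i
# is allowed to attend to position j, i.e. when j <= i.
n = 4
mask = [[j <= i for j in range(n)] for i in range(n)]
for row in mask:
    print(" ".join("x" if allowed else "." for allowed in row))
# x . . .
# x x . .
# x x x .
# x x x x
```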

Here’s the high-level flow:

Text: "I like transformers"
         ↓
    [Tokenization]     → Convert words to token IDs
         ↓
    [Embeddings]       → Look up vectors for each token + position
         ↓
    [Self-Attention]   → Each token looks at previous tokens
         ↓
    [Feed-Forward]     → Process each position independently
         ↓
    [Layer Norm]       → Normalize activations (with residual connections)
         ↓
    [Output Projection] → Convert back to vocabulary-sized predictions
         ↓
    [Loss Calculation]  → How wrong are we?

We’ll spend one notebook on each major step, showing every calculation.
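
One ingredient shows up twice in that flow: inside self-attention, where it turns scores into attention weights, and at the output, where it turns vocabulary-sized predictions into probabilities. That ingredient is the softmax, and in our pure-Python style it is just a few lines:

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    m = max(scores)                            # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax([2.0, 1.0, 0.1]))
# approximately [0.659, 0.242, 0.099]
```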

The Training Loop: Forward, Backward, Update

Training a neural network is conceptually simple. You repeat three steps:

1. Forward pass: Run the input through the model, get predictions, measure error (the “loss”)

2. Backward pass: For every parameter in the model, calculate: “if I nudge this parameter slightly, how much does the loss change?” This is the gradient.

3. Update: Nudge every parameter in the direction that reduces the loss.

Repeat a few million times. The loss gets smaller. The model gets smarter.

That’s the entire algorithm. The complexity is in the details—and we’re going to see every detail.
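
To make those three steps tangible before we scale them up, here is the same loop on a deliberately tiny one-parameter model, y = w * x, fit to a single data point. The gradient here is estimated by literally nudging the parameter; the notebooks compute gradients exactly via the chain rule instead, but the idea is the same.

```python
# Toy training loop: fit y = w * x to one data point (x = 2, target = 6).
x, target = 2.0, 6.0
w = 0.5            # our single parameter, badly initialized
lr = 0.05          # learning rate
eps = 1e-6         # size of the nudge used to estimate the gradient

def loss(w):
    prediction = w * x
    return (prediction - target) ** 2              # squared error

for _ in range(50):
    current = loss(w)                              # 1. forward pass: how wrong are we?
    gradient = (loss(w + eps) - current) / eps     # 2. backward pass (here: a numeric nudge)
    w -= lr * gradient                             # 3. update: move against the gradient

print(round(w, 3))   # ~3.0, since 3.0 * 2 = 6
```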

What You’ll Need to Follow Along

Math background: Basic calculus (derivatives, chain rule, partial derivatives) and linear algebra (matrix multiplication, vectors). If you remember what a dot product is and can take a derivative, you’re good.

Programming: We use pure Python—no NumPy, no PyTorch. Everything is explicit lists and loops so you can see exactly what’s happening. (This is intentionally inefficient. We’re optimizing for clarity, not speed.)
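
As a taste of that style, here is one way the humble dot product looks when written with an explicit loop rather than a library call; the larger operations in the notebooks are written in the same explicit way.

```python
def dot(a, b):
    """Dot product of two equal-length lists of numbers."""
    total = 0.0
    for a_i, b_i in zip(a, b):
        total += a_i * b_i
    return total

print(dot([1.0, 2.0, 3.0], [4.0, 5.0, 6.0]))   # 32.0
```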

Patience: Some notebooks have a lot of numbers. That’s the point. You don’t have to verify every calculation, but knowing you could is what makes this different from a high-level explanation.

You don’t need a PhD. You don’t need to have trained models before. You just need to be willing to follow the math step by step.

Chapter Overview

Forward Pass (Notebooks 01-07)

| Notebook | What We Calculate |
|---|---|
| 01 - Tokenization & Embeddings | Convert “I like transformers” to vectors |
| 02 - QKV Projections | Create Query, Key, Value representations for attention |
| 03 - Attention Scores | Compute how much each token attends to others |
| 04 - Multi-Head Attention | Combine multiple attention “perspectives” |
| 05 - Feed-Forward Network | Apply non-linear transformations |
| 06 - Layer Normalization | Stabilize activations with residual connections |
| 07 - Cross-Entropy Loss | Measure prediction error |

Backward Pass (Notebooks 08-09)

| Notebook | What We Calculate |
|---|---|
| 08 - Loss Gradients | Gradient of loss with respect to output logits |
| 09 - Backpropagation | Gradients for every layer via chain rule |

Optimization (Notebook 10)

| Notebook | What We Calculate |
|---|---|
| 10 - AdamW Optimizer | Update all parameters using adaptive learning rates |

Let’s Begin

Each notebook builds on the previous one, so going in order is recommended for your first read. All calculations are executable—run the cells, change the numbers, see what happens.

Ready? Let’s start by converting text into numbers.