The Architecture That Changed Everything¶
In 2017, a team at Google published a paper with an unusually bold title: “Attention Is All You Need.” The paper introduced the transformer architecture, and it’s no exaggeration to say it changed the trajectory of artificial intelligence.
Before transformers, the dominant approach for language tasks was recurrent neural networks (RNNs). RNNs process text sequentially—one word at a time, left to right—maintaining a hidden state that carries information forward. It’s intuitive: that’s how we read, after all.
But sequential processing has a fatal flaw: it’s slow. You can’t start processing word 5 until you’ve finished word 4. On modern GPUs—which excel at parallel computation—this is a massive waste. Training large RNNs took weeks.
The transformer’s key insight was to replace recurrence with attention: a mechanism that lets every position look at every other position simultaneously. Instead of processing sequentially, transformers process all positions in parallel, using learned attention patterns to capture relationships. A sentence that took an RNN 100 sequential steps can be processed by a transformer in one parallel operation.
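To make this concrete, here is a tiny illustrative sketch (not part of the model we'll build) contrasting an RNN-style loop, which must run one step per position, with a single attention computation that updates every position at once. The dimensions and random weights are arbitrary choices for the example.

# Illustration only: sequential recurrence vs. one parallel attention step
import torch
import torch.nn.functional as F

seq_len, d_model = 100, 16                 # arbitrary example sizes
x = torch.randn(seq_len, d_model)          # a 100-token "sentence" as random vectors

# RNN-style: 100 sequential steps; step t cannot start until step t-1 is done
W_x, W_h = torch.randn(d_model, d_model), torch.randn(d_model, d_model)
h = torch.zeros(d_model)
for t in range(seq_len):
    h = torch.tanh(x[t] @ W_x + h @ W_h)

# Attention-style: all pairwise interactions in a couple of matrix multiplies
scores = x @ x.T / d_model ** 0.5          # (100, 100) similarity of every pair
weights = F.softmax(scores, dim=-1)
out = weights @ x                          # every position updated in parallel
print(out.shape)                           # torch.Size([100, 16])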
This parallelism unlocked scale. Suddenly you could train on billions of tokens. GPT-2, GPT-3, GPT-4, Claude, LLaMA, Gemini—all transformers. The architecture proved so effective that it’s now used not just for language, but for images, audio, video, and even protein structures.
What We’re Building¶
In this chapter, we’ll build a complete transformer from scratch in PyTorch. Not a toy model that skips the hard parts—the real thing, with every component implemented and explained.
We’re building a decoder-only transformer (the architecture used by GPT, Claude, and LLaMA). “Decoder-only” means it generates text autoregressively—predicting one token at a time, each prediction conditioned on all previous tokens. The original transformer paper had both an encoder and decoder (for translation); modern language models found that decoder-only works great and is simpler.
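To make "conditioned on all previous tokens" concrete, the sketch below builds the kind of causal mask we'll implement in notebook 02: row i marks which positions token i is allowed to attend to, so no position can peek at the future. This is just the mask, not the full attention computation.

# A 5×5 causal mask for a 5-token input (illustration; built properly in notebook 02)
import torch

seq_len = 5
causal_mask = torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))
print(causal_mask.int())
# tensor([[1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0],
#         [1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0],
#         [1, 1, 1, 1, 1]], dtype=torch.int32)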
Here’s the high-level architecture:
Input tokens: [The, cat, sat, on, the]
↓
┌─────────────────────────────────────┐
│ Token Embedding + Position │ Convert tokens to vectors
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Transformer Block × N │ The core of the model
│ ┌────────────────────────────────┐ │
│ │ Multi-Head Self-Attention │ │ Tokens communicate
│ │ + Residual + LayerNorm │ │
│ └────────────────────────────────┘ │
│ ┌────────────────────────────────┐ │
│ │ Feed-Forward Network │ │ Tokens compute
│ │ + Residual + LayerNorm │ │
│ └────────────────────────────────┘ │
└─────────────────────────────────────┘
↓
┌─────────────────────────────────────┐
│ Final LayerNorm + Linear │ Project to vocabulary
└─────────────────────────────────────┘
↓
Output logits: [0.1, 0.3, 8.2, ...]   Scores for each possible next token

Each transformer block contains two sub-layers:
Multi-Head Self-Attention: Where tokens “talk” to each other. Each position can attend to all previous positions, gathering relevant context.
Feed-Forward Network: Where tokens “think” independently. A two-layer MLP processes each position’s representation.
Both sub-layers use residual connections (adding the input to the output) and layer normalization (stabilizing activations). These aren’t optional nice-to-haves—they’re essential for training deep networks.
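To preview how the two sub-layers, residuals, and layer norms compose, here is a minimal shape-level sketch of one block. It uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the attention we'll write ourselves, follows GPT-2-style pre-LayerNorm ordering, and omits causal masking and dropout; treat it as an outline, not the implementation we'll build in notebook 05.

# Shape-level sketch of one transformer block (stand-in modules, no masking/dropout)
import torch
import torch.nn as nn

class TransformerBlockSketch(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)   # tokens communicate
        x = x + a                                       # residual connection
        x = x + self.ffn(self.ln2(x))                   # tokens compute, plus residual
        return x

block = TransformerBlockSketch()
print(block(torch.randn(1, 5, 256)).shape)              # torch.Size([1, 5, 256])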
The Learning Path¶
We’ll build each component step by step:
| Notebook | Topic | What You’ll Learn |
|---|---|---|
| 01 | Token Embeddings | How text becomes vectors; positional encoding approaches (learned, ALiBi, RoPE) |
| 02 | Attention | The Q, K, V mechanism; scaled dot-product attention; causal masking |
| 03 | Multi-Head Attention | Running parallel attention heads; why multiple heads help |
| 04 | Feed-Forward Networks | The “thinking” layer; GELU activation; the 4× expansion pattern |
| 05 | Transformer Block | Combining attention + FFN with residuals and layer norm |
| 06 | Complete Model | Stacking blocks; the full forward pass; parameter counting |
| 07 | Training | Gradient accumulation; validation; the training loop |
| 08 | KV-Cache | Efficient inference; trading memory for speed |
| 09 | Interpretability | Looking inside: attention patterns, logit lens, activation patching |
Each notebook builds on the previous ones. By the end, you’ll have a working transformer that you could (with enough compute) train on real text.
Key Hyperparameters¶
Before diving in, let’s establish the key dimensions that define a transformer. Understanding these will help you read the code and scale models up or down.
| Parameter | Symbol | Description | Our Model | GPT-2 Small | GPT-3 |
|---|---|---|---|---|---|
| Embedding dimension | $d_{\text{model}}$ | Size of token representations | 256 | 768 | 12,288 |
| Number of heads | $n_{\text{heads}}$ | Parallel attention mechanisms | 4 | 12 | 96 |
| Head dimension | $d_{\text{head}}$ | Size per attention head | 64 | 64 | 128 |
| Number of layers | $n_{\text{layers}}$ | Stacked transformer blocks | 4 | 12 | 96 |
| FFN hidden size | $d_{\text{ff}}$ | Feed-forward inner dimension | 1024 | 3072 | 49,152 |
| Vocabulary size | $\vert V \vert$ | Number of unique tokens | 10,000 | 50,257 | 50,257 |
| Context length | $n_{\text{ctx}}$ | Maximum sequence length | 512 | 1,024 | 2,048 |
A few patterns to notice:
The 4× expansion: The feed-forward layer expands to 4× the embedding size, then projects back down. This expansion gives the model more capacity to learn complex transformations.
Heads split the embedding: Each attention head operates on a slice of the embedding (head dimension = embedding dimension ÷ number of heads). More heads means each head is smaller, but you get more parallel “perspectives.” (Both relationships are checked numerically just below.)
Depth scales with width: Larger models use both more layers and larger dimensions. GPT-3’s 96 layers would be unstable with GPT-2’s smaller dimensions.
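A quick numeric check of the first two patterns, using the values from the table above:

# Head dimension and FFN size follow directly from d_model and n_heads
for name, d_model, n_heads, d_ff in [("our model", 256, 4, 1024),
                                     ("GPT-2 Small", 768, 12, 3072)]:
    print(f"{name:12}  d_head = {d_model // n_heads}   d_ff = {d_ff} = {d_ff // d_model} × d_model")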
# Let's verify PyTorch is set up correctly
import torch
import torch.nn as nn

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
else:
    print("Running on CPU")

PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
# Our model configuration
config = {
'd_model': 256, # Embedding dimension
'n_heads': 4, # Number of attention heads
'n_layers': 4, # Number of transformer blocks
'd_ff': 1024, # Feed-forward hidden dimension (4 × d_model)
'vocab_size': 10000, # Vocabulary size
'max_seq_len': 512, # Maximum sequence length
'dropout': 0.1, # Dropout rate
}
print("Model Configuration")
print("=" * 40)
for k, v in config.items():
    print(f" {k:15} = {v}")

Model Configuration
========================================
d_model = 256
n_heads = 4
n_layers = 4
d_ff = 1024
vocab_size = 10000
max_seq_len = 512
dropout = 0.1
Parameter Count: Where Do All the Numbers Live?¶
Understanding where parameters come from helps you reason about model capacity and memory. Let’s count them:
Token Embeddings:
One $d_{\text{model}}$-dimensional vector per vocabulary token: $|V| \times d_{\text{model}}$ parameters
For us: 10,000 × 256 = 2,560,000 parameters
Position Embeddings (if learned):
One $d_{\text{model}}$-dimensional vector per position, up to the maximum sequence length: $n_{\text{ctx}} \times d_{\text{model}}$ parameters
For us: 512 × 256 = 131,072 parameters
Per Transformer Block:
Attention: $4\,d_{\text{model}}^2$ (for Q, K, V, and output projections)
FFN: $2\,d_{\text{model}}\,d_{\text{ff}}$ (up projection and down projection)
LayerNorm: $4\,d_{\text{model}}$ (two layer norms, each with scale and shift)
Output Projection:
A $d_{\text{model}} \times |V|$ matrix that projects from embedding space back to vocabulary (often tied with input embeddings)
# Let's count parameters for our configuration
d_model = config['d_model']
n_heads = config['n_heads']
n_layers = config['n_layers']
d_ff = config['d_ff']
vocab_size = config['vocab_size']
max_seq_len = config['max_seq_len']
# Embeddings
token_embed_params = vocab_size * d_model
pos_embed_params = max_seq_len * d_model
# Per block
attention_params = 4 * d_model * d_model # Q, K, V, O projections (ignoring biases for simplicity)
ffn_params = 2 * d_model * d_ff # Up and down projections
layernorm_params = 4 * d_model # 2 layer norms × (scale + shift)
block_params = attention_params + ffn_params + layernorm_params
# Total
total_block_params = n_layers * block_params
output_proj_params = d_model * vocab_size # Often tied with token embeddings
final_ln_params = 2 * d_model
total_params = token_embed_params + pos_embed_params + total_block_params + final_ln_params
# Note: if we tie embeddings, we don't count output_proj_params separately
print("Parameter Count Breakdown")
print("=" * 50)
print(f"Token embeddings: {token_embed_params:>12,} ({token_embed_params/1e6:.2f}M)")
print(f"Position embeddings: {pos_embed_params:>12,} ({pos_embed_params/1e6:.2f}M)")
print()
print(f"Per transformer block:")
print(f" Attention (Q,K,V,O): {attention_params:>12,}")
print(f" Feed-forward: {ffn_params:>12,}")
print(f" LayerNorm (×2): {layernorm_params:>12,}")
print(f" Block total: {block_params:>12,}")
print()
print(f"All {n_layers} blocks: {total_block_params:>12,} ({total_block_params/1e6:.2f}M)")
print(f"Final LayerNorm: {final_ln_params:>12,}")
print("=" * 50)
print(f"Total parameters: {total_params:>12,} ({total_params/1e6:.2f}M)")Parameter Count Breakdown
==================================================
Token embeddings: 2,560,000 (2.56M)
Position embeddings: 131,072 (0.13M)
Per transformer block:
Attention (Q,K,V,O): 262,144
Feed-forward: 524,288
LayerNorm (×2): 1,024
Block total: 787,456
All 4 blocks: 3,149,824 (3.15M)
Final LayerNorm: 512
==================================================
Total parameters: 5,841,408 (5.84M)
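The comment about tying embeddings is worth seeing concretely. Weight tying (used by GPT-2, among other models) reuses the token-embedding matrix as the output projection, so those vocab_size × d_model parameters are stored and learned only once. A minimal sketch, with illustrative variable names rather than the ones we'll use later:

# Weight tying sketch: the output projection shares the token embedding matrix
import torch.nn as nn

d_model, vocab_size = 256, 10000
token_embedding = nn.Embedding(vocab_size, d_model)      # weight shape (10000, 256)
lm_head = nn.Linear(d_model, vocab_size, bias=False)     # weight shape (10000, 256)

lm_head.weight = token_embedding.weight                  # now one shared tensor
print(lm_head.weight is token_embedding.weight)          # True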
A Note on Scale¶
Our model is tiny—about 5.8 million parameters. For context:
| Model | Parameters | Relative Size |
|---|---|---|
| Our model | ~5M | 1× |
| GPT-2 Small | 117M | 20× |
| GPT-2 Large | 774M | 130× |
| GPT-3 | 175B | 30,000× |
| GPT-4 | ~1.8T (rumored) | 300,000× |
The beauty of this: the architecture is identical. The same attention mechanism, the same feed-forward structure, the same residual connections. GPT-3 is just our model with bigger matrices and more layers.
Understanding our 5M parameter model means understanding GPT-3. The math scales; the concepts don’t change.
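As a small demonstration that the math really does scale, we can plug GPT-2 Small's dimensions from the hyperparameter table into the same counting formula we just used. The result lands near the commonly cited size for GPT-2 Small (published figures vary between 117M and 124M depending on what is counted and whether embeddings are tied).

# Same counting formula as above, with GPT-2 Small's dimensions
d_model, n_layers, d_ff = 768, 12, 3072
vocab_size, max_seq_len = 50257, 1024

embedding_params = vocab_size * d_model + max_seq_len * d_model   # token + position
per_block = 4 * d_model * d_model + 2 * d_model * d_ff + 4 * d_model
total = embedding_params + n_layers * per_block + 2 * d_model     # + final LayerNorm

print(f"GPT-2 Small via this formula: {total / 1e6:.1f}M parameters")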
What You’ll Need¶
Prerequisites:
Basic Python and PyTorch (tensors, modules, autograd)
Linear algebra fundamentals (matrix multiplication, transpose, dot products)
Some calculus (we’ll compute gradients, but PyTorch does the heavy lifting)
Mindset:
Run the code cells—this is meant to be interactive
Modify things and see what breaks
The goal is understanding, not just running
Each notebook is self-contained with working code. You can run them in order, or jump to a specific topic if you want to understand one component in isolation.
Let’s Build¶
We’ll start where every language model starts: turning text into numbers.
In the next notebook, we’ll implement token embeddings and explore three different approaches to positional encoding—learned embeddings (GPT-2 style), ALiBi (BLOOM style), and RoPE (LLaMA style). Each has its trade-offs, and understanding them will give you intuition for how transformers represent sequences.
Ready? Let’s go.