
Building a Transformer from Scratch

The Architecture That Changed Everything

In 2017, a team at Google published a paper with an unusually bold title: “Attention Is All You Need.” The paper introduced the transformer architecture, and it’s no exaggeration to say it changed the trajectory of artificial intelligence.

Before transformers, the dominant approach for language tasks was recurrent neural networks (RNNs). RNNs process text sequentially—one word at a time, left to right—maintaining a hidden state that carries information forward. It’s intuitive: that’s how we read, after all.

But sequential processing has a fatal flaw: it’s slow. You can’t start processing word 5 until you’ve finished word 4. On modern GPUs—which excel at parallel computation—this is a massive waste. Training large RNNs took weeks.

The transformer’s key insight was to replace recurrence with attention: a mechanism that lets every position look at every other position simultaneously. Instead of processing sequentially, transformers process all positions in parallel, using learned attention patterns to capture relationships. A sentence that took an RNN 100 sequential steps can be processed by a transformer in one parallel operation.
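
To make the contrast concrete, here is a minimal sketch with toy sizes, no learned projections, and no masking (not the implementation we build in later notebooks) of the two processing styles over a 100-token sequence:

# Sequential RNN steps vs. one-shot attention (illustrative sketch only)
import torch

seq_len, d_model = 100, 64
x = torch.randn(seq_len, d_model)        # one 100-token sequence of vectors

# RNN-style: 100 dependent steps; step t must wait for step t-1
rnn = torch.nn.RNNCell(d_model, d_model)
h = torch.zeros(d_model)
for t in range(seq_len):
    h = rnn(x[t].unsqueeze(0), h.unsqueeze(0)).squeeze(0)

# Attention-style: all pairwise interactions in a single set of matmuls
q = k = v = x                             # self-attention, projections omitted
scores = q @ k.T / d_model ** 0.5         # (100, 100) attention scores at once
out = torch.softmax(scores, dim=-1) @ v   # (100, 64) updated representations
print(out.shape)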

This parallelism unlocked scale. Suddenly you could train on billions of tokens. GPT-2, GPT-3, GPT-4, Claude, LLaMA, Gemini—all transformers. The architecture proved so effective that it’s now used not just for language, but for images, audio, video, and even protein structures.

What We’re Building

In this chapter, we’ll build a complete transformer from scratch in PyTorch. Not a toy model that skips the hard parts—the real thing, with every component implemented and explained.

We’re building a decoder-only transformer (the architecture used by GPT, Claude, and LLaMA). “Decoder-only” means it generates text autoregressively—predicting one token at a time, each prediction conditioned on all previous tokens. The original transformer paper had both an encoder and decoder (for translation); modern language models found that decoder-only works great and is simpler.
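
As a schematic of what “autoregressive” means in code, here is a hedged sketch: the model function below is just a stub that returns random logits (the real transformer replaces it by the end of this chapter), and the token ids are made up.

# Autoregressive generation loop (greedy decoding), with a stub model
import torch

vocab_size = 10000

def model(tokens):
    # Stand-in for the transformer: returns one row of logits per position
    return torch.randn(tokens.shape[0], vocab_size)

tokens = torch.tensor([5, 42, 7])            # hypothetical ids for a short prompt
for _ in range(5):                           # generate 5 more tokens
    logits = model(tokens)                   # conditioned on all tokens so far
    next_token = logits[-1].argmax()         # greedy: highest-scoring next token
    tokens = torch.cat([tokens, next_token.unsqueeze(0)])
print(tokens)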

Here’s the high-level architecture:

Input tokens: [The, cat, sat, on, the]
                    ↓
┌─────────────────────────────────────┐
│     Token Embedding + Position      │  Convert tokens to vectors
└─────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────┐
│       Transformer Block × N         │  The core of the model
│  ┌────────────────────────────────┐ │
│  │   Multi-Head Self-Attention    │ │  Tokens communicate
│  │   + Residual + LayerNorm       │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │   Feed-Forward Network         │ │  Tokens compute
│  │   + Residual + LayerNorm       │ │
│  └────────────────────────────────┘ │
└─────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────┐
│     Final LayerNorm + Linear        │  Project to vocabulary
└─────────────────────────────────────┘
                    ↓
Output logits: [0.1, 0.3, 8.2, ...]    Scores for each possible next token

Each transformer block contains two sub-layers:

  1. Multi-Head Self-Attention: Where tokens “talk” to each other. Each position can attend to all previous positions, gathering relevant context.

  2. Feed-Forward Network: Where tokens “think” independently. A two-layer MLP processes each position’s representation.

Both sub-layers use residual connections (adding the input to the output) and layer normalization (stabilizing activations). These aren’t optional nice-to-haves—they’re essential for training deep networks.
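
To preview where we are headed, here is a compact sketch of one block that uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the attention we will write ourselves; it uses the pre-norm ordering (one common variant) and omits causal masking and dropout. Notebook 05 builds the real version piece by piece.

# One transformer block: attention + FFN, each wrapped in residual + LayerNorm
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # tokens "talk" to each other
        x = x + a                                      # residual connection
        x = x + self.ffn(self.ln2(x))                  # tokens "think" independently
        return x

x = torch.randn(1, 5, 256)   # (batch, seq_len, d_model)
print(Block()(x).shape)      # torch.Size([1, 5, 256])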

The Learning Path

We’ll build each component step by step:

Notebook   Topic                   What You’ll Learn
01         Token Embeddings        How text becomes vectors; positional encoding approaches (learned, ALiBi, RoPE)
02         Attention               The Q, K, V mechanism; scaled dot-product attention; causal masking
03         Multi-Head Attention    Running parallel attention heads; why multiple heads help
04         Feed-Forward Networks   The “thinking” layer; GELU activation; the 4× expansion pattern
05         Transformer Block       Combining attention + FFN with residuals and layer norm
06         Complete Model          Stacking blocks; the full forward pass; parameter counting
07         Training                Gradient accumulation; validation; the training loop
08         KV-Cache                Efficient inference; trading memory for speed
09         Interpretability        Looking inside: attention patterns, logit lens, activation patching

Each notebook builds on the previous ones. By the end, you’ll have a working transformer that you could (with enough compute) train on real text.

Key Hyperparameters

Before diving in, let’s establish the key dimensions that define a transformer. Understanding these will help you read the code and scale models up or down.

Parameter             Symbol                         Description                     Our Model   GPT-2 Small   GPT-3
Embedding dimension   $d_{model}$                    Size of token representations   256         768           12,288
Number of heads       $n_{heads}$                    Parallel attention mechanisms   4           12            96
Head dimension        $d_k = d_{model}/n_{heads}$    Size per attention head         64          64            128
Number of layers      $n_{layers}$                   Stacked transformer blocks      4           12            96
FFN hidden size       $d_{ff}$                       Feed-forward inner dimension    1,024       3,072         49,152
Vocabulary size       $V$                            Number of unique tokens         10,000
Context length        $n_{ctx}$                      Maximum sequence length         512         1,024         2,048

A few patterns to notice:

  • $d_{ff} \approx 4 \times d_{model}$: The feed-forward layer expands to 4× the embedding size, then projects back down. This expansion gives the model more capacity to learn complex transformations.

  • $d_k = d_{model} / n_{heads}$: Each attention head operates on a slice of the embedding. More heads means each head is smaller, but you get more parallel “perspectives.” (Both ratios are checked in the short snippet after this list.)

  • Depth scales with width: Larger models use both more layers and larger dimensions. GPT-3’s 96 layers would be unstable with GPT-2’s smaller dimensions.
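
A quick sanity check of the first two ratios, using the numbers from the table above:

# Verify d_k = d_model / n_heads and d_ff = 4 * d_model for each model size
for name, d_model, n_heads, d_ff in [
    ("our model", 256, 4, 1024),
    ("GPT-2 Small", 768, 12, 3072),
    ("GPT-3", 12288, 96, 49152),
]:
    d_k = d_model // n_heads
    print(f"{name:12}  d_k = {d_k:4}   d_ff / d_model = {d_ff / d_model:.0f}")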

# Let's verify PyTorch is set up correctly
import torch
import torch.nn as nn

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
else:
    print("Running on CPU")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
# Our model configuration
config = {
    'd_model': 256,       # Embedding dimension
    'n_heads': 4,         # Number of attention heads
    'n_layers': 4,        # Number of transformer blocks
    'd_ff': 1024,         # Feed-forward hidden dimension (4 × d_model)
    'vocab_size': 10000,  # Vocabulary size
    'max_seq_len': 512,   # Maximum sequence length
    'dropout': 0.1,       # Dropout rate
}

print("Model Configuration")
print("=" * 40)
for k, v in config.items():
    print(f"  {k:15} = {v}")
Model Configuration
========================================
  d_model         = 256
  n_heads         = 4
  n_layers        = 4
  d_ff            = 1024
  vocab_size      = 10000
  max_seq_len     = 512
  dropout         = 0.1

Parameter Count: Where Do All the Numbers Live?

Understanding where parameters come from helps you reason about model capacity and memory. Let’s count them:

Token Embeddings: $|V| \times d_{model}$

  • One $d_{model}$-dimensional vector per vocabulary token

  • For us: 10,000 × 256 = 2,560,000 parameters

Position Embeddings (if learned): $n_{ctx} \times d_{model}$

  • One vector per position

  • For us: 512 × 256 = 131,072 parameters

Per Transformer Block:

  • Attention: $4 \times d_{model}^2$ (for Q, K, V, and output projections)

  • FFN: $2 \times d_{model} \times d_{ff}$ (up projection and down projection)

  • LayerNorm: $4 \times d_{model}$ (two layer norms, each with scale and shift)

Output Projection: $d_{model} \times |V|$

  • Projects from embedding space back to vocabulary (often tied with input embeddings)
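
A tiny sketch of that tying, assuming the shapes from our configuration (an illustration of the idea, not the chapter’s final model code):

# Weight tying: the output projection reuses the embedding matrix, so the
# |V| x d_model parameters are stored (and counted) only once
import torch.nn as nn

d_model, vocab_size = 256, 10000
token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = token_embedding.weight           # share one (10000, 256) matrix
print(lm_head.weight is token_embedding.weight)   # True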

# Let's count parameters for our configuration
d_model = config['d_model']
n_heads = config['n_heads']
n_layers = config['n_layers']
d_ff = config['d_ff']
vocab_size = config['vocab_size']
max_seq_len = config['max_seq_len']

# Embeddings
token_embed_params = vocab_size * d_model
pos_embed_params = max_seq_len * d_model

# Per block
attention_params = 4 * d_model * d_model  # Q, K, V, O projections (ignoring biases for simplicity)
ffn_params = 2 * d_model * d_ff           # Up and down projections
layernorm_params = 4 * d_model            # 2 layer norms × (scale + shift)
block_params = attention_params + ffn_params + layernorm_params

# Total
total_block_params = n_layers * block_params
output_proj_params = d_model * vocab_size  # Often tied with token embeddings
final_ln_params = 2 * d_model

total_params = token_embed_params + pos_embed_params + total_block_params + final_ln_params
# Note: if we tie embeddings, we don't count output_proj_params separately

print("Parameter Count Breakdown")
print("=" * 50)
print(f"Token embeddings:      {token_embed_params:>12,} ({token_embed_params/1e6:.2f}M)")
print(f"Position embeddings:   {pos_embed_params:>12,} ({pos_embed_params/1e6:.2f}M)")
print()
print(f"Per transformer block:")
print(f"  Attention (Q,K,V,O): {attention_params:>12,}")
print(f"  Feed-forward:        {ffn_params:>12,}")
print(f"  LayerNorm (×2):      {layernorm_params:>12,}")
print(f"  Block total:         {block_params:>12,}")
print()
print(f"All {n_layers} blocks:          {total_block_params:>12,} ({total_block_params/1e6:.2f}M)")
print(f"Final LayerNorm:       {final_ln_params:>12,}")
print("=" * 50)
print(f"Total parameters:      {total_params:>12,} ({total_params/1e6:.2f}M)")
Parameter Count Breakdown
==================================================
Token embeddings:         2,560,000 (2.56M)
Position embeddings:        131,072 (0.13M)

Per transformer block:
  Attention (Q,K,V,O):      262,144
  Feed-forward:             524,288
  LayerNorm (×2):             1,024
  Block total:              787,456

All 4 blocks:             3,149,824 (3.15M)
Final LayerNorm:                512
==================================================
Total parameters:         5,841,408 (5.84M)

A Note on Scale

Our model is tiny: about 5.8 million parameters. For context:

Model          Parameters          Relative Size
Our model      ~5M                 1×
GPT-2 Small    117M                20×
GPT-2 Large    774M                130×
GPT-3          175B                30,000×
GPT-4          ~1.8T (rumored)     300,000×

The beauty of this: the architecture is identical. The same attention mechanism, the same feed-forward structure, the same residual connections. GPT-3 is just our model with bigger matrices and more layers.

Understanding our 5M parameter model means understanding GPT-3. The math scales; the concepts don’t change.

What You’ll Need

Prerequisites:

  • Basic Python and PyTorch (tensors, modules, autograd)

  • Linear algebra fundamentals (matrix multiplication, transpose, dot products)

  • Some calculus (we’ll compute gradients, but PyTorch does the heavy lifting)

Mindset:

  • Run the code cells—this is meant to be interactive

  • Modify things and see what breaks

  • The goal is understanding, not just running

Each notebook is self-contained with working code. You can run them in order, or jump to a specific topic if you want to understand one component in isolation.

Let’s Build

We’ll start where every language model starts: turning text into numbers.

In the next notebook, we’ll implement token embeddings and explore three different approaches to positional encoding—learned embeddings (GPT-2 style), ALiBi (BLOOM style), and RoPE (LLaMA style). Each has its trade-offs, and understanding them will give you intuition for how transformers represent sequences.

Ready? Let’s go.