
Building a Transformer from Scratch

The Architecture That Changed Everything

In 2017, a team at Google published a paper with an unusually bold title: “Attention Is All You Need.” The paper introduced the transformer architecture, and it’s no exaggeration to say it changed the trajectory of artificial intelligence.

Before transformers, the dominant approach for language tasks was recurrent neural networks (RNNs). RNNs process text sequentially—one word at a time, left to right—maintaining a hidden state that carries information forward. It’s intuitive: that’s how we read, after all.

But sequential processing has a fatal flaw: it’s slow. You can’t start processing word 5 until you’ve finished word 4. On modern GPUs—which excel at parallel computation—this is a massive waste. Training large RNNs took weeks.

The transformer’s key insight was to replace recurrence with attention: a mechanism that lets every position look at every other position simultaneously. Instead of processing sequentially, transformers process all positions in parallel, using learned attention patterns to capture relationships. A sentence that took an RNN 100 sequential steps can be processed by a transformer in one parallel operation.
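
To make the contrast concrete, here is a minimal sketch with toy sizes, no learned projections, and no masking (not the implementation we build in later notebooks) of the two processing styles over a 100-token sequence:

# Sequential RNN steps vs. one-shot attention (illustrative sketch only)
import torch

seq_len, d_model = 100, 64
x = torch.randn(seq_len, d_model)        # one 100-token sequence of vectors

# RNN-style: 100 dependent steps; step t must wait for step t-1
rnn = torch.nn.RNNCell(d_model, d_model)
h = torch.zeros(d_model)
for t in range(seq_len):
    h = rnn(x[t].unsqueeze(0), h.unsqueeze(0)).squeeze(0)

# Attention-style: all pairwise interactions in a single set of matmuls
q = k = v = x                             # self-attention, projections omitted
scores = q @ k.T / d_model ** 0.5         # (100, 100) attention scores at once
out = torch.softmax(scores, dim=-1) @ v   # (100, 64) updated representations
print(out.shape)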

This parallelism unlocked scale. Suddenly you could train on billions of tokens. GPT-2, GPT-3, GPT-4, Claude, LLaMA, Gemini—all transformers. The architecture proved so effective that it’s now used not just for language, but for images, audio, video, and even protein structures.

What We’re Building

In this chapter, we’ll build a complete transformer from scratch in PyTorch. Not a toy model that skips the hard parts—the real thing, with every component implemented and explained.

We’re building a decoder-only transformer (the architecture used by GPT, Claude, and LLaMA). “Decoder-only” means it generates text autoregressively—predicting one token at a time, each prediction conditioned on all previous tokens. The original transformer paper had both an encoder and decoder (for translation); modern language models found that decoder-only works great and is simpler.
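
As a schematic of what “autoregressive” means in code, here is a hedged sketch: the model function below is just a stub that returns random logits (the real transformer replaces it by the end of this chapter), and the token ids are made up.

# Autoregressive generation loop (greedy decoding), with a stub model
import torch

vocab_size = 10000

def model(tokens):
    # Stand-in for the transformer: returns one row of logits per position
    return torch.randn(tokens.shape[0], vocab_size)

tokens = torch.tensor([5, 42, 7])            # hypothetical ids for a short prompt
for _ in range(5):                           # generate 5 more tokens
    logits = model(tokens)                   # conditioned on all tokens so far
    next_token = logits[-1].argmax()         # greedy: highest-scoring next token
    tokens = torch.cat([tokens, next_token.unsqueeze(0)])
print(tokens)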

Here’s the high-level architecture:

Input tokens: [The, cat, sat, on, the]
                    ↓
┌─────────────────────────────────────┐
│     Token Embedding + Position      │  Convert tokens to vectors
└─────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────┐
│       Transformer Block × N         │  The core of the model
│  ┌────────────────────────────────┐ │
│  │   Multi-Head Self-Attention    │ │  Tokens communicate
│  │   + Residual + LayerNorm       │ │
│  └────────────────────────────────┘ │
│  ┌────────────────────────────────┐ │
│  │   Feed-Forward Network         │ │  Tokens compute
│  │   + Residual + LayerNorm       │ │
│  └────────────────────────────────┘ │
└─────────────────────────────────────┘
                    ↓
┌─────────────────────────────────────┐
│     Final LayerNorm + Linear        │  Project to vocabulary
└─────────────────────────────────────┘
                    ↓
Output logits: [0.1, 0.3, 8.2, ...]    Scores for each possible next token

Each transformer block contains two sub-layers:

  1. Multi-Head Self-Attention: Where tokens “talk” to each other. Each position can attend to all previous positions, gathering relevant context.

  2. Feed-Forward Network: Where tokens “think” independently. A two-layer MLP processes each position’s representation.

Both sub-layers use residual connections (adding the input to the output) and layer normalization (stabilizing activations). These aren’t optional nice-to-haves—they’re essential for training deep networks.
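
To preview where we are headed, here is a compact sketch of one block that uses PyTorch's built-in nn.MultiheadAttention as a stand-in for the attention we will write ourselves; it uses the pre-norm ordering (one common variant) and omits causal masking and dropout. Notebook 05 builds the real version piece by piece.

# One transformer block: attention + FFN, each wrapped in residual + LayerNorm
import torch
import torch.nn as nn

class Block(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )

    def forward(self, x):
        h = self.ln1(x)
        a, _ = self.attn(h, h, h, need_weights=False)  # tokens "talk" to each other
        x = x + a                                      # residual connection
        x = x + self.ffn(self.ln2(x))                  # tokens "think" independently
        return x

x = torch.randn(1, 5, 256)   # (batch, seq_len, d_model)
print(Block()(x).shape)      # torch.Size([1, 5, 256])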

The Learning Path

We’ll build each component step by step:

Notebook   Topic                   What You’ll Learn
01         Token Embeddings        How text becomes vectors; positional encoding approaches (learned, ALiBi, RoPE)
02         Attention               The Q, K, V mechanism; scaled dot-product attention; causal masking
03         Multi-Head Attention    Running parallel attention heads; why multiple heads help
04         Feed-Forward Networks   The “thinking” layer; GELU activation; the 4× expansion pattern
05         Transformer Block       Combining attention + FFN with residuals and layer norm
06         Complete Model          Stacking blocks; the full forward pass; parameter counting
07         Training                Gradient accumulation; validation; the training loop
08         KV-Cache                Efficient inference; trading memory for speed
09         Interpretability        Looking inside: attention patterns, logit lens, activation patching

Each notebook builds on the previous ones. By the end, you’ll have a working transformer that you could (with enough compute) train on real text.

Key Hyperparameters

Before diving in, let’s establish the key dimensions that define a transformer. Understanding these will help you read the code and scale models up or down.

Parameter             Symbol                         Description                     Our Model   GPT-2 Small   GPT-3
Embedding dimension   $d_{model}$                    Size of token representations   256         768           12,288
Number of heads       $n_{heads}$                    Parallel attention mechanisms   4           12            96
Head dimension        $d_k = d_{model}/n_{heads}$    Size per attention head         64          64            128
Number of layers      $n_{layers}$                   Stacked transformer blocks      4           12            96
FFN hidden size       $d_{ff}$                       Feed-forward inner dimension    1,024       3,072         49,152
Vocabulary size       $V$                            Number of unique tokens         10,000
Context length        $n_{ctx}$                      Maximum sequence length         512         1,024         2,048

A few patterns to notice:

  • $d_{ff} \approx 4 \times d_{model}$: The feed-forward layer expands to 4× the embedding size, then projects back down. This expansion gives the model more capacity to learn complex transformations.

  • $d_k = d_{model} / n_{heads}$: Each attention head operates on a slice of the embedding. More heads means each head is smaller, but you get more parallel “perspectives.” (Both ratios are checked in the short snippet after this list.)

  • Depth scales with width: Larger models use both more layers and larger dimensions. GPT-3’s 96 layers would be unstable with GPT-2’s smaller dimensions.
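
A quick sanity check of the first two ratios, using the numbers from the table above:

# Verify d_k = d_model / n_heads and d_ff = 4 * d_model for each model size
for name, d_model, n_heads, d_ff in [
    ("our model", 256, 4, 1024),
    ("GPT-2 Small", 768, 12, 3072),
    ("GPT-3", 12288, 96, 49152),
]:
    d_k = d_model // n_heads
    print(f"{name:12}  d_k = {d_k:4}   d_ff / d_model = {d_ff / d_model:.0f}")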

# Let's verify PyTorch is set up correctly
import torch
import torch.nn as nn

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
else:
    print("Running on CPU")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
# Our model configuration
config = {
    'd_model': 256,       # Embedding dimension
    'n_heads': 4,         # Number of attention heads
    'n_layers': 4,        # Number of transformer blocks
    'd_ff': 1024,         # Feed-forward hidden dimension (4 × d_model)
    'vocab_size': 10000,  # Vocabulary size
    'max_seq_len': 512,   # Maximum sequence length
    'dropout': 0.1,       # Dropout rate
}

print("Model Configuration")
print("=" * 40)
for k, v in config.items():
    print(f"  {k:15} = {v}")
Model Configuration
========================================
  d_model         = 256
  n_heads         = 4
  n_layers        = 4
  d_ff            = 1024
  vocab_size      = 10000
  max_seq_len     = 512
  dropout         = 0.1

Parameter Count: Where Do All the Numbers Live?

Understanding where parameters come from helps you reason about model capacity and memory. Let’s count them:

Token Embeddings: $|V| \times d_{model}$

  • One $d_{model}$-dimensional vector per vocabulary token

  • For us: 10,000 × 256 = 2,560,000 parameters

Position Embeddings (if learned): $n_{ctx} \times d_{model}$

  • One vector per position

  • For us: 512 × 256 = 131,072 parameters

Per Transformer Block:

  • Attention: $4 \times d_{model}^2$ (for Q, K, V, and output projections)

  • FFN: $2 \times d_{model} \times d_{ff}$ (up projection and down projection)

  • LayerNorm: $4 \times d_{model}$ (two layer norms, each with scale and shift)

Output Projection: $d_{model} \times |V|$

  • Projects from embedding space back to vocabulary (often tied with input embeddings)
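
A tiny sketch of that tying, assuming the shapes from our configuration (an illustration of the idea, not the chapter’s final model code):

# Weight tying: the output projection reuses the embedding matrix, so the
# |V| x d_model parameters are stored (and counted) only once
import torch.nn as nn

d_model, vocab_size = 256, 10000
token_embedding = nn.Embedding(vocab_size, d_model)
lm_head = nn.Linear(d_model, vocab_size, bias=False)
lm_head.weight = token_embedding.weight           # share one (10000, 256) matrix
print(lm_head.weight is token_embedding.weight)   # True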

# Let's count parameters for our configuration
d_model = config['d_model']
n_heads = config['n_heads']
n_layers = config['n_layers']
d_ff = config['d_ff']
vocab_size = config['vocab_size']
max_seq_len = config['max_seq_len']

# Embeddings
token_embed_params = vocab_size * d_model
pos_embed_params = max_seq_len * d_model

# Per block
attention_params = 4 * d_model * d_model  # Q, K, V, O projections (ignoring biases for simplicity)
ffn_params = 2 * d_model * d_ff           # Up and down projections
layernorm_params = 4 * d_model            # 2 layer norms × (scale + shift)
block_params = attention_params + ffn_params + layernorm_params

# Total
total_block_params = n_layers * block_params
output_proj_params = d_model * vocab_size  # Often tied with token embeddings
final_ln_params = 2 * d_model

total_params = token_embed_params + pos_embed_params + total_block_params + final_ln_params
# Note: if we tie embeddings, we don't count output_proj_params separately

print("Parameter Count Breakdown")
print("=" * 50)
print(f"Token embeddings:      {token_embed_params:>12,} ({token_embed_params/1e6:.2f}M)")
print(f"Position embeddings:   {pos_embed_params:>12,} ({pos_embed_params/1e6:.2f}M)")
print()
print(f"Per transformer block:")
print(f"  Attention (Q,K,V,O): {attention_params:>12,}")
print(f"  Feed-forward:        {ffn_params:>12,}")
print(f"  LayerNorm (×2):      {layernorm_params:>12,}")
print(f"  Block total:         {block_params:>12,}")
print()
print(f"All {n_layers} blocks:          {total_block_params:>12,} ({total_block_params/1e6:.2f}M)")
print(f"Final LayerNorm:       {final_ln_params:>12,}")
print("=" * 50)
print(f"Total parameters:      {total_params:>12,} ({total_params/1e6:.2f}M)")
Parameter Count Breakdown
==================================================
Token embeddings:         2,560,000 (2.56M)
Position embeddings:        131,072 (0.13M)

Per transformer block:
  Attention (Q,K,V,O):      262,144
  Feed-forward:             524,288
  LayerNorm (×2):             1,024
  Block total:              787,456

All 4 blocks:             3,149,824 (3.15M)
Final LayerNorm:                512
==================================================
Total parameters:         5,841,408 (5.84M)

A Note on Scale

Our model is tiny: about 5.8 million parameters. For context:

Model          Parameters          Relative Size
Our model      ~5M                 1×
GPT-2 Small    117M                20×
GPT-2 Large    774M                130×
GPT-3          175B                30,000×
GPT-4          ~1.8T (rumored)     300,000×

The beauty of this: the architecture is identical. The same attention mechanism, the same feed-forward structure, the same residual connections. GPT-3 is just our model with bigger matrices and more layers.

Understanding our 5M parameter model means understanding GPT-3. The math scales; the concepts don’t change.

What You’ll Need

Prerequisites:

  • Basic Python and PyTorch (tensors, modules, autograd)

  • Linear algebra fundamentals (matrix multiplication, transpose, dot products)

  • Some calculus (we’ll compute gradients, but PyTorch does the heavy lifting)

Mindset:

  • Run the code cells—this is meant to be interactive

  • Modify things and see what breaks

  • The goal is understanding, not just running

Each notebook is self-contained with working code. You can run them in order, or jump to a specific topic if you want to understand one component in isolation.

Let’s Build

We’ll start where every language model starts: turning text into numbers.

In the next notebook, we’ll implement token embeddings and explore three different approaches to positional encoding—learned embeddings (GPT-2 style), ALiBi (BLOOM style), and RoPE (LLaMA style). Each has its trade-offs, and understanding them will give you intuition for how transformers represent sequences.

Ready? Let’s go.