I’ve spent the last week or so building two educational projects to understand how transformer models work. It started as an overview, but once I had the high-level picture, I wanted to dig deeper and see all the math. So now you can too… every calculation, every design choice, every gradient flowing backward through the network.
The first project is a complete transformer implementation in PyTorch. The second is a step-by-step walkthrough of every single calculation in training a tiny transformer model by hand. They’re companions to each other - one shows you how to build a working model, the other shows you exactly what’s happening under the hood.
Why Build This?
Large language models like GPT, Claude, and others have become essential tools for developers and researchers. But understanding how they actually work requires getting your hands dirty with the mathematics and code.
I wanted to move beyond treating transformers as black boxes. So Claude, ironically, built them from scratch for me, documented every component, and calculated every derivative with its digital hand.
The Transformer Project
The transformer project is a complete decoder-only transformer (GPT architecture) built in PyTorch. It includes everything you need to train and understand a real language model:
Core Architecture
- Multi-head self-attention with KV-cache optimization
- Positional embeddings (learned, not sinusoidal)
- Feed-forward networks with GELU activation
- Pre-layer normalization architecture
- Residual connections throughout
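To make the list above concrete, here’s a minimal sketch of a pre-LN decoder block in PyTorch. The names and dimensions are illustrative rather than the repo’s actual code, it uses nn.MultiheadAttention in place of a hand-rolled attention module, and it omits the KV-cache plumbing:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """Pre-LN block: x + Attn(LN(x)), then x + FFN(LN(x))."""
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: each position may attend only to itself and earlier positions.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)                        # normalize *before* attention (pre-LN)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out                       # residual connection
        x = x + self.ffn(self.ln2(x))          # residual connection
        return x

block = DecoderBlock()
out = block(torch.randn(2, 8, 256))            # (batch=2, seq_len=8, d_model=256)
```

Pre-LN (normalizing before each sub-layer rather than after it) is widely used because it tends to make training noticeably more stable.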
Training Pipeline
- Streams from HuggingFace’s FineWeb dataset (10 billion tokens)
- Gradient accumulation for stable training
- Learning rate scheduling with warmup and cosine decay
- Train/val split for monitoring progress
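For a rough feel of the schedule and accumulation pieces, here’s a self-contained toy sketch; the hyperparameters and the linear stand-in model are made up for the example, not taken from the project:

```python
import math
import torch
import torch.nn as nn

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=100, max_steps=1000):
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

model = nn.Linear(16, 16)                      # toy stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=lr_at(0))
accum_steps = 8                                # micro-batches per optimizer step

for step in range(1000):
    optimizer.zero_grad()
    for _ in range(accum_steps):
        x = torch.randn(4, 16)                 # stand-in for a real token batch
        loss = model(x).pow(2).mean() / accum_steps
        loss.backward()                        # gradients accumulate across calls
    for group in optimizer.param_groups:       # apply the schedule before stepping
        group["lr"] = lr_at(step)
    optimizer.step()
```

Dividing each micro-batch loss by accum_steps keeps the accumulated gradient equal to what one big batch would have produced.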
Text Generation
- Advanced sampling strategies (greedy, top-k, top-p, combined; see the sketch below)
- KV-cache optimization that speeds up generation by 2-50x
- Interactive CLI for easy experimentation
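As a sketch of how a combined top-k / top-p filter can work (the function name and defaults are illustrative, not the repo’s interface):

```python
import torch
import torch.nn.functional as F

def sample_next_token(logits, temperature=0.8, top_k=50, top_p=0.9):
    """Filter a 1-D logits vector with top-k, then top-p (nucleus), then sample."""
    logits = logits / temperature

    # Top-k: keep only the k highest logits (torch.topk returns them sorted).
    topk_vals, topk_idx = torch.topk(logits, top_k)
    probs = F.softmax(topk_vals, dim=-1)

    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    cumulative = torch.cumsum(probs, dim=-1)
    keep = cumulative - probs < top_p          # always keeps at least the top token
    probs = probs * keep
    probs = probs / probs.sum()

    # Sample within the filtered set, then map back to a vocabulary index.
    choice = torch.multinomial(probs, num_samples=1)
    return topk_idx[choice]

logits = torch.randn(1000)                     # stand-in for one step's output logits
next_token = sample_next_token(logits)
```

Greedy decoding is the degenerate case: take the argmax and skip the sampling entirely.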
Interpretability Tools
- Logit lens: visualize how predictions evolve through layers (sketched below)
- Attention analysis: understand what each head focuses on
- Induction head detection: find pattern-matching circuits
- Activation patching: test which components are causally responsible for behaviors
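The logit lens is the easiest of these to sketch: take the residual stream after each block, push it through the final layer norm and the unembedding, and look at what the model “would predict” at that depth. The attribute names below (tok_emb, pos_emb, blocks, ln_f, lm_head) are assumptions for the illustration, not necessarily what the repo calls them:

```python
import torch

@torch.no_grad()
def logit_lens(model, tokens):
    """Project the residual stream after every block through the unembedding.

    Assumes a GPT-style model exposing .tok_emb, .pos_emb, .blocks, .ln_f,
    and .lm_head (illustrative names). tokens has shape (batch, seq_len).
    """
    x = model.tok_emb(tokens) + model.pos_emb(torch.arange(tokens.size(1)))
    per_layer_predictions = []
    for block in model.blocks:
        x = block(x)
        logits = model.lm_head(model.ln_f(x))      # "what would it say here?"
        per_layer_predictions.append(logits.argmax(dim=-1))
    return per_layer_predictions                   # predicted token ids per layer
```

Watching those per-layer predictions converge toward the final answer is a direct way to see where in the network a prediction settles.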
Every component includes extensive documentation explaining not just what it does, but why it’s designed that way. The code prioritizes clarity over performance - this is for learning, not production. I’ve tested my patience trying to finish a full training run on my AMD GPU and have had to restart multiple times. Soon I’ll probably have a checkpoint to share.
Attention to Detail
The attention-to-detail project takes a different approach. Instead of building a full model, it walks through every single calculation in training a tiny transformer:
What “by hand” means:
- 16-dimensional embeddings (vs GPT-3’s 12,288)
- 2 attention heads (vs GPT-3’s 96)
- 1 transformer layer (vs GPT-3’s 96)
- Training text: “I like transformers”
The project shows all calculations step-by-step:
- Forward pass: tokenization, embeddings, attention, feed-forward, loss
- Backward pass: gradients for every parameter using backpropagation and chain rule
- Optimization: AdamW updates with momentum and adaptive learning rates
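For a taste of what the optimization section works through, here’s a single AdamW update for one scalar parameter in plain Python. The numbers are made up for this example, not taken from the walkthrough:

```python
import math

# One AdamW step for a single scalar parameter, written out by hand.
theta, grad = 0.5, 0.2              # parameter value and its gradient (made up)
m, v, t = 0.0, 0.0, 1               # first/second moment estimates, step count
lr, beta1, beta2, eps, wd = 1e-3, 0.9, 0.999, 1e-8, 0.01

m = beta1 * m + (1 - beta1) * grad          # momentum: running mean of gradients
v = beta2 * v + (1 - beta2) * grad ** 2     # running mean of squared gradients
m_hat = m / (1 - beta1 ** t)                # bias correction (moments start at zero)
v_hat = v / (1 - beta2 ** t)

theta = theta - lr * wd * theta             # decoupled weight decay (the W in AdamW)
theta = theta - lr * m_hat / (math.sqrt(v_hat) + eps)

print(theta)                                # ≈ 0.498995
```

Do that for every parameter in the model, every step, and you have the optimizer.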
This isn’t about memorizing formulas. It’s about building intuition by seeing the actual numbers flow through the network. Every matrix multiplication is shown in full. Every Jacobian is derived. Every gradient is calculated explicitly.
The documentation includes interactive math rendering, color-coded matrices, and Python scripts you can run to verify every calculation yourself. No PyTorch, no NumPy.
What I Learned
From the transformer implementation:
- How attention mechanisms actually compute relevance between tokens
- Why KV-caching makes such a dramatic difference in generation speed (see the sketch after this list)
- What induction heads are and why they’re crucial for in-context learning
- How activation patching reveals causal structure in neural networks
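The KV-cache point is worth a small illustration. Without the cache, generating each new token recomputes keys and values for the entire prefix; with it, you compute one new key/value pair, append it, and attend with a single query. A sketch with illustrative shapes (not the repo’s code):

```python
import math
import torch

d_model, d_head = 256, 64
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

# Keys and values already computed for the 10 tokens generated so far.
cache_k = torch.randn(10, d_head)
cache_v = torch.randn(10, d_head)

x_new = torch.randn(1, d_model)             # hidden state of the newest token only

# Only one new key/value pair is computed; the cached ones are reused as-is.
cache_k = torch.cat([cache_k, x_new @ W_k], dim=0)    # (11, d_head)
cache_v = torch.cat([cache_v, x_new @ W_v], dim=0)    # (11, d_head)

# Attention for the new token: one query against all cached keys.
q_new = x_new @ W_q                                    # (1, d_head)
weights = torch.softmax(q_new @ cache_k.T / math.sqrt(d_head), dim=-1)
out = weights @ cache_v                                # (1, d_head)
```

The per-token attention work grows linearly with the prefix instead of quadratically, which is where the speedup comes from.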
From the manual calculations:
- Why softmax gradients involve Jacobian matrices (see the check after this list)
- How layer normalization affects gradient flow
- Why AdamW uses both first and second moment estimates
- The actual shapes and values flowing through a transformer at each step
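The softmax Jacobian is a good example of the kind of step the walkthrough refuses to skip: the derivative of the i-th output with respect to the j-th input is s_i(δ_ij - s_j), so every output depends on every input. Here’s a quick pure-Python check of that formula with toy numbers of my own (not values from the walkthrough):

```python
import math

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

x = [1.0, 2.0, 0.5]
s = softmax(x)

# Analytic Jacobian: d s_i / d x_j = s_i * ((1 if i == j else 0) - s_j)
jac = [[s[i] * ((1.0 if i == j else 0.0) - s[j]) for j in range(3)] for i in range(3)]

# Numerical check with finite differences.
h = 1e-6
for j in range(3):
    bumped = list(x)
    bumped[j] += h
    s_bumped = softmax(bumped)
    for i in range(3):
        numeric = (s_bumped[i] - s[i]) / h
        assert abs(numeric - jac[i][j]) < 1e-4    # analytic and numeric agree
```

Because it’s plain Python, this is in the same spirit as the verification scripts the project ships alongside the worked calculations.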
Two Paths to Understanding
These projects represent two complementary approaches to learning:
The transformer implementation lets you build something real. You can train it on actual data, generate text, and use interpretability tools to understand what it learned. It’s practical and complete, albeit extremely rudimentary and not very effective.
The manual calculations force you to understand the mathematics deeply. You can’t hand-wave away a Jacobian matrix when you’re calculating every element by hand. It’s rigorous and foundational, and it involves quite a lot of decimals.
Together, they provide both breadth and depth. The transformer shows you what’s possible. The manual calculations show you why it works.
Getting Started
Both projects are on GitHub, but the documentation is the more readable entry point:
- Transformer - Full PyTorch implementation with training, generation, and interpretability
- Attention to Detail - Step-by-step manual calculations
If you’re interested in understanding how modern LLMs work, I’d suggest starting with the transformer project. Read through the core components in order: attention, embeddings, feedforward, blocks, model. Then try the manual calculations to see exactly what’s happening at each step.
These are learning resources, built to prioritize understanding over everything else. Hopefully you find them useful too!