Understanding Transformers: Two Approaches

I’ve spent the last week or so building two educational projects to understand how transformer models work. It started as an overview, but once I had the high-level picture, I wanted to dig deeper and see all the math. So now you can too: every calculation, every design choice, every gradient flowing backward through the network.

The first project is a complete transformer implementation in PyTorch. The second is a step-by-step walkthrough of every single calculation in training a tiny transformer model by hand. They’re companion pieces - one shows you how to build a working model, the other shows exactly what’s happening under the hood.

Why Build This?

Large language models like GPT, Claude, and others have become essential tools for developers and researchers. But understanding how they actually work requires getting your hands dirty with the mathematics and code.

I wanted to move beyond treating transformers as black boxes. So Claude, ironically, built them from scratch for me, documented every component, and calculated every derivative with its digital hand.

The Transformer Project

The transformer project is a complete decoder-only transformer (GPT architecture) built in PyTorch. It includes everything you need to train and understand a real language model (a sketch of one building block follows the list):

- Core Architecture
- Training Pipeline
- Text Generation
- Interpretability Tools
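
To give a flavor of the core architecture, here’s a minimal sketch of one decoder block, assuming the usual pre-norm GPT layout. The names and sizes here are illustrative, not the repo’s actual API:

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    """One pre-norm decoder block: masked self-attention + feedforward."""

    def __init__(self, d_model=64, n_heads=4):
        super().__init__()
        self.ln1 = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ln2 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),
            nn.GELU(),
            nn.Linear(4 * d_model, d_model),
        )

    def forward(self, x):
        # Causal mask: True marks positions a query may NOT attend to,
        # i.e. everything after itself.
        T = x.size(1)
        mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)
        h = self.ln1(x)
        attn_out, _ = self.attn(h, h, h, attn_mask=mask)
        x = x + attn_out               # residual around attention
        x = x + self.ff(self.ln2(x))   # residual around feedforward
        return x
```

A full GPT stacks several of these blocks between token/position embeddings and a final linear layer that projects back out to vocabulary logits.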

Every component includes extensive documentation explaining not just what it does, but why it’s designed that way. The code prioritizes clarity over performance - this is for learning, not production. Trying to finish a full training run on my AMD GPU has tested my patience, and I’ve had to restart multiple times, but soon I’ll probably have a checkpoint to share.

Attention to Detail

The attention-to-detail project takes a different approach. Instead of building a full model, it walks through every single calculation in training a tiny transformer:

What “by hand” means: the project shows every calculation step by step, from the forward pass through the loss to each gradient in the backward pass.
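
To illustrate the flavor of arithmetic involved, here’s a toy single-head attention forward pass in plain Python, with my own made-up numbers rather than the project’s actual worked example (and no causal mask, for brevity):

```python
import math

# Two tokens, two dimensions: illustrative query/key/value rows.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 1.0], [0.0, 1.0]]
V = [[2.0, 0.0], [0.0, 2.0]]
d_k = 2

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    exps = [math.exp(x) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

for i, q in enumerate(Q):
    # Scaled dot-product scores of this query against every key.
    scores = [dot(q, k) / math.sqrt(d_k) for k in K]
    weights = softmax(scores)
    # The output is the attention-weighted blend of the value rows.
    out = [sum(w * v[j] for w, v in zip(weights, V)) for j in range(2)]
    print(f"token {i}: scores={scores} weights={weights} out={out}")
```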

This isn’t about memorizing formulas. It’s about building intuition by seeing the actual numbers flow through the network. Every matrix multiplication is shown in full. Every Jacobian is derived. Every gradient is calculated explicitly.
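
One concrete example of the kind of derivation this requires: the Jacobian of the softmax, which every backward pass through an attention layer has to go through. For s = softmax(z), the standard result is:

```latex
% s_i = e^{z_i} / \sum_k e^{z_k}
\frac{\partial s_i}{\partial z_j} = s_i \,(\delta_{ij} - s_j)
% Every logit z_j moves every probability s_i, so the Jacobian is dense.
```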

The documentation includes interactive math rendering, color-coded matrices, and Python scripts you can run to verify every calculation yourself. No PyTorch, no NumPy.
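
In that spirit, here’s a small dependency-free check, my own sketch rather than one of the project’s scripts, that verifies the softmax Jacobian above against a finite-difference estimate:

```python
import math

def softmax(z):
    exps = [math.exp(x) for x in z]
    total = sum(exps)
    return [e / total for e in exps]

z = [0.5, -1.0, 2.0]
s = softmax(z)
eps = 1e-6

for i in range(len(z)):
    for j in range(len(z)):
        # Analytic entry: ds_i/dz_j = s_i * (delta_ij - s_j).
        analytic = s[i] * ((1.0 if i == j else 0.0) - s[j])
        # Central finite difference: nudge z_j, watch how s_i responds.
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += eps
        z_minus[j] -= eps
        numeric = (softmax(z_plus)[i] - softmax(z_minus)[i]) / (2 * eps)
        assert abs(analytic - numeric) < 1e-8, (i, j, analytic, numeric)

print("softmax Jacobian matches finite differences")
```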

What I Learned

From the transformer implementation:

From the manual calculations:

Two Paths to Understanding

These projects represent two complementary approaches to learning:

The transformer implementation lets you build something real. You can train it on actual data, generate text, and use interpretability tools to understand what it learned. It’s practical and complete, albeit extremely rudimentary and not very effective.
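
For a sense of what the generation side looks like, here’s a hedged sketch of autoregressive sampling, assuming a model whose forward pass returns next-token logits; the function and parameter names are illustrative, not the repo’s actual interface:

```python
import torch

@torch.no_grad()
def generate(model, tokens, max_new_tokens=50, temperature=1.0):
    """Sample one token at a time, feeding each back into the model."""
    for _ in range(max_new_tokens):
        logits = model(tokens)                    # assumed (batch, seq, vocab)
        logits = logits[:, -1, :] / temperature   # only the last position matters
        probs = torch.softmax(logits, dim=-1)
        next_token = torch.multinomial(probs, num_samples=1)
        tokens = torch.cat([tokens, next_token], dim=1)
    return tokens
```

Temperature controls how peaked the sampling distribution is: values below 1.0 make generation more deterministic, values above 1.0 more varied.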

The manual calculations force you to understand the mathematics deeply. You can’t hand-wave away a Jacobian matrix when you’re calculating every element by hand. It’s rigorous and foundational, and it involves quite a lot of decimals.

Together, they provide both breadth and depth. The transformer shows you what’s possible. The manual calculations show you why it works.

Getting Started

Both projects are on GitHub, though the documentation is the more readable place to start:

If you’re interested in understanding how modern LLMs work, I’d suggest starting with the transformer project. Read through the core components in order: attention, embeddings, feedforward, blocks, model. Then try the manual calculations to see exactly what’s happening at each step.

These are learning resources, built to prioritize understanding over everything else. Hopefully you find them useful too!