I’ve been interested in Artificial Intelligence, through many different lenses, for quite some time. As a science fiction fan and one-time author, it’s obviously a subject that’s been treated thoroughly from every angle. It’s good! It’s bad! It’s going to end humanity!
But it (“actual” AI) felt incredibly far away even just a few years ago. Back then, I did a bit of work in Text Classification and Object Detection, and the results were definitely mixed. On the textual front, it didn’t feel orders of magnitude different from ELIZA, and on the visual front, reliable detection was still hard to achieve.
Then in late 2022 ChatGPT came along and took the world by storm. This was more than classification, more than a fancy autocomplete: this was an assistant! The next sea change came when I used Claude Code for the first time. Reasoning and engineering knowledge combined to make a pretty remarkable coding agent.
While I still don’t think we are anywhere near AGI (happy to be proved wrong!), the sheer effectiveness of this new tech was remarkable and sparked my curiosity. I wanted to learn everything about it. I’ve got a strong math background; maybe with some hard work I could start to understand it at an amateur level.
Similar to my Rush programming language project, I decided to roll the dice and see if Claude could teach me. One cleverly built prompt later, Claude wrote the book I’m now happy to make available!
Just kidding. I think I’ve gone back and forth with Claude on every line in this booklet a few times. Primarily because I’m an outsider to this domain, so I need things explained more thoroughly, but also because the state of the art is still a bit imperfect: Claude says some crazy stuff sometimes!
Together we built a learning resource that walks through transformers, the modern marvel behind all of this, from the ground up: the math, the code, the training techniques, and even how to extend them into entirely new domains like image generation.
It’s organized into five sections:
Understanding Gradients starts with pure Python — no libraries, no shortcuts. Just the math and a transformer you build from scratch. Tokenization, embeddings, attention, backpropagation, the whole thing. If you’ve ever wondered what’s actually happening under the hood (and have heard of a Jacobian), this is where you start.
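To give a flavor of what “pure Python, no shortcuts” means, here’s a little sketch of my own (not code from the book): backpropagation by hand through a single neuron, using nothing but the chain rule.

```python
import math

# One neuron, one training example, gradients by hand.
# Forward pass: y = tanh(w*x + b), loss = (y - target)^2
x, target = 0.5, 1.0
w, b = 0.3, 0.1

z = w * x + b          # pre-activation
y = math.tanh(z)       # activation
loss = (y - target) ** 2

# Backward pass: apply the chain rule one step at a time.
dloss_dy = 2 * (y - target)
dy_dz = 1 - y ** 2     # derivative of tanh
dloss_dz = dloss_dy * dy_dz
dloss_dw = dloss_dz * x
dloss_db = dloss_dz * 1.0

# One gradient-descent step nudges the loss downward.
lr = 0.1
w -= lr * dloss_dw
b -= lr * dloss_db
```

The book does this for a whole transformer, where the bookkeeping is much hairier, but every step is exactly this kind of chain-rule arithmetic.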
Building a Transformer moves into PyTorch and shows you how to build a real one. This is where to start if you want to dabble in building a toy machine and aren’t too worried about the theory underpinning it all. Positional encodings, causal masking, multi-head attention, KV-cache optimization. The stuff you’d actually use in production.
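Causal masking is one of those ideas that sounds fancier than it is. Here’s my own plain-Python illustration (the book’s version uses PyTorch tensors): position i is only allowed to attend to positions up to and including i, which you enforce by setting future scores to negative infinity before the softmax.

```python
import math

def causal_mask(n):
    # mask[i][j] is True where position i is allowed to attend to position j.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    # Disallowed positions get -inf, so they come out of the softmax with weight 0.
    masked = [s if ok else float("-inf") for s, ok in zip(scores, mask_row)]
    mx = max(masked)
    exps = [math.exp(s - mx) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Attention weights for position 1: only positions 0 and 1 are visible.
weights = masked_softmax([0.2, 1.5, -0.3, 0.9], mask[1])
```

In PyTorch this becomes a couple of tensor operations, but the logic is identical.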
Fine-Tuning a Transformer is where things start to get really useful. How do you take a base model and turn it into something you’d actually want to talk to? Supervised fine-tuning, LoRA, reward models, RLHF, DPO: all the techniques that turn a next-token predictor into an assistant that follows instructions. Don’t worry, all those acronyms will be explained.
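As a tiny teaser of the kind of thing that section unpacks, here’s my own miniature sketch of the DPO objective (not the book’s code): given log-probabilities of a chosen and a rejected response under the policy and under a frozen reference model, the loss rewards widening the gap in favor of the chosen one.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    # DPO in one line: -log(sigmoid(beta * (policy margin - reference margin)))
    margin = beta * ((logp_chosen - ref_chosen) - (logp_rejected - ref_rejected))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# When the policy prefers the chosen response more than the reference does,
# the loss is lower than when the preference is reversed.
good = dpo_loss(-5.0, -9.0, -6.0, -6.0)
bad = dpo_loss(-9.0, -5.0, -6.0, -6.0)
```

The surprising part, which the book walks through properly, is that this simple classification-style loss replaces the whole RL loop of RLHF.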
Reasoning with Transformers digs into some more advanced stuff. Chain-of-thought prompting, tree search, process reward models, reasoning distillation. The techniques that let models tackle complex, multi-step problems.
From Noise to Images takes everything we’ve learned and applies it to a completely different domain. Flow matching, diffusion transformers, text-to-image conditioning, latent diffusion. It turns out the same core ideas generalize surprisingly well from text to images.
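The core training trick in flow matching is simple enough to sketch in a few lines. This is my own one-dimensional illustration of the standard linear (rectified-flow) formulation, not code from the book: interpolate between a noise sample and a data point, and the regression target for the model is the constant velocity between them.

```python
import random

random.seed(0)

def sample_training_pair():
    x0 = random.gauss(0.0, 1.0)   # noise sample
    x1 = 3.0                      # stand-in "data" point
    t = random.random()           # random time in [0, 1]
    xt = (1 - t) * x0 + t * x1    # point on the straight path from noise to data
    velocity = x1 - x0            # the target the network learns to predict at (xt, t)
    return t, xt, velocity

t, xt, v = sample_training_pair()
```

Generation then amounts to starting from noise and following the learned velocity field from t = 0 to t = 1.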
The whole thing is live at zhubert.com/intro-to-transformers. It’s a work in progress — I’m still learning, still refining, still finding better ways to explain things. But it’s already substantial, and I wanted to share it.
If you’ve been curious about how modern AI actually works, this might be a good place to start. Not because I’m an expert (I’m not), but because I’ve spent a lot of time trying to understand this stuff, and I think I’ve found some good ways to explain it.