Try It Yourself
Ready to train and experiment with your own transformer? The complete implementation is available on GitHub with everything you need to get started.
Quick Start
```bash
git clone https://github.com/zhubert/transformer.git
cd transformer

# Install dependencies with uv
make install

# Launch interactive CLI - easiest way to get started!
python main.py
```
What You Can Do
Train a Model
Train on realistic data (FineWeb 10BT) with modern techniques:
```bash
# Full training (100M tokens/epoch)
make train

# Quick training (10M tokens/epoch, smaller model)
make train-quick
```

Features:
- Gradient accumulation for stable training
- Train/val split for overfitting detection
- Auto-detect CUDA/MPS/CPU
- Checkpointing and resume
- Real-time metrics
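The gradient accumulation listed above is what makes a large effective batch size fit in memory: gradients from several small micro-batches are summed before a single optimizer step. Here is a minimal sketch of that idea, plus device auto-detection; the function and variable names are illustrative, not the repo's exact training loop:

```python
import torch

# Auto-detect the best available device: CUDA, then Apple MPS, then CPU
device = ("cuda" if torch.cuda.is_available()
          else "mps" if torch.backends.mps.is_available()
          else "cpu")

def accumulation_step(model, optimizer, loss_fn, micro_batches, accum_steps):
    """Sum gradients over `accum_steps` micro-batches, then take one optimizer step.

    The effective batch size is accum_steps times the micro-batch size,
    while peak memory stays at the micro-batch level.
    """
    optimizer.zero_grad()
    running_loss = 0.0
    for inputs, targets in micro_batches:              # accum_steps small batches
        logits = model(inputs.to(device))              # [batch, seq, vocab]
        loss = loss_fn(logits.view(-1, logits.size(-1)),
                       targets.to(device).view(-1))
        (loss / accum_steps).backward()                # scale so the summed grads
        running_loss += loss.item()                    # match one big batch
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return running_loss / accum_steps
```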
Generate Text
Generate creative text with various sampling strategies:
```bash
make generate
```

Options:
- Greedy (deterministic)
- Top-k sampling
- Top-p (nucleus) sampling
- Temperature control
- KV-cache optimization (2-50x faster)
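For intuition, the options above compose in a single decoding step roughly like this: divide the logits by the temperature, optionally keep only the top-k logits, optionally keep only the smallest set of tokens whose probability mass reaches top-p, then sample. This is a generic sketch over one logits vector, not the repository's exact generation code:

```python
import torch

def sample_next_token(logits, temperature=1.0, top_k=0, top_p=1.0):
    """Pick the next token id from a [vocab_size] logits vector."""
    if temperature == 0.0:                        # greedy: just take the argmax
        return int(torch.argmax(logits))
    logits = logits / temperature                 # <1 sharpens, >1 flattens
    if top_k > 0:                                 # keep only the k largest logits
        kth = torch.topk(logits, top_k).values[-1]
        logits[logits < kth] = float("-inf")
    if top_p < 1.0:                               # nucleus: smallest set with mass >= top_p
        sorted_logits, sorted_idx = torch.sort(logits, descending=True)
        probs = torch.softmax(sorted_logits, dim=-1)
        cumulative = torch.cumsum(probs, dim=-1)
        cutoff = cumulative - probs > top_p       # tokens past the nucleus
        sorted_logits[cutoff] = float("-inf")
        logits = torch.full_like(logits, float("-inf")).scatter(
            0, sorted_idx, sorted_logits)
    probs = torch.softmax(logits, dim=-1)
    return int(torch.multinomial(probs, num_samples=1))
```

The KV-cache speedup is orthogonal to the sampling strategy: it comes from reusing the attention keys and values already computed for earlier tokens instead of re-encoding the whole prefix at every step.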
Explore Interpretability
Understand what your model learned:
```bash
# Logit lens - see predictions evolve by layer
python main.py interpret logit-lens checkpoints/model.pt --demo
```
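The logit lens takes the residual stream after each layer, runs it through the final layer norm and the unembedding matrix, and reads it as if it were the last layer's output, so you can watch the prediction converge with depth. A rough sketch of that idea, assuming you have already collected per-layer hidden states (the names below are illustrative, not the repo's interface):

```python
import torch

@torch.no_grad()
def logit_lens(hidden_states, final_norm, unembed, tokenizer, top=3):
    """Decode each layer's residual stream as if it were the final layer.

    hidden_states: list of [seq, d_model] tensors, one per layer
    final_norm:    the model's final LayerNorm
    unembed:       [d_model, vocab_size] output projection
    """
    for layer, h in enumerate(hidden_states):
        logits = final_norm(h[-1]) @ unembed       # project last position to vocab
        probs = torch.softmax(logits, dim=-1)
        top_probs, top_ids = probs.topk(top)
        guesses = [(tokenizer.decode([int(i)]), round(float(p), 3))
                   for i, p in zip(top_ids, top_probs)]
        print(f"layer {layer:2d}: {guesses}")
```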
```bash
# Attention analysis - discover specialized heads
python main.py interpret attention checkpoints/model.pt
```
```bash
# Induction heads - find in-context learning circuits
python main.py interpret induction-heads checkpoints/model.pt
```
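An induction head is one that, on a repeated sequence, attends from each token back to the token that followed its previous occurrence (the "A B ... A → B" pattern behind in-context learning). A common way to score a head for this behaviour, sketched for a sequence whose second half repeats its first half:

```python
def induction_score(attn, seq_len):
    """Average attention mass on the 'token after the previous occurrence'.

    attn:    [seq_len, seq_len] attention pattern for one head on a sequence
             built as rand_tokens + rand_tokens (second half repeats the first)
    Returns a score near 1.0 for a strong induction head, near 0.0 otherwise.
    """
    half = seq_len // 2
    # For a query position i in the second half, the matching earlier token
    # sits at i - half, so an induction head should attend to i - half + 1.
    scores = [attn[i, i - half + 1].item() for i in range(half, seq_len)]
    return sum(scores) / len(scores)
```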
```bash
# Activation patching - causal experiments
python main.py interpret patch checkpoints/model.pt \
  --clean "The Eiffel Tower is in" \
  --corrupted "The Empire State is in" \
  --target "Paris"
```
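Activation patching asks a causal question: run the model on the clean and corrupted prompts, then re-run the corrupted prompt while swapping in one layer's activation cached from the clean run; if the target token's logit recovers, that activation carries the relevant information. A bare-bones sketch using PyTorch forward hooks (the module handle and output shape are assumptions, not the repo's actual interface):

```python
import torch

@torch.no_grad()
def patch_layer_output(model, layer_module, clean_ids, corrupted_ids, target_id):
    """Return the target token's logit on the corrupted prompt, with and
    without patching in the clean run's output of `layer_module`."""
    cache = {}

    def save_hook(module, inputs, output):
        cache["clean"] = output                  # stash the clean activation

    def patch_hook(module, inputs, output):
        return cache["clean"]                    # overwrite with the clean one

    handle = layer_module.register_forward_hook(save_hook)
    model(clean_ids)                             # clean run fills the cache
    handle.remove()

    baseline = model(corrupted_ids)[0, -1, target_id].item()

    handle = layer_module.register_forward_hook(patch_hook)
    patched = model(corrupted_ids)[0, -1, target_id].item()
    handle.remove()

    return baseline, patched                     # recovery = patched - baseline
```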
Evaluate Models
Compare perplexity across checkpoints:
```bash
python main.py evaluate checkpoints/model.pt
```
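Perplexity is the exponential of the average per-token cross-entropy on held-out text, so comparing checkpoints means comparing exp(mean NLL) on the same data. A minimal sketch (the batch format and model call are assumptions, not the evaluate command's internals):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, batches):
    """Compute perplexity = exp(mean cross-entropy per token) over `batches`,
    where each batch is an (inputs, targets) pair of [batch, seq] token ids."""
    model.eval()
    total_nll, total_tokens = 0.0, 0
    for inputs, targets in batches:
        logits = model(inputs)                               # [batch, seq, vocab]
        nll = F.cross_entropy(logits.view(-1, logits.size(-1)),
                              targets.view(-1), reduction="sum")
        total_nll += nll.item()
        total_tokens += targets.numel()
    return math.exp(total_nll / total_tokens)
```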
Learn the Code
The codebase is designed for learning:
- Comprehensive documentation in every file
- Inline comments explaining the “why”
- Mathematical formulas and complexity analysis
- No magic - everything implemented from scratch
- Modern best practices (Pre-LN, ALiBi, GELU)
Key files to explore:
- src/transformer/attention.py - Attention mechanism with KV-cache
- src/transformer/embeddings.py - ALiBi, RoPE, and learned embeddings
- src/transformer/model.py - Complete decoder-only transformer
- src/transformer/training_utils.py - Gradient accumulation
- src/transformer/interpretability/ - All interpretability tools
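Of those techniques, ALiBi is the quickest to see in code: instead of position embeddings, each head adds a fixed linear penalty to its attention scores proportional to how far back the key is, with a different slope per head. Here is a sketch of how those biases are typically constructed; it is illustrative only, and the real version lives in src/transformer/embeddings.py:

```python
import torch

def alibi_bias(num_heads, seq_len):
    """Build the [num_heads, seq_len, seq_len] additive attention bias for ALiBi.

    Head h gets slope 2^(-8(h+1)/num_heads), the geometric schedule from the
    ALiBi paper for power-of-two head counts; the bias grows more negative the
    farther back a key position is from the query position.
    """
    slopes = torch.tensor([2 ** (-8.0 * (h + 1) / num_heads)
                           for h in range(num_heads)])
    positions = torch.arange(seq_len)
    # distance[i, j] = how many tokens back key j is from query i (0 on the diagonal)
    distance = (positions[:, None] - positions[None, :]).clamp(min=0)
    return -slopes[:, None, None] * distance     # added to scores before softmax
```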
Next Steps
- Train your first model with make train-quick (~30 minutes on M1 Mac)
- Generate text and see what it learns
- Explore interpretability tools to understand the internals
- Read the code - every file is documented for learning
- Experiment - try different architectures, datasets, hyperparameters
Resources
- GitHub Repository
- Full Documentation
- Implementation Guide
- Test Suite
- Original Paper: “Attention is All You Need”
- Attention to Detail - Calculate a full forward pass of a transformer by hand to deeply understand the mechanics
An open-source project under the MIT License, built with PyTorch for educational purposes: to understand the architecture that powers modern AI.