
Try It Yourself

Ready to train and experiment with your own transformer? The complete implementation is available on GitHub with everything you need to get started.

git clone https://github.com/zhubert/transformer.git
cd transformer
# Install dependencies with uv
make install
# Launch interactive CLI - easiest way to get started!
python main.py

[Screenshot: the interactive CLI menu]

Train on real web data (the FineWeb 10BT sample) with modern training techniques:

# Full training (100M tokens/epoch)
make train
# Quick training (10M tokens/epoch, smaller model)
make train-quick

Training features:

  • Gradient accumulation for stable training (see the sketch after this list)
  • Train/val split for overfitting detection
  • Automatic device detection (CUDA/MPS/CPU)
  • Checkpointing and resume
  • Real-time training metrics

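If you want a feel for what gradient accumulation is doing inside the training loop, here is a minimal PyTorch sketch of the idea; the tiny model, loss, and accum_steps value are stand-ins, not the repo's actual training code:

import torch

# Minimal gradient-accumulation sketch (illustrative only, not the repo's training loop).
# Gradients from several small micro-batches are summed before a single optimizer step,
# which simulates a larger effective batch size without the extra memory.
accum_steps = 8                                # hypothetical setting
model = torch.nn.Linear(16, 16)                # stand-in for the transformer
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = torch.nn.MSELoss()

optimizer.zero_grad()
for step in range(32):
    x = torch.randn(4, 16)                     # stand-in micro-batch
    loss = loss_fn(model(x), x)
    (loss / accum_steps).backward()            # scale so summed gradients average correctly
    if (step + 1) % accum_steps == 0:
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
        optimizer.zero_grad()
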
Generate creative text with various sampling strategies:

make generate

Sampling options:

  • Greedy (deterministic)
  • Top-k sampling
  • Top-p (nucleus) sampling (see the sketch after this list)
  • Temperature control
  • KV-cache optimization (2-50x faster generation)

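To make the sampling options concrete, here is a minimal sketch of top-p (nucleus) sampling with temperature; the function name, thresholds, and vocabulary size are illustrative, not the repo's actual API:

import torch

def sample_top_p(logits, temperature=0.8, top_p=0.9):
    # Temperature rescales logits: below 1.0 sharpens the distribution, above 1.0 flattens it.
    probs = torch.softmax(logits / temperature, dim=-1)
    # Keep the smallest set of tokens whose cumulative probability exceeds top_p.
    sorted_probs, sorted_idx = torch.sort(probs, descending=True)
    cumulative = torch.cumsum(sorted_probs, dim=-1)
    outside_nucleus = cumulative - sorted_probs > top_p
    sorted_probs[outside_nucleus] = 0.0
    sorted_probs /= sorted_probs.sum()          # renormalize over the nucleus
    choice = torch.multinomial(sorted_probs, num_samples=1)
    return sorted_idx[choice]

logits = torch.randn(50257)                     # stand-in vocabulary-sized logit vector
print(sample_top_p(logits).item())
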
Understand what your model learned:

# Logit lens - see predictions evolve by layer
python main.py interpret logit-lens checkpoints/model.pt --demo
# Attention analysis - discover specialized heads
python main.py interpret attention checkpoints/model.pt
# Induction heads - find in-context learning circuits
python main.py interpret induction-heads checkpoints/model.pt
# Activation patching - causal experiments
python main.py interpret patch checkpoints/model.pt \
  --clean "The Eiffel Tower is in" \
  --corrupted "The Empire State is in" \
  --target "Paris"

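The logit lens itself is conceptually simple: project the residual stream after each layer through the output unembedding and see which token the model would already predict at that depth. A toy sketch of the idea, with made-up layer and vocabulary sizes rather than the repo's actual modules:

import torch

# Toy logit-lens sketch: push the residual stream after each "layer" through the
# output unembedding and watch the top prediction change with depth.
# Module names and sizes are illustrative, not the repo's actual classes.
vocab, d_model, n_layers = 100, 32, 4
unembed = torch.nn.Linear(d_model, vocab, bias=False)
layers = [torch.nn.Linear(d_model, d_model) for _ in range(n_layers)]

h = torch.randn(1, d_model)                     # stand-in residual stream at one position
for i, layer in enumerate(layers):
    h = h + layer(h)                            # residual update, as in a transformer block
    top_token = unembed(h).argmax(dim=-1).item()
    print(f"after layer {i}: predicted token id {top_token}")
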
Compare perplexity across checkpoints:

python main.py evaluate checkpoints/model.pt

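Perplexity here is the exponential of the mean per-token cross-entropy on held-out text, so lower means the model finds the data less surprising. A minimal sketch of the computation, using random stand-in tensors rather than real evaluation data:

import torch
import torch.nn.functional as F

# Perplexity = exp(mean cross-entropy over held-out tokens); lower is better.
# The tensors below are random stand-ins for real model outputs and labels.
logits = torch.randn(8, 128, 50257)             # (batch, seq, vocab) model outputs
targets = torch.randint(0, 50257, (8, 128))     # next-token labels
loss = F.cross_entropy(logits.reshape(-1, 50257), targets.reshape(-1))
print(f"perplexity: {torch.exp(loss).item():.1f}")
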
The codebase is designed for learning:

  • Comprehensive documentation in every file
  • Inline comments explaining the “why”
  • Mathematical formulas and complexity analysis
  • No magic - everything implemented from scratch
  • Modern best practices: Pre-LN, ALiBi, and GELU (ALiBi is sketched after this list)

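As one example of those practices, ALiBi drops positional embeddings and instead adds a fixed, head-specific penalty to attention scores that grows with the distance between query and key. A minimal sketch of that bias matrix, with made-up head counts and sequence lengths:

import torch

# ALiBi bias sketch: attention scores are penalized in proportion to how far back a key
# sits from the query, with a different fixed slope per head. Sizes are illustrative.
n_heads, seq_len = 4, 8
slopes = torch.tensor([2.0 ** (-8.0 * (h + 1) / n_heads) for h in range(n_heads)])
positions = torch.arange(seq_len)
distance = positions[None, :] - positions[:, None]          # key index minus query index
bias = slopes[:, None, None] * distance[None, :, :]         # (heads, seq, seq), non-positive
bias = bias.masked_fill(distance > 0, float("-inf"))        # causal mask: no future keys
# `bias` is added to the raw attention scores before the softmax.
print(bias[0])
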
Key files to explore:

  • src/transformer/attention.py - Attention mechanism with KV-cache (see the sketch after this list)
  • src/transformer/embeddings.py - ALiBi, RoPE, and learned embeddings
  • src/transformer/model.py - Complete decoder-only transformer
  • src/transformer/training_utils.py - Gradient accumulation
  • src/transformer/interpretability/ - All interpretability tools
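
The KV-cache mentioned above is what makes generation fast: each step's key and value projections are stored so a new token only attends over cached tensors instead of recomputing the whole prefix. A minimal single-head sketch with stand-in shapes, not the repo's actual attention.py interface:

import torch
import torch.nn.functional as F

# Single-head KV-cache sketch: append each new token's key/value to a cache and attend
# over everything cached so far, instead of recomputing the whole prefix each step.
d_model = 16
Wq, Wk, Wv = (torch.nn.Linear(d_model, d_model) for _ in range(3))
k_cache, v_cache = [], []

for step in range(5):                           # one new token per step
    x = torch.randn(1, 1, d_model)              # stand-in embedding of the new token
    q, k, v = Wq(x), Wk(x), Wv(x)
    k_cache.append(k)
    v_cache.append(v)
    keys = torch.cat(k_cache, dim=1)            # (1, step + 1, d_model)
    values = torch.cat(v_cache, dim=1)
    scores = q @ keys.transpose(1, 2) / d_model ** 0.5
    out = F.softmax(scores, dim=-1) @ values    # attention output for the new token
    print(step, out.shape)
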
Suggested path:

  1. Train your first model with make train-quick (~30 minutes on an M1 Mac)
  2. Generate text and see what it learns
  3. Explore interpretability tools to understand the internals
  4. Read the code - every file is documented for learning
  5. Experiment - try different architectures, datasets, hyperparameters

The project is open source under the MIT License for educational purposes, built with PyTorch to help you understand the architecture that powers modern AI.