Language models generate one token at a time, left to right, with no scratch paper. For simple factual recall, this works fine. But for multi-step problems—math, logic, planning—the model needs to compute the answer, not just recall it.
The breakthrough: if models generate intermediate reasoning steps before the final answer, their accuracy on hard problems improves dramatically. A model that “thinks out loud” can solve problems that stump models forced to answer immediately.
This section covers the techniques that make this possible.
The Problem with Instant Answers¶
Transformers are autoregressive. They generate one token at a time, left to right, based on everything that came before.
When asked a question, the model must commit to its answer immediately. There’s no scratch paper. No thinking time. Just token after token.
For simple questions, this works:
“What’s the capital of France?” → “Paris” ✓
“Who wrote Hamlet?” → “Shakespeare” ✓
But for anything requiring multiple steps, performance degrades quickly.
Consider: “If a train leaves Chicago at 9am going 60mph, and another train leaves New York at 10am going 80mph, when do they meet?”
The model can’t just know the answer. It needs to:
1. Figure out the distance between the cities
2. Account for the time offset
3. Set up the equations
4. Solve them
If it tries to output the answer immediately, it’s guessing.
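To see what those steps look like as actual computation, here is a tiny worked version in plain Python. The roughly 790-mile rail distance between Chicago and New York is an assumption for illustration; the point is that the answer falls out of a short chain of calculations, none of which can be skipped.

```python
# Worked version of the train problem. The ~790-mile Chicago-New York
# distance is an assumed figure, used only for illustration.
distance = 790                # miles between the cities (assumption)
speed_a, speed_b = 60, 80     # mph for each train
head_start = 1.0              # train A departs an hour earlier (9am vs 10am)

# Step 1: how far train A gets before train B even starts
gap_at_10am = distance - speed_a * head_start    # 790 - 60 = 730 miles

# Step 2: the trains close the remaining gap at the sum of their speeds
closing_speed = speed_a + speed_b                # 140 mph

# Step 3: time after 10am until they meet
hours_after_10am = gap_at_10am / closing_speed   # ~5.21 hours

meet_time = 10 + hours_after_10am                # ~15.2, i.e. a bit after 3pm
print(f"They meet about {hours_after_10am:.2f} hours after 10am "
      f"(around {int(meet_time)}:{int(meet_time % 1 * 60):02d}).")
```

None of these intermediate quantities appears in the question itself, which is exactly why an instant answer amounts to a guess.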
The Solution: Think Out Loud¶
The key insight that changed the field:
On hard, multi-step problems, the model’s reasoning ability is often bottlenecked by output length, not parameter count.
A smaller model that “thinks” for 1,000 tokens can outperform a larger model that answers in 10 tokens.
When the model generates intermediate reasoning steps, it can:
- Break down complex problems into manageable pieces
- Use its own output as working memory (transformers don’t have scratch space—but they can read what they’ve already written)
- Catch and correct mistakes before committing to a final answer
This is called Chain-of-Thought reasoning. It’s the foundation of everything in this section.
Without CoT:
Q: What's 17 × 24?
A: 408 ← Just guessing. Often wrong.
With CoT:
Q: What's 17 × 24?
A: Let me work through this step by step.
17 × 24 = 17 × (20 + 4)
= 17 × 20 + 17 × 4
= 340 + 68
= 408 ← Worked it out. Actually correct!

Same model. Same parameters. But by generating intermediate steps, it can actually compute the answer instead of guessing.
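In practice, the difference is often nothing more than the prompt. Here is a minimal sketch of zero-shot CoT using the Hugging Face `transformers` API; the model name is a placeholder (any small instruction-tuned causal LM will do), and the generation settings are illustrative rather than tuned.

```python
# Minimal zero-shot CoT sketch: same model, two prompts.
# The model name is a placeholder assumption, not a recommendation.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2.5-0.5B-Instruct"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def complete(prompt, max_new_tokens):
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Return only the newly generated tokens, not the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

question = "What's 17 × 24?"

# Without CoT: the model must commit to a number almost immediately
print(complete(f"Q: {question}\nA:", max_new_tokens=8))

# With CoT: the Kojima et al. trigger phrase gives it room to compute first
print(complete(f"Q: {question}\nA: Let's think step by step.", max_new_tokens=200))
```

With a model this small the arithmetic may still come out wrong; the contrast in how the two completions unfold is what matters here.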
The Paradigm Shift: Test-Time Compute¶
For years, the recipe for better AI was simple: train bigger models on more data.
More parameters → better performance. More training data → better performance.
But we’re hitting walls. Data is finite. Training compute is expensive. Diminishing returns set in.
A new idea has emerged: test-time compute scaling.
Instead of making the model bigger, let it think longer.
The math is compelling. Researchers at Google found that on reasoning problems, using extra compute at inference time can outperform a model that’s 14x larger. (That’s the “Scaling LLM Test-Time Compute” paper from 2024.)
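You can see the simplest version of this trade without touching the model’s weights: sample several independent reasoning chains and take a majority vote over their final answers (self-consistency, covered properly in notebook 02). A minimal sketch, assuming a hypothetical `generate_answer` function that samples one chain at nonzero temperature and extracts its final answer:

```python
# Sketch of test-time compute scaling via majority voting (self-consistency).
# `generate_answer` is a hypothetical helper: it samples one reasoning chain
# and returns just the final answer parsed from it.
from collections import Counter

def self_consistency(question, generate_answer, n_samples=8):
    """Sample n chains, then majority-vote over their final answers."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    best, votes = Counter(answers).most_common(1)[0]
    return best, votes / n_samples

# Doubling n_samples doubles inference compute and, on reasoning benchmarks,
# typically buys accuracy, with zero change to the model's parameters.
```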
This is the approach powering models like OpenAI’s o1, DeepSeek-R1, and Google’s Gemini 2.0 Flash Thinking. They don’t just generate answers—they reason first.
We’re going to build it ourselves.
What We’ll Build¶
This section takes you from prompting tricks all the way to training your own reasoning model.
┌─────────────────────────────────────────────────────────────────────┐
│ PROMPTING-BASED REASONING │
│ No training required — just clever prompts │
├─────────────────────────────────────────────────────────────────────┤
│ Chain-of-Thought │ "Let's think step by step" │
│ Self-Consistency │ Sample many chains, vote on the answer │
│ Tree of Thoughts │ Explore multiple paths, backtrack │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ VERIFICATION & SEARCH │
│ Train a reward model to guide reasoning │
├─────────────────────────────────────────────────────────────────────┤
│ Process Reward Model │ Score each reasoning step │
│ Best-of-N Sampling │ Generate N solutions, pick the best │
│ Monte Carlo Search │ Tree search with learned heuristics │
└─────────────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────┐
│ TRAINING REASONING MODELS │
│ RL-based techniques for learning to reason │
├─────────────────────────────────────────────────────────────────────┤
│ Budget Forcing │ Force longer thinking with "Wait" tokens │
│ GRPO │ RL without a critic (DeepSeek's approach) │
│ Distillation │ Transfer reasoning to smaller models │
└─────────────────────────────────────────────────────────────────────┘

Each notebook builds on the last. By the end, you’ll understand exactly how models like o1 and DeepSeek-R1 work—not just conceptually, but with working code.
Scope and Limitations¶
OpenAI’s o1 was trained on massive amounts of compute with carefully curated data. We’re not going to replicate that.
What we are going to do:
- Understand the principles — What makes these techniques work?
- Implement them from scratch — Real code, not hand-wavy pseudocode
- See them in action — On problems small enough to run on your laptop
- Build intuition — So you can apply these ideas to your own projects
The goal is understanding, not production deployment.
Prerequisites¶
This section assumes you’ve worked through the earlier parts of this book, or have equivalent knowledge:
- Transformers — How attention and generation work
- Fine-tuning — SFT, reward models, the basics of RLHF
- PyTorch — Comfortable with tensors, autograd, training loops
If you’re coming from the “Fine-Tuning a Transformer” section, you’re in great shape. We’ll build directly on those concepts—especially reward modeling, which becomes process reward modeling in this context.
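If that connection feels abstract, here is its shape in code: an outcome reward model from the fine-tuning section scores a whole response once, while a process reward model scores the response step by step. This is a hypothetical sketch (`score_fn` stands in for a trained scalar-head model), not the implementation we’ll build in notebook 04.

```python
# Hypothetical sketch of the shift from outcome to process reward models.
# `score_fn` stands in for a trained model that maps text to a scalar score.

def outcome_reward(score_fn, question, solution):
    # Fine-tuning section: one score for the entire response.
    return score_fn(question + solution)

def process_reward(score_fn, question, steps):
    # This section: one score per reasoning step, conditioned on its prefix.
    prefix, scores = question, []
    for step in steps:
        prefix += step
        scores.append(score_fn(prefix))
    return scores
```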
# Let's make sure we have what we need
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import matplotlib.pyplot as plt
print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")
if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
else:
    print("Running on CPU (slower but still works!)")
print("\nAll set! Let's teach some transformers to think.")

PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX
All set! Let's teach some transformers to think.
The Roadmap¶
| Notebook | Topic | Key Idea |
|---|---|---|
| 01 | Chain-of-Thought | Prompting models to show their work |
| 02 | Self-Consistency | Sampling multiple chains, majority voting |
| 03 | Tree of Thoughts | Exploring and pruning reasoning paths |
| 04 | Process Reward Models | Scoring individual reasoning steps |
| 05 | Best-of-N with Verification | Using PRMs to select best solutions |
| 06 | Monte Carlo Tree Search | Search algorithms for reasoning |
| 07 | Budget Forcing | Controlling reasoning length with “Wait” tokens |
| 08 | GRPO Training | RL without a critic (DeepSeek’s method) |
| 09 | Reasoning Distillation | Transferring reasoning to smaller models |
Each notebook is self-contained, but they build on each other conceptually. Going in order is recommended.
Let’s start with the technique that kicked off this field: Chain-of-Thought prompting.
References¶
This section draws on a lot of recent research. Here are the key papers if you want to go deeper:
Foundational:

- Wei et al. (2022) — Chain-of-Thought Prompting Elicits Reasoning in Large Language Models
- Kojima et al. (2022) — Large Language Models are Zero-Shot Reasoners (the “Let’s think step by step” paper)

Verification:

- Lightman et al. (2023) — Let’s Verify Step by Step (Process Reward Models)
- Wang et al. (2023) — Math-Shepherd: Verify and Reinforce LLMs Step-by-step without Human Annotations

Search:

- Yao et al. (2023) — Tree of Thoughts: Deliberate Problem Solving with Large Language Models

Training Reasoning Models:

- OpenAI (2024) — Learning to Reason with LLMs (o1 announcement)
- DeepSeek (2025) — DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
- Muennighoff et al. (2025) — s1: Simple Test-Time Scaling

Scaling:

- Snell et al. (2024) — Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Don’t worry about reading all of these. We’ll cover the key ideas as we go.