Reasoning with Transformers

Language models generate one token at a time, left to right, with no scratch paper. For simple factual recall, this works fine. But for multi-step problems—math, logic, planning—the model needs to compute the answer, not just recall it.

The breakthrough: if models generate intermediate reasoning steps before the final answer, their accuracy on hard problems improves dramatically. A model that “thinks out loud” can solve problems that stump models forced to answer immediately.

This section covers the techniques that make this possible.

The Problem with Instant Answers

Transformers are autoregressive. They generate one token at a time, left to right, based on everything that came before.

When asked a question, the model must commit to its answer immediately. There’s no scratch paper. No thinking time. Just token after token.
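To make that loop concrete, here is a minimal greedy-decoding sketch. It uses "gpt2" purely as a stand-in; any causal LM from transformers behaves the same way, appending one token to the prefix at a time.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "gpt2"  # stand-in; any small causal LM works for this illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Q: What's the capital of France?\nA:"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Greedy decoding: at each step the model sees only the prompt plus the tokens
# it has already emitted -- that growing prefix is its entire working memory.
for _ in range(10):
    with torch.no_grad():
        logits = model(input_ids).logits
    next_token = logits[:, -1, :].argmax(dim=-1, keepdim=True)  # most likely next token
    input_ids = torch.cat([input_ids, next_token], dim=-1)      # append and continue

print(tokenizer.decode(input_ids[0]))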

For simple questions, this works:

  • “What’s the capital of France?” → “Paris” ✓

  • “Who wrote Hamlet?” → “Shakespeare” ✓

But for anything requiring multiple steps, performance degrades quickly.

Consider: “If a train leaves Chicago at 9am going 60mph, and another train leaves New York at 10am going 80mph, when do they meet?”

The model can’t just know the answer. It needs to:

  1. Figure out the distance between the cities

  2. Account for the time offset

  3. Set up the equations

  4. Solve them

If it tries to output the answer immediately, it’s guessing.
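Worked through properly, the arithmetic looks like this. The question never states the distance between the cities, so the ~790 miles below is an assumption for illustration only:

# Assumed Chicago-New York distance; the question itself doesn't specify one.
distance = 790                      # miles (assumption)

# Train A leaves at 9am at 60 mph; train B leaves at 10am at 80 mph.
# Let t = hours after 9am when they meet:
#     60*t + 80*(t - 1) = distance   ->   140*t = distance + 80
t = (distance + 80) / (60 + 80)     # hours after 9am
hh, mm = 9 + int(t), round((t % 1) * 60)
print(f"They meet about {t:.2f} hours after 9am, around {hh - 12}:{mm:02d} pm")
# -> They meet about 6.21 hours after 9am, around 3:13 pm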

The Solution: Think Out Loud

The key insight that changed the field:

The model’s reasoning ability is bottlenecked by output length, not parameter count.

A smaller model that “thinks” for 1,000 tokens can outperform a larger model that answers in 10 tokens.

When the model generates intermediate reasoning steps, it can:

  1. Break down complex problems into manageable pieces

  2. Use its own output as working memory (transformers don’t have scratch space—but they can read what they’ve already written)

  3. Catch and correct mistakes before committing to a final answer

This is called Chain-of-Thought reasoning. It’s the foundation of everything in this section.

Without CoT:
Q: What's 17 × 24?
A: 408  ← Just guessing. Often wrong.

With CoT:
Q: What's 17 × 24?
A: Let me work through this step by step.
   17 × 24 = 17 × (20 + 4)
           = 17 × 20 + 17 × 4
           = 340 + 68
           = 408  ← Worked it out. Actually correct!

Same model. Same parameters. But by generating intermediate steps, it can actually compute the answer instead of guessing.
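The contrast is easy to reproduce with any instruction-tuned model. Here is a minimal sketch; the model name is just an example, so swap in whatever small instruct model you have available:

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/Qwen2.5-0.5B-Instruct"   # example; any small instruct model works
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

question = "What's 17 x 24?"

# The only difference between the two prompts is the instruction to reason first.
prompts = {
    "direct": f"{question} Answer with just the number.",
    "cot":    f"{question} Let's think step by step, then give the final answer.",
}

for name, prompt in prompts.items():
    messages = [{"role": "user", "content": prompt}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    )
    with torch.no_grad():
        output = model.generate(input_ids, max_new_tokens=200, do_sample=False)
    print(f"--- {name} ---")
    print(tokenizer.decode(output[0, input_ids.shape[-1]:], skip_special_tokens=True))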

The Paradigm Shift: Test-Time Compute

For years, the recipe for better AI was simple: train bigger models on more data.

More parameters → better performance. More training data → better performance.

But we’re hitting walls. Data is finite. Training compute is expensive. Diminishing returns set in.

A new idea has emerged: test-time compute scaling.

Instead of making the model bigger, let it think longer.

The math is compelling. Researchers at Google found that on reasoning problems, using extra compute at inference time can outperform a model that’s 14x larger. (That’s the “Scaling LLM Test-Time Compute” paper from 2024.)

This is the approach powering models like OpenAI’s o1, DeepSeek-R1, and Google’s Gemini 2.0 Flash Thinking. They don’t just generate answers—they reason first.

We’re going to build it ourselves.

What We’ll Build

This section takes you from prompting tricks all the way to training your own reasoning model.

┌─────────────────────────────────────────────────────────────────────┐
│                    PROMPTING-BASED REASONING                        │
│  No training required — just clever prompts                         │
├─────────────────────────────────────────────────────────────────────┤
│  Chain-of-Thought      │ "Let's think step by step"                 │
│  Self-Consistency      │ Sample many chains, vote on the answer     │
│  Tree of Thoughts      │ Explore multiple paths, backtrack          │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    VERIFICATION & SEARCH                            │
│  Train a reward model to guide reasoning                            │
├─────────────────────────────────────────────────────────────────────┤
│  Process Reward Model  │ Score each reasoning step                  │
│  Best-of-N Sampling    │ Generate N solutions, pick the best        │
│  Monte Carlo Search    │ Tree search with learned heuristics        │
└─────────────────────────────────────────────────────────────────────┘
                              │
                              ▼
┌─────────────────────────────────────────────────────────────────────┐
│                    TRAINING REASONING MODELS                        │
│  RL-based techniques for learning to reason                         │
├─────────────────────────────────────────────────────────────────────┤
│  Budget Forcing        │ Force longer thinking with "Wait" tokens   │
│  GRPO                  │ RL without a critic (DeepSeek's approach)  │
│  Distillation          │ Transfer reasoning to smaller models       │
└─────────────────────────────────────────────────────────────────────┘

Each notebook builds on the last. By the end, you’ll understand exactly how models like o1 and DeepSeek-R1 work—not just conceptually, but with working code.
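As a small preview of where this is headed, the core of self-consistency (notebook 02) fits in a few lines: sample several chains of thought, parse each chain's final answer, and take the majority vote. The generate_chain and final_answer helpers below are placeholders for code we build later:

from collections import Counter

def self_consistency(question, generate_chain, final_answer, n_samples=16):
    """Sample n reasoning chains and majority-vote over their final answers."""
    answers = []
    for _ in range(n_samples):
        chain = generate_chain(question)      # one sampled chain-of-thought (placeholder)
        answers.append(final_answer(chain))   # e.g. the number after "Answer:" (placeholder)
    votes = Counter(answers)
    best_answer, count = votes.most_common(1)[0]
    return best_answer, count / n_samples     # the answer and its vote share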

Scope and Limitations

OpenAI’s o1 was trained on massive amounts of compute with carefully curated data. We’re not going to replicate that.

What we are going to do:

  1. Understand the principles — What makes these techniques work?

  2. Implement them from scratch — Real code, not hand-wavy pseudocode

  3. See them in action — On problems small enough to run on your laptop

  4. Build intuition — So you can apply these ideas to your own projects

The goal is understanding, not production deployment.

Prerequisites

This section assumes you’ve worked through the earlier parts of this book, or have equivalent knowledge:

  • Transformers — How attention and generation work

  • Fine-tuning — SFT, reward models, the basics of RLHF

  • PyTorch — Comfortable with tensors, autograd, training loops

If you’re coming from the “Fine-Tuning a Transformer” section, you’re in great shape. We’ll build directly on those concepts—especially reward modeling, which becomes process reward modeling in this context.

# Let's make sure we have what we need
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoTokenizer, AutoModelForCausalLM
import numpy as np
import matplotlib.pyplot as plt

print(f"PyTorch version: {torch.__version__}")
print(f"CUDA available: {torch.cuda.is_available()}")

if torch.cuda.is_available():
    print(f"CUDA device: {torch.cuda.get_device_name(0)}")
elif hasattr(torch.backends, 'mps') and torch.backends.mps.is_available():
    print("MPS (Apple Silicon) available")
else:
    print("Running on CPU (slower but still works!)")

print("\nAll set! Let's teach some transformers to think.")
PyTorch version: 2.10.0.dev20251124+rocm7.1
CUDA available: True
CUDA device: Radeon RX 7900 XTX

All set! Let's teach some transformers to think.

The Roadmap

| Notebook | Topic | Key Idea |
|----------|-------|----------|
| 01 | Chain-of-Thought | Prompting models to show their work |
| 02 | Self-Consistency | Sampling multiple chains, majority voting |
| 03 | Tree of Thoughts | Exploring and pruning reasoning paths |
| 04 | Process Reward Models | Scoring individual reasoning steps |
| 05 | Best-of-N with Verification | Using PRMs to select best solutions |
| 06 | Monte Carlo Tree Search | Search algorithms for reasoning |
| 07 | Budget Forcing | Controlling reasoning length with “Wait” tokens |
| 08 | GRPO Training | RL without a critic (DeepSeek’s method) |
| 09 | Reasoning Distillation | Transferring reasoning to smaller models |

Each notebook is self-contained, but they build on each other conceptually. Going in order is recommended.

Let’s start with the technique that kicked off this field: Chain-of-Thought prompting.

References

This section draws on a lot of recent research. Here are the key papers if you want to go deeper:

Foundational:

  • Wei et al. (2022), “Chain-of-Thought Prompting Elicits Reasoning in Large Language Models”

  • Kojima et al. (2022), “Large Language Models are Zero-Shot Reasoners”

  • Wang et al. (2023), “Self-Consistency Improves Chain of Thought Reasoning in Language Models”

  • Yao et al. (2023), “Tree of Thoughts: Deliberate Problem Solving with Large Language Models”

Verification:

  • Cobbe et al. (2021), “Training Verifiers to Solve Math Word Problems”

  • Lightman et al. (2023), “Let’s Verify Step by Step”

Search:

  • Hao et al. (2023), “Reasoning with Language Model is Planning with World Model”

Training Reasoning Models:

  • Shao et al. (2024), “DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models” (introduces GRPO)

  • DeepSeek-AI (2025), “DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning”

  • Muennighoff et al. (2025), “s1: Simple Test-Time Scaling”

Scaling:

  • Snell et al. (2024), “Scaling LLM Test-Time Compute Optimally Can Be More Effective Than Scaling Model Parameters”

Don’t worry about reading all of these. We’ll cover the key ideas as we go.