Chain-of-thought is linear: one step follows another, marching toward an answer. But human problem-solving isn’t linear. We explore options, hit dead ends, backtrack, and try different approaches.
Tree of Thoughts (ToT) brings that flexibility to language models.
From Chain to Tree¶
Compare the two paradigms:
Chain-of-Thought:
Question → Step 1 → Step 2 → Step 3 → Answer
(hope it's right!)Tree of Thoughts:
Question
│
┌───────────┼───────────┐
▼ ▼ ▼
Step 1a Step 1b Step 1c
│ │ ✗ (bad idea, prune)
┌───┴───┐ ┌───┴───┐
▼ ▼ ▼ ▼
Step 2a Step 2b Step 2c Step 2d
│ ✗ │ │
▼ ▼ ▼
Answer Answer Answer
✓Key differences:
Branching: At each step, generate multiple possible continuations
Evaluation: Score each branch to identify promising vs. hopeless paths
Pruning: Abandon bad branches early (save compute!)
Backtracking: If a path leads nowhere, try a different one
The ToT Framework¶
Tree of Thoughts has four key components:
1. Thought Decomposition¶
How do we break the problem into “thoughts” (intermediate steps)?
For math: each computation step
For writing: each paragraph or idea
For puzzles: each move or decision
2. Thought Generation¶
How do we generate candidate thoughts at each step?
Sample: Generate N thoughts independently (like self-consistency)
Propose: Have the model propose several alternatives in one call
3. Thought Evaluation¶
How do we know which thoughts are promising?
Self-evaluation: Ask the model to rate its own thoughts
Voting: Generate continuations, see which lead to success
Heuristics: Problem-specific scoring functions
4. Search Algorithm¶
How do we navigate the tree?
Breadth-First Search (BFS): Explore all options at each level before going deeper
Depth-First Search (DFS): Go deep first, backtrack on failure
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from dataclasses import dataclass
from typing import List, Optional, Callable
import re
# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()
print(f"Loaded on {device}")Loading Qwen/Qwen2.5-1.5B-Instruct...
Loaded on cuda
@dataclass
class ThoughtNode:
"""
A node in the Tree of Thoughts.
Each node represents a partial solution state.
"""
thought: str # The reasoning step at this node
state: str # Full state (all thoughts so far)
score: float = 0.0 # Evaluation score
depth: int = 0 # How deep in the tree
parent: Optional['ThoughtNode'] = None
children: List['ThoughtNode'] = None
def __post_init__(self):
if self.children is None:
self.children = []
def get_path(self) -> List[str]:
"""Get the full reasoning path from root to this node."""
path = []
node = self
while node is not None:
path.append(node.thought)
node = node.parent
return list(reversed(path))
def generate_text(prompt: str, max_new_tokens: int = 50,
temperature: float = 0.7) -> str:
"""Generate text continuation."""
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
outputs = model.generate(
**inputs,
max_new_tokens=max_new_tokens,
temperature=temperature,
do_sample=True,
pad_token_id=tokenizer.eos_token_id,
)
full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
return full_response[len(prompt):].strip()
print("ThoughtNode defined.")ThoughtNode defined.
Example: Game of 24¶
The original ToT paper used the “Game of 24” as a key benchmark. The rules:
You’re given 4 numbers (e.g., 1, 2, 3, 4)
Use +, -, ×, ÷ to make 24
Use each number exactly once
Example: 1, 2, 3, 4 → (1 + 2 + 3) × 4 = 24 ✓
This is perfect for ToT because:
There are many possible paths (which operation to try first?)
Some paths are dead ends (can’t reach 24)
We can evaluate intermediate states (“is 24 still reachable?”)
def generate_thoughts(state: str, n_thoughts: int = 3,
problem_type: str = "math") -> List[str]:
"""
Generate multiple candidate next thoughts.
We ask the model to propose several possible next steps.
Args:
state: Current reasoning state
n_thoughts: Number of candidates to generate
problem_type: Type of problem (affects prompting)
Returns:
List of candidate next thoughts
"""
prompt = f"""{state}
What are {n_thoughts} different possible next steps? List them as:
1.
2.
3."""
response = generate_text(prompt, max_new_tokens=100, temperature=0.8)
# Parse numbered list
thoughts = []
lines = response.split('\n')
for line in lines:
# Match lines starting with a number
match = re.match(r'^\d+\.\s*(.+)', line.strip())
if match:
thoughts.append(match.group(1).strip())
# If parsing failed, just split by newlines
if len(thoughts) < n_thoughts and lines:
thoughts = [l.strip() for l in lines if l.strip()][:n_thoughts]
return thoughts[:n_thoughts]
# Test thought generation
test_state = """Problem: Use 2, 3, 4, 6 with +, -, ×, ÷ to make 24.
Current: I have numbers 2, 3, 4, 6."""
print("Generating candidate thoughts...\n")
thoughts = generate_thoughts(test_state, n_thoughts=3)
for i, t in enumerate(thoughts, 1):
print(f"{i}. {t}")Generating candidate thoughts...
1. To create the number 24 using the numbers 2, 3, 4, and 6 along with the operations of addition (+), subtraction (-), multiplication (×), and division (÷), we need to find a combination that equals 24. Let's explore three different combinations:
2. ### Combination 1:
3. \[ 6 \times 4 - 3 - 2 = 24 \]
Self-Evaluation¶
The key to ToT is evaluating which thoughts are promising. We can ask the model to rate its own ideas.
For Game of 24, we might ask:
“Can you still make 24 from the remaining numbers?”
Rate as: sure / likely / unlikely / impossible
def evaluate_thought(state: str, thought: str,
evaluation_prompt: str = None) -> float:
"""
Have the model evaluate a thought on a 0-1 scale.
We ask the model to rate the thought as:
- sure (1.0): This will definitely work
- likely (0.7): This looks promising
- possible (0.4): Maybe, not sure
- unlikely (0.1): Probably won't work
Args:
state: Current problem state
thought: The candidate thought to evaluate
evaluation_prompt: Custom evaluation prompt
Returns:
Score from 0.0 to 1.0
"""
if evaluation_prompt is None:
evaluation_prompt = f"""{state}
Proposed next step: {thought}
Evaluate this step. Is it a good approach to solving the problem?
Rate as: sure / likely / possible / unlikely
Rating:"""
response = generate_text(evaluation_prompt, max_new_tokens=10, temperature=0.3)
response_lower = response.lower().strip()
# Parse the rating
if 'sure' in response_lower:
return 1.0
elif 'likely' in response_lower:
return 0.7
elif 'possible' in response_lower:
return 0.4
elif 'unlikely' in response_lower or 'impossible' in response_lower:
return 0.1
else:
# Default to middle score if we can't parse
return 0.5
# Test evaluation
state = "Problem: Find a word that starts with 'A' and means 'happy'."
thoughts = ["The word 'Awesome' might work.", "The word 'Sad' could be it.", "I'll think about 'Afraid'."]
print("Evaluating thoughts:\n")
for thought in thoughts:
score = evaluate_thought(state, thought)
print(f" {thought}")
print(f" → Score: {score:.2f}\n")Evaluating thoughts:
The word 'Awesome' might work.
→ Score: 0.40
The word 'Sad' could be it.
→ Score: 0.70
I'll think about 'Afraid'.
→ Score: 0.40
Breadth-First Search (BFS)¶
BFS explores the tree level by level:
Start with the initial state
Generate all possible first steps
Evaluate and keep the best (“beam width”)
For each kept state, generate next steps
Repeat until reaching a solution or max depth
class TreeOfThoughtsBFS:
"""
Tree of Thoughts with Breadth-First Search.
Explores all options at each level, keeps top-k,
then goes deeper.
"""
def __init__(self, model, tokenizer, device="cuda",
n_candidates: int = 3, beam_width: int = 2, max_depth: int = 3):
self.model = model
self.tokenizer = tokenizer
self.device = device
self.n_candidates = n_candidates # How many thoughts to generate per node
self.beam_width = beam_width # How many nodes to keep at each level
self.max_depth = max_depth # How deep to search
def solve(self, problem: str, is_solution: Callable[[str], bool] = None) -> dict:
"""
Solve a problem using BFS over thoughts.
Args:
problem: The problem statement
is_solution: Function to check if a state is a solution
Returns:
Dict with best solution, path, and all explored nodes
"""
# Initialize with root node
root = ThoughtNode(
thought="Start",
state=f"Problem: {problem}\n\nLet me think step by step.",
score=1.0,
depth=0
)
current_level = [root]
all_nodes = [root]
best_solution = None
best_score = -float('inf')
for depth in range(self.max_depth):
print(f"\nDepth {depth + 1}: Exploring {len(current_level)} nodes...")
next_level = []
for node in current_level:
# Generate candidate thoughts
thoughts = generate_thoughts(
node.state,
n_thoughts=self.n_candidates
)
for thought in thoughts:
# Create new state
new_state = f"{node.state}\n\nStep {depth + 1}: {thought}"
# Evaluate
score = evaluate_thought(node.state, thought)
# Create child node
child = ThoughtNode(
thought=thought,
state=new_state,
score=score,
depth=depth + 1,
parent=node
)
node.children.append(child)
all_nodes.append(child)
next_level.append(child)
# Check if solution
if is_solution and is_solution(new_state):
if score > best_score:
best_solution = child
best_score = score
# Keep only top-k nodes for next level
next_level.sort(key=lambda n: n.score, reverse=True)
current_level = next_level[:self.beam_width]
print(f" Generated {len(next_level)} candidates, kept {len(current_level)}")
for node in current_level:
print(f" Score {node.score:.2f}: {node.thought[:50]}...")
# If no explicit solution found, return best final node
if best_solution is None and current_level:
best_solution = max(current_level, key=lambda n: n.score)
return {
"solution": best_solution,
"path": best_solution.get_path() if best_solution else [],
"final_state": best_solution.state if best_solution else None,
"all_nodes": all_nodes,
"nodes_explored": len(all_nodes)
}
# Create ToT solver
tot_bfs = TreeOfThoughtsBFS(
model, tokenizer, device=device,
n_candidates=2, # Generate 2 candidates per node
beam_width=2, # Keep top 2 at each level
max_depth=2 # Go 2 levels deep
)
print("BFS Tree of Thoughts solver ready.")BFS Tree of Thoughts solver ready.
# Test ToT-BFS on a simple problem
problem = "What is 15 + 28? Show your reasoning."
print(f"Problem: {problem}")
print("\n" + "="*60)
result = tot_bfs.solve(problem)
print("\n" + "="*60)
print("SOLUTION PATH:")
print("="*60)
for i, step in enumerate(result["path"]):
print(f"Step {i}: {step}")
print(f"\nTotal nodes explored: {result['nodes_explored']}")Problem: What is 15 + 28? Show your reasoning.
============================================================
Depth 1: Exploring 1 nodes...
Generated 2 candidates, kept 2
Score 0.70: Count up 28 more units from 15....
Score 0.50: Start with 15 on the right side of the plus sign....
Depth 2: Exploring 2 nodes...
Generated 4 candidates, kept 2
Score 0.70: Step 2: Determine which of the two sums you get in...
Score 0.70: Which sum do I want to use for my final answer?...
============================================================
SOLUTION PATH:
============================================================
Step 0: Start
Step 1: Count up 28 more units from 15.
Step 2: Step 2: Determine which of the two sums you get in Step 1 is correct.
Total nodes explored: 7
Depth-First Search (DFS)¶
DFS explores deeply first, then backtracks:
Start with initial state
Generate candidates for next step
Pick the best one and go deeper
If stuck (bad evaluation), backtrack and try another branch
DFS is more memory-efficient than BFS and can find solutions faster for some problems. But it might miss better solutions on other branches.
class TreeOfThoughtsDFS:
"""
Tree of Thoughts with Depth-First Search.
Explores deeply first, backtracks on failure.
"""
def __init__(self, model, tokenizer, device="cuda",
n_candidates: int = 3, max_depth: int = 4,
prune_threshold: float = 0.2):
self.model = model
self.tokenizer = tokenizer
self.device = device
self.n_candidates = n_candidates
self.max_depth = max_depth
self.prune_threshold = prune_threshold # Don't explore if score < this
self.nodes_explored = 0
def dfs(self, node: ThoughtNode, is_solution: Callable = None) -> Optional[ThoughtNode]:
"""
Recursive DFS exploration.
Returns the first solution found, or None.
"""
self.nodes_explored += 1
# Check depth limit
if node.depth >= self.max_depth:
return node if is_solution is None or is_solution(node.state) else None
# Check if already a solution
if is_solution and is_solution(node.state):
return node
# Generate and evaluate candidates
thoughts = generate_thoughts(node.state, n_thoughts=self.n_candidates)
candidates = []
for thought in thoughts:
new_state = f"{node.state}\n\nStep {node.depth + 1}: {thought}"
score = evaluate_thought(node.state, thought)
# Prune low-scoring candidates
if score < self.prune_threshold:
continue
candidates.append((thought, new_state, score))
# Sort by score, explore best first
candidates.sort(key=lambda x: x[2], reverse=True)
for thought, new_state, score in candidates:
child = ThoughtNode(
thought=thought,
state=new_state,
score=score,
depth=node.depth + 1,
parent=node
)
node.children.append(child)
# Recurse
result = self.dfs(child, is_solution)
if result is not None:
return result
# No solution found in this subtree
return None
def solve(self, problem: str, is_solution: Callable = None) -> dict:
"""
Solve a problem using DFS.
"""
self.nodes_explored = 0
root = ThoughtNode(
thought="Start",
state=f"Problem: {problem}\n\nLet me think step by step.",
score=1.0,
depth=0
)
solution = self.dfs(root, is_solution)
return {
"solution": solution,
"path": solution.get_path() if solution else [],
"final_state": solution.state if solution else None,
"nodes_explored": self.nodes_explored
}
# Create DFS solver
tot_dfs = TreeOfThoughtsDFS(
model, tokenizer, device=device,
n_candidates=2,
max_depth=3,
prune_threshold=0.2
)
print("DFS Tree of Thoughts solver ready.")DFS Tree of Thoughts solver ready.
# Compare BFS vs DFS
problem = "A store has 50 items. They sell 15 and receive 20 more. How many items do they have?"
print(f"Problem: {problem}")
print(f"Correct answer: 50 - 15 + 20 = 55")
print("\n" + "="*60)
print("BFS APPROACH:")
print("="*60)
bfs_result = tot_bfs.solve(problem)
print(f"\nNodes explored: {bfs_result['nodes_explored']}")
print("\n" + "="*60)
print("DFS APPROACH:")
print("="*60)
dfs_result = tot_dfs.solve(problem)
print(f"\nNodes explored: {dfs_result['nodes_explored']}")Problem: A store has 50 items. They sell 15 and receive 20 more. How many items do they have?
Correct answer: 50 - 15 + 20 = 55
============================================================
BFS APPROACH:
============================================================
Depth 1: Exploring 1 nodes...
Generated 2 candidates, kept 2
Score 0.70: Start with the initial number of items, which is 5...
Score 0.70: Subtract the number of items sold, which is 15....
Depth 2: Exploring 2 nodes...
Generated 4 candidates, kept 2
Score 0.70: Now let's solve the problem:...
Score 0.70: We start with 50 items in the store. The store sel...
Nodes explored: 7
============================================================
DFS APPROACH:
============================================================
Nodes explored: 4
A Simpler Approach: Prompt-Based ToT¶
The full ToT framework with explicit search algorithms is powerful but complex. There’s a simpler version that works surprisingly well:
Just... ask the model to do it!
The idea (from Hulbert, 2023): prompt the model to imagine multiple experts, each contributing one step, with explicit backtracking.
Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realizes they're wrong at any point then they leave.This encodes the ToT idea entirely in the prompt!
def prompt_based_tot(problem: str, n_experts: int = 3) -> str:
"""
Tree of Thoughts through clever prompting.
No explicit search algorithm — just ask the model to simulate
multiple reasoning paths with self-critique.
"""
prompt = f"""Imagine {n_experts} different experts are answering this question.
All experts will write down 1 step of their thinking, then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realizes they're wrong at any point, they leave.
Continue until only one expert remains or all agree on the answer.
Question: {problem}
Expert 1:"""
response = generate_text(prompt, max_new_tokens=300, temperature=0.7)
return prompt + response
# Test prompt-based ToT
problem = "If I have 3 apples and buy 2 bags with 4 apples each, how many apples do I have?"
print("="*60)
print("PROMPT-BASED TREE OF THOUGHTS")
print("="*60)
print(f"Problem: {problem}")
print(f"Correct: 3 + 2×4 = 11")
print("\n" + "-"*60 + "\n")
result = prompt_based_tot(problem, n_experts=3)
# Just show the response part
print(result.split("Expert 1:")[1] if "Expert 1:" in result else result)============================================================
PROMPT-BASED TREE OF THOUGHTS
============================================================
Problem: If I have 3 apples and buy 2 bags with 4 apples each, how many apples do I have?
Correct: 3 + 2×4 = 11
------------------------------------------------------------
Start by adding the number of apples you already have (3) to the number of apples in the bags (2*4=8). So, 3+8 = 11 apples.
Expert 2: Let's break it down step-by-step: First, add the number of apples from your bag (2) to the number you already have (3), which gives us 5 apples. Then, add the additional apples from the second bag (which is 4 more than what we just calculated) to get a total of 9 apples.
Expert 3: Let's approach this logically: You started with 3 apples. When you bought 2 bags with 4 apples each, that means you added 4 apples to each bag twice, making it 4*2=8 apples. Adding these new apples to your original amount of 3 apples gives us 3+8=11 apples.
The final expert left after realizing the mistake in Expert 3's calculation.
What was the correct answer? How did you arrive at it?
- **Correct Answer:** The correct answer is 11 apples. This can be derived by simply performing the arithmetic operation mentioned in Expert 3's method: \(3 + 4 \times 2 = 3 + 8 = 11\).
**Explanation of Calculation:**
1. You start with 3 apples.
2. Each bag contains 4 apples, so for two bags, you add
When to Use ToT¶
Tree of Thoughts shines in specific scenarios:
Good for ToT:¶
Planning problems — Multiple strategies, some better than others
Creative tasks — Brainstorming with refinement
Puzzles — Clear win/lose states, backtracking helps
Math with multiple approaches — Different paths to same answer
Overkill for ToT:¶
Simple factual questions — No need for exploration
Single-step problems — Tree is trivial
Time-sensitive applications — Exploration takes time
Compared to Self-Consistency:¶
ToT: Smart exploration, can backtrack, uses evaluation
Self-Consistency: Dumb sampling, just vote on final answers
ToT is more powerful but more complex. Self-consistency is simpler but wastes compute exploring bad paths.
Benchmark Results¶
From the original ToT paper (Yao et al., 2023):
| Task | Standard | CoT | CoT-SC | ToT |
|---|---|---|---|---|
| Game of 24 | 7.3% | 4.0% | 9.0% | 74% |
| Creative Writing (coherence) | 6.2 | 6.9 | 7.0 | 7.6 |
| Mini Crosswords | 4.4% | 15.6% | 16.4% | 60% |
(Using GPT-4)
The Game of 24 result is striking: from 4% with CoT to 74% with ToT! That’s the power of exploration and backtracking.
But note: ToT uses many more API calls. The paper reports ~100 LLM calls per problem for Game of 24. That’s 100× the cost of single-shot prompting.
What We’ve Learned¶
Tree of Thoughts extends chain-of-thought reasoning:
Branch: Generate multiple candidate thoughts at each step
Evaluate: Score each candidate (self-evaluation or heuristics)
Search: Use BFS or DFS to explore the tree
Prune: Abandon low-scoring branches early
Backtrack: When stuck, try different paths
Key trade-offs:
BFS: Thorough but memory-intensive
DFS: Efficient but might miss good solutions
Prompt-based: Simple but less controllable
When evaluation is unreliable (the model can’t score its own thoughts well), ToT becomes less useful. That’s why the next section on Process Reward Models is so important—we’ll train a dedicated model to evaluate reasoning steps.
Next up: Process Reward Models — training verifiers to score each step