Skip to article frontmatterSkip to article content
Site not loading correctly?

This may be due to an incorrect BASE_URL configuration. See the MyST Documentation for reference.

Tree of Thoughts

Chain-of-thought is linear: one step follows another, marching toward an answer. But human problem-solving isn’t linear. We explore options, hit dead ends, backtrack, and try different approaches.

Tree of Thoughts (ToT) brings that flexibility to language models.

From Chain to Tree

Compare the two paradigms:

Chain-of-Thought:

Question → Step 1 → Step 2 → Step 3 → Answer
                                      (hope it's right!)

Tree of Thoughts:

                    Question
                       │
           ┌───────────┼───────────┐
           ▼           ▼           ▼
        Step 1a     Step 1b     Step 1c
           │           │           ✗ (bad idea, prune)
       ┌───┴───┐   ┌───┴───┐
       ▼       ▼   ▼       ▼
    Step 2a  Step 2b  Step 2c  Step 2d
       │       ✗       │       │
       ▼               ▼       ▼
    Answer          Answer  Answer
                       ✓

Key differences:

  1. Branching: At each step, generate multiple possible continuations

  2. Evaluation: Score each branch to identify promising vs. hopeless paths

  3. Pruning: Abandon bad branches early (save compute!)

  4. Backtracking: If a path leads nowhere, try a different one

The ToT Framework

Tree of Thoughts has four key components:

1. Thought Decomposition

How do we break the problem into “thoughts” (intermediate steps)?

  • For math: each computation step

  • For writing: each paragraph or idea

  • For puzzles: each move or decision

2. Thought Generation

How do we generate candidate thoughts at each step?

  • Sample: Generate N thoughts independently (like self-consistency)

  • Propose: Have the model propose several alternatives in one call

3. Thought Evaluation

How do we know which thoughts are promising?

  • Self-evaluation: Ask the model to rate its own thoughts

  • Voting: Generate continuations, see which lead to success

  • Heuristics: Problem-specific scoring functions

4. Search Algorithm

How do we navigate the tree?

  • Breadth-First Search (BFS): Explore all options at each level before going deeper

  • Depth-First Search (DFS): Go deep first, backtrack on failure

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from dataclasses import dataclass
from typing import List, Optional, Callable
import re

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
print(f"Loading {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

print(f"Loaded on {device}")
Loading Qwen/Qwen2.5-1.5B-Instruct...
Loaded on cuda
@dataclass
class ThoughtNode:
    """
    A node in the Tree of Thoughts.
    
    Each node represents a partial solution state.
    """
    thought: str           # The reasoning step at this node
    state: str             # Full state (all thoughts so far)
    score: float = 0.0     # Evaluation score
    depth: int = 0         # How deep in the tree
    parent: Optional['ThoughtNode'] = None
    children: List['ThoughtNode'] = None
    
    def __post_init__(self):
        if self.children is None:
            self.children = []
    
    def get_path(self) -> List[str]:
        """Get the full reasoning path from root to this node."""
        path = []
        node = self
        while node is not None:
            path.append(node.thought)
            node = node.parent
        return list(reversed(path))


def generate_text(prompt: str, max_new_tokens: int = 50, 
                  temperature: float = 0.7) -> str:
    """Generate text continuation."""
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return full_response[len(prompt):].strip()


print("ThoughtNode defined.")
ThoughtNode defined.

Example: Game of 24

The original ToT paper used the “Game of 24” as a key benchmark. The rules:

  • You’re given 4 numbers (e.g., 1, 2, 3, 4)

  • Use +, -, ×, ÷ to make 24

  • Use each number exactly once

Example: 1, 2, 3, 4 → (1 + 2 + 3) × 4 = 24 ✓

This is perfect for ToT because:

  • There are many possible paths (which operation to try first?)

  • Some paths are dead ends (can’t reach 24)

  • We can evaluate intermediate states (“is 24 still reachable?”)

def generate_thoughts(state: str, n_thoughts: int = 3,
                      problem_type: str = "math") -> List[str]:
    """
    Generate multiple candidate next thoughts.
    
    We ask the model to propose several possible next steps.
    
    Args:
        state: Current reasoning state
        n_thoughts: Number of candidates to generate
        problem_type: Type of problem (affects prompting)
    
    Returns:
        List of candidate next thoughts
    """
    prompt = f"""{state}

What are {n_thoughts} different possible next steps? List them as:
1.
2.
3."""
    
    response = generate_text(prompt, max_new_tokens=100, temperature=0.8)
    
    # Parse numbered list
    thoughts = []
    lines = response.split('\n')
    for line in lines:
        # Match lines starting with a number
        match = re.match(r'^\d+\.\s*(.+)', line.strip())
        if match:
            thoughts.append(match.group(1).strip())
    
    # If parsing failed, just split by newlines
    if len(thoughts) < n_thoughts and lines:
        thoughts = [l.strip() for l in lines if l.strip()][:n_thoughts]
    
    return thoughts[:n_thoughts]


# Test thought generation
test_state = """Problem: Use 2, 3, 4, 6 with +, -, ×, ÷ to make 24.
Current: I have numbers 2, 3, 4, 6."""

print("Generating candidate thoughts...\n")
thoughts = generate_thoughts(test_state, n_thoughts=3)
for i, t in enumerate(thoughts, 1):
    print(f"{i}. {t}")
Generating candidate thoughts...

1. To create the number 24 using the numbers 2, 3, 4, and 6 along with the operations of addition (+), subtraction (-), multiplication (×), and division (÷), we need to find a combination that equals 24. Let's explore three different combinations:
2. ### Combination 1:
3. \[ 6 \times 4 - 3 - 2 = 24 \]

Self-Evaluation

The key to ToT is evaluating which thoughts are promising. We can ask the model to rate its own ideas.

For Game of 24, we might ask:

  • “Can you still make 24 from the remaining numbers?”

  • Rate as: sure / likely / unlikely / impossible

def evaluate_thought(state: str, thought: str,
                     evaluation_prompt: str = None) -> float:
    """
    Have the model evaluate a thought on a 0-1 scale.
    
    We ask the model to rate the thought as:
    - sure (1.0): This will definitely work
    - likely (0.7): This looks promising
    - possible (0.4): Maybe, not sure
    - unlikely (0.1): Probably won't work
    
    Args:
        state: Current problem state
        thought: The candidate thought to evaluate
        evaluation_prompt: Custom evaluation prompt
    
    Returns:
        Score from 0.0 to 1.0
    """
    if evaluation_prompt is None:
        evaluation_prompt = f"""{state}

Proposed next step: {thought}

Evaluate this step. Is it a good approach to solving the problem?
Rate as: sure / likely / possible / unlikely

Rating:"""
    
    response = generate_text(evaluation_prompt, max_new_tokens=10, temperature=0.3)
    response_lower = response.lower().strip()
    
    # Parse the rating
    if 'sure' in response_lower:
        return 1.0
    elif 'likely' in response_lower:
        return 0.7
    elif 'possible' in response_lower:
        return 0.4
    elif 'unlikely' in response_lower or 'impossible' in response_lower:
        return 0.1
    else:
        # Default to middle score if we can't parse
        return 0.5


# Test evaluation
state = "Problem: Find a word that starts with 'A' and means 'happy'."
thoughts = ["The word 'Awesome' might work.", "The word 'Sad' could be it.", "I'll think about 'Afraid'."]

print("Evaluating thoughts:\n")
for thought in thoughts:
    score = evaluate_thought(state, thought)
    print(f"  {thought}")
    print(f"  → Score: {score:.2f}\n")
Evaluating thoughts:

  The word 'Awesome' might work.
  → Score: 0.40

  The word 'Sad' could be it.
  → Score: 0.70

  I'll think about 'Afraid'.
  → Score: 0.40

Breadth-First Search (BFS)

BFS explores the tree level by level:

  1. Start with the initial state

  2. Generate all possible first steps

  3. Evaluate and keep the best kk (“beam width”)

  4. For each kept state, generate next steps

  5. Repeat until reaching a solution or max depth

class TreeOfThoughtsBFS:
    """
    Tree of Thoughts with Breadth-First Search.
    
    Explores all options at each level, keeps top-k,
    then goes deeper.
    """
    
    def __init__(self, model, tokenizer, device="cuda",
                 n_candidates: int = 3, beam_width: int = 2, max_depth: int = 3):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.n_candidates = n_candidates  # How many thoughts to generate per node
        self.beam_width = beam_width      # How many nodes to keep at each level
        self.max_depth = max_depth        # How deep to search
    
    def solve(self, problem: str, is_solution: Callable[[str], bool] = None) -> dict:
        """
        Solve a problem using BFS over thoughts.
        
        Args:
            problem: The problem statement
            is_solution: Function to check if a state is a solution
        
        Returns:
            Dict with best solution, path, and all explored nodes
        """
        # Initialize with root node
        root = ThoughtNode(
            thought="Start",
            state=f"Problem: {problem}\n\nLet me think step by step.",
            score=1.0,
            depth=0
        )
        
        current_level = [root]
        all_nodes = [root]
        best_solution = None
        best_score = -float('inf')
        
        for depth in range(self.max_depth):
            print(f"\nDepth {depth + 1}: Exploring {len(current_level)} nodes...")
            
            next_level = []
            
            for node in current_level:
                # Generate candidate thoughts
                thoughts = generate_thoughts(
                    node.state, 
                    n_thoughts=self.n_candidates
                )
                
                for thought in thoughts:
                    # Create new state
                    new_state = f"{node.state}\n\nStep {depth + 1}: {thought}"
                    
                    # Evaluate
                    score = evaluate_thought(node.state, thought)
                    
                    # Create child node
                    child = ThoughtNode(
                        thought=thought,
                        state=new_state,
                        score=score,
                        depth=depth + 1,
                        parent=node
                    )
                    node.children.append(child)
                    all_nodes.append(child)
                    next_level.append(child)
                    
                    # Check if solution
                    if is_solution and is_solution(new_state):
                        if score > best_score:
                            best_solution = child
                            best_score = score
            
            # Keep only top-k nodes for next level
            next_level.sort(key=lambda n: n.score, reverse=True)
            current_level = next_level[:self.beam_width]
            
            print(f"  Generated {len(next_level)} candidates, kept {len(current_level)}")
            for node in current_level:
                print(f"    Score {node.score:.2f}: {node.thought[:50]}...")
        
        # If no explicit solution found, return best final node
        if best_solution is None and current_level:
            best_solution = max(current_level, key=lambda n: n.score)
        
        return {
            "solution": best_solution,
            "path": best_solution.get_path() if best_solution else [],
            "final_state": best_solution.state if best_solution else None,
            "all_nodes": all_nodes,
            "nodes_explored": len(all_nodes)
        }


# Create ToT solver
tot_bfs = TreeOfThoughtsBFS(
    model, tokenizer, device=device,
    n_candidates=2,  # Generate 2 candidates per node
    beam_width=2,    # Keep top 2 at each level
    max_depth=2      # Go 2 levels deep
)

print("BFS Tree of Thoughts solver ready.")
BFS Tree of Thoughts solver ready.
# Test ToT-BFS on a simple problem
problem = "What is 15 + 28? Show your reasoning."

print(f"Problem: {problem}")
print("\n" + "="*60)

result = tot_bfs.solve(problem)

print("\n" + "="*60)
print("SOLUTION PATH:")
print("="*60)
for i, step in enumerate(result["path"]):
    print(f"Step {i}: {step}")

print(f"\nTotal nodes explored: {result['nodes_explored']}")
Problem: What is 15 + 28? Show your reasoning.

============================================================

Depth 1: Exploring 1 nodes...
  Generated 2 candidates, kept 2
    Score 0.70: Count up 28 more units from 15....
    Score 0.50: Start with 15 on the right side of the plus sign....

Depth 2: Exploring 2 nodes...
  Generated 4 candidates, kept 2
    Score 0.70: Step 2: Determine which of the two sums you get in...
    Score 0.70: Which sum do I want to use for my final answer?...

============================================================
SOLUTION PATH:
============================================================
Step 0: Start
Step 1: Count up 28 more units from 15.
Step 2: Step 2: Determine which of the two sums you get in Step 1 is correct.

Total nodes explored: 7

Depth-First Search (DFS)

DFS explores deeply first, then backtracks:

  1. Start with initial state

  2. Generate candidates for next step

  3. Pick the best one and go deeper

  4. If stuck (bad evaluation), backtrack and try another branch

DFS is more memory-efficient than BFS and can find solutions faster for some problems. But it might miss better solutions on other branches.

class TreeOfThoughtsDFS:
    """
    Tree of Thoughts with Depth-First Search.
    
    Explores deeply first, backtracks on failure.
    """
    
    def __init__(self, model, tokenizer, device="cuda",
                 n_candidates: int = 3, max_depth: int = 4,
                 prune_threshold: float = 0.2):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.n_candidates = n_candidates
        self.max_depth = max_depth
        self.prune_threshold = prune_threshold  # Don't explore if score < this
        self.nodes_explored = 0
    
    def dfs(self, node: ThoughtNode, is_solution: Callable = None) -> Optional[ThoughtNode]:
        """
        Recursive DFS exploration.
        
        Returns the first solution found, or None.
        """
        self.nodes_explored += 1
        
        # Check depth limit
        if node.depth >= self.max_depth:
            return node if is_solution is None or is_solution(node.state) else None
        
        # Check if already a solution
        if is_solution and is_solution(node.state):
            return node
        
        # Generate and evaluate candidates
        thoughts = generate_thoughts(node.state, n_thoughts=self.n_candidates)
        candidates = []
        
        for thought in thoughts:
            new_state = f"{node.state}\n\nStep {node.depth + 1}: {thought}"
            score = evaluate_thought(node.state, thought)
            
            # Prune low-scoring candidates
            if score < self.prune_threshold:
                continue
                
            candidates.append((thought, new_state, score))
        
        # Sort by score, explore best first
        candidates.sort(key=lambda x: x[2], reverse=True)
        
        for thought, new_state, score in candidates:
            child = ThoughtNode(
                thought=thought,
                state=new_state,
                score=score,
                depth=node.depth + 1,
                parent=node
            )
            node.children.append(child)
            
            # Recurse
            result = self.dfs(child, is_solution)
            if result is not None:
                return result
        
        # No solution found in this subtree
        return None
    
    def solve(self, problem: str, is_solution: Callable = None) -> dict:
        """
        Solve a problem using DFS.
        """
        self.nodes_explored = 0
        
        root = ThoughtNode(
            thought="Start",
            state=f"Problem: {problem}\n\nLet me think step by step.",
            score=1.0,
            depth=0
        )
        
        solution = self.dfs(root, is_solution)
        
        return {
            "solution": solution,
            "path": solution.get_path() if solution else [],
            "final_state": solution.state if solution else None,
            "nodes_explored": self.nodes_explored
        }


# Create DFS solver
tot_dfs = TreeOfThoughtsDFS(
    model, tokenizer, device=device,
    n_candidates=2,
    max_depth=3,
    prune_threshold=0.2
)

print("DFS Tree of Thoughts solver ready.")
DFS Tree of Thoughts solver ready.
# Compare BFS vs DFS
problem = "A store has 50 items. They sell 15 and receive 20 more. How many items do they have?"

print(f"Problem: {problem}")
print(f"Correct answer: 50 - 15 + 20 = 55")

print("\n" + "="*60)
print("BFS APPROACH:")
print("="*60)
bfs_result = tot_bfs.solve(problem)
print(f"\nNodes explored: {bfs_result['nodes_explored']}")

print("\n" + "="*60)
print("DFS APPROACH:")
print("="*60)
dfs_result = tot_dfs.solve(problem)
print(f"\nNodes explored: {dfs_result['nodes_explored']}")
Problem: A store has 50 items. They sell 15 and receive 20 more. How many items do they have?
Correct answer: 50 - 15 + 20 = 55

============================================================
BFS APPROACH:
============================================================

Depth 1: Exploring 1 nodes...
  Generated 2 candidates, kept 2
    Score 0.70: Start with the initial number of items, which is 5...
    Score 0.70: Subtract the number of items sold, which is 15....

Depth 2: Exploring 2 nodes...
  Generated 4 candidates, kept 2
    Score 0.70: Now let's solve the problem:...
    Score 0.70: We start with 50 items in the store. The store sel...

Nodes explored: 7

============================================================
DFS APPROACH:
============================================================

Nodes explored: 4

A Simpler Approach: Prompt-Based ToT

The full ToT framework with explicit search algorithms is powerful but complex. There’s a simpler version that works surprisingly well:

Just... ask the model to do it!

The idea (from Hulbert, 2023): prompt the model to imagine multiple experts, each contributing one step, with explicit backtracking.

Imagine three different experts are answering this question.
All experts will write down 1 step of their thinking,
then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realizes they're wrong at any point then they leave.

This encodes the ToT idea entirely in the prompt!

def prompt_based_tot(problem: str, n_experts: int = 3) -> str:
    """
    Tree of Thoughts through clever prompting.
    
    No explicit search algorithm — just ask the model to simulate
    multiple reasoning paths with self-critique.
    """
    prompt = f"""Imagine {n_experts} different experts are answering this question.
All experts will write down 1 step of their thinking, then share it with the group.
Then all experts will go on to the next step, etc.
If any expert realizes they're wrong at any point, they leave.
Continue until only one expert remains or all agree on the answer.

Question: {problem}

Expert 1:"""
    
    response = generate_text(prompt, max_new_tokens=300, temperature=0.7)
    return prompt + response


# Test prompt-based ToT
problem = "If I have 3 apples and buy 2 bags with 4 apples each, how many apples do I have?"

print("="*60)
print("PROMPT-BASED TREE OF THOUGHTS")
print("="*60)
print(f"Problem: {problem}")
print(f"Correct: 3 + 2×4 = 11")
print("\n" + "-"*60 + "\n")

result = prompt_based_tot(problem, n_experts=3)
# Just show the response part
print(result.split("Expert 1:")[1] if "Expert 1:" in result else result)
============================================================
PROMPT-BASED TREE OF THOUGHTS
============================================================
Problem: If I have 3 apples and buy 2 bags with 4 apples each, how many apples do I have?
Correct: 3 + 2×4 = 11

------------------------------------------------------------

Start by adding the number of apples you already have (3) to the number of apples in the bags (2*4=8). So, 3+8 = 11 apples. 

Expert 2: Let's break it down step-by-step: First, add the number of apples from your bag (2) to the number you already have (3), which gives us 5 apples. Then, add the additional apples from the second bag (which is 4 more than what we just calculated) to get a total of 9 apples.

Expert 3: Let's approach this logically: You started with 3 apples. When you bought 2 bags with 4 apples each, that means you added 4 apples to each bag twice, making it 4*2=8 apples. Adding these new apples to your original amount of 3 apples gives us 3+8=11 apples.

The final expert left after realizing the mistake in Expert 3's calculation.

What was the correct answer? How did you arrive at it?
- **Correct Answer:** The correct answer is 11 apples. This can be derived by simply performing the arithmetic operation mentioned in Expert 3's method: \(3 + 4 \times 2 = 3 + 8 = 11\).

**Explanation of Calculation:** 
1. You start with 3 apples.
2. Each bag contains 4 apples, so for two bags, you add

When to Use ToT

Tree of Thoughts shines in specific scenarios:

Good for ToT:

  • Planning problems — Multiple strategies, some better than others

  • Creative tasks — Brainstorming with refinement

  • Puzzles — Clear win/lose states, backtracking helps

  • Math with multiple approaches — Different paths to same answer

Overkill for ToT:

  • Simple factual questions — No need for exploration

  • Single-step problems — Tree is trivial

  • Time-sensitive applications — Exploration takes time

Compared to Self-Consistency:

  • ToT: Smart exploration, can backtrack, uses evaluation

  • Self-Consistency: Dumb sampling, just vote on final answers

ToT is more powerful but more complex. Self-consistency is simpler but wastes compute exploring bad paths.

Benchmark Results

From the original ToT paper (Yao et al., 2023):

TaskStandardCoTCoT-SCToT
Game of 247.3%4.0%9.0%74%
Creative Writing (coherence)6.26.97.07.6
Mini Crosswords4.4%15.6%16.4%60%

(Using GPT-4)

The Game of 24 result is striking: from 4% with CoT to 74% with ToT! That’s the power of exploration and backtracking.

But note: ToT uses many more API calls. The paper reports ~100 LLM calls per problem for Game of 24. That’s 100× the cost of single-shot prompting.

What We’ve Learned

Tree of Thoughts extends chain-of-thought reasoning:

  1. Branch: Generate multiple candidate thoughts at each step

  2. Evaluate: Score each candidate (self-evaluation or heuristics)

  3. Search: Use BFS or DFS to explore the tree

  4. Prune: Abandon low-scoring branches early

  5. Backtrack: When stuck, try different paths

Key trade-offs:

  • BFS: Thorough but memory-intensive

  • DFS: Efficient but might miss good solutions

  • Prompt-based: Simple but less controllable

When evaluation is unreliable (the model can’t score its own thoughts well), ToT becomes less useful. That’s why the next section on Process Reward Models is so important—we’ll train a dedicated model to evaluate reasoning steps.

Next up: Process Reward Models — training verifiers to score each step