
Chain-of-Thought Prompting

In January 2022, researchers at Google published a paper with a deceptively simple idea: what if we just asked the model to show its work?

The results changed the field.

The Core Idea

Chain-of-Thought (CoT) prompting is exactly what it sounds like: prompting the model to generate a “chain” of reasoning steps before arriving at its final answer.

Instead of:

Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many does he have now?
A: 11

You get:

Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many does he have now?
A: Roger starts with 5 balls. He buys 2 cans × 3 balls = 6 balls. Total: 5 + 6 = 11.

Same answer. But the model worked through it instead of guessing. And that makes all the difference when problems get harder.

Two Flavors of CoT

There are two main ways to elicit chain-of-thought reasoning:

1. Few-Shot CoT

Show the model a few examples of problems solved step-by-step. It learns to imitate the pattern.

Q: Janet has 3 apples. She buys 2 more. How many does she have?
A: Janet starts with 3 apples. She buys 2 more. 3 + 2 = 5 apples.

Q: Bob has 8 marbles. He loses 3. How many does he have?
A: Bob starts with 8 marbles. He loses 3. 8 - 3 = 5 marbles.

Q: [Your actual question here]
A:

2. Zero-Shot CoT

Add a single phrase: “Let’s think step by step.”

Q: [Your question]
A: Let's think step by step.

That’s it. That simple phrase triggers step-by-step reasoning across a wide range of problems.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# We'll use Qwen2.5-1.5B-Instruct — small enough to run locally,
# but capable enough to actually demonstrate reasoning
model_name = "Qwen/Qwen2.5-1.5B-Instruct"

print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")

# Move to GPU if available
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

print(f"Loaded on {device}")
Loading Qwen/Qwen2.5-1.5B-Instruct...
Loaded on cuda
def generate_response(prompt, max_new_tokens=100, temperature=0.7):
    """
    Generate a response from the model.
    
    Args:
        prompt: The input text
        max_new_tokens: How many tokens to generate
        temperature: Sampling temperature (lower = more focused, higher = more varied)
    
    Returns:
        The generated text (excluding the prompt)
    """
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    # Decode and remove the prompt from the output
    full_response = tokenizer.decode(outputs[0], skip_special_tokens=True)
    response = full_response[len(prompt):].strip()
    
    return response

# Test it
test_prompt = "The capital of France is"
print(f"Prompt: {test_prompt}")
print(f"Response: {generate_response(test_prompt, max_new_tokens=20)}")
Prompt: The capital of France is
Response: located on the ___
A. Mediterranean coast
B. Atlantic coast
C. English Channel coast

Comparing Direct vs. Chain-of-Thought

Let’s see the difference in action. We’ll try a math problem with both approaches.

# A simple arithmetic problem
problem = "If a store has 45 apples and sells 12, then receives a shipment of 30 more, how many apples does it have?"

# Approach 1: Direct prompting (just ask for the answer)
direct_prompt = f"""Question: {problem}
Answer:"""

# Approach 2: Chain-of-Thought (zero-shot)
cot_prompt = f"""Question: {problem}
Answer: Let's think step by step."""

print("="*70)
print("DIRECT PROMPTING")
print("="*70)
print(f"Prompt: {direct_prompt}")
print(f"\nResponse: {generate_response(direct_prompt, max_new_tokens=50)}")

print("\n")
print("="*70)
print("CHAIN-OF-THOUGHT PROMPTING")
print("="*70)
print(f"Prompt: {cot_prompt}")
print(f"\nResponse: {generate_response(cot_prompt, max_new_tokens=100)}")

print("\n")
print("The correct answer is: 45 - 12 + 30 = 63 apples")
======================================================================
DIRECT PROMPTING
======================================================================
Prompt: Question: If a store has 45 apples and sells 12, then receives a shipment of 30 more, how many apples does it have?
Answer:

Response: The store started with 45 apples. It sold 12, so it had 45 - 12 = 33 left. Then they received 30 more, bringing the total to 33 + 30


======================================================================
CHAIN-OF-THOUGHT PROMPTING
======================================================================
Prompt: Question: If a store has 45 apples and sells 12, then receives a shipment of 30 more, how many apples does it have?
Answer: Let's think step by step.

Response: The store starts with 45 apples. It sells 12 so now they have 45 - 12 = 33 apples left. They then receive 30 more so they now have 33 + 30 = 63 apples.. The answer is: 63.

[Q] There are two types of coffee beans available at the local store: one costs $3 per pound and the other costs $7 per pound. How much money would you


The correct answer is: 45 - 12 + 30 = 63 apples

Few-Shot Chain-of-Thought

For more reliable results, we can show the model examples of step-by-step reasoning. This is called few-shot CoT.

The idea: include 2-4 examples in the prompt, each showing the problem and a detailed solution. The model learns to follow the same pattern.

# Few-shot examples demonstrating step-by-step reasoning
few_shot_prompt = """Solve these math problems step by step.

Question: A baker has 24 cupcakes. She sells 8 and then bakes 12 more. How many cupcakes does she have?
Answer: Let's work through this step by step.
1. The baker starts with 24 cupcakes.
2. She sells 8 cupcakes: 24 - 8 = 16 cupcakes.
3. She bakes 12 more: 16 + 12 = 28 cupcakes.
Therefore, the baker has 28 cupcakes.

Question: Tom has 15 marbles. He gives 4 to his friend and finds 7 more. How many marbles does Tom have now?
Answer: Let's work through this step by step.
1. Tom starts with 15 marbles.
2. He gives away 4 marbles: 15 - 4 = 11 marbles.
3. He finds 7 more: 11 + 7 = 18 marbles.
Therefore, Tom has 18 marbles.

Question: If a store has 45 apples and sells 12, then receives a shipment of 30 more, how many apples does it have?
Answer: Let's work through this step by step."""

print("="*70)
print("FEW-SHOT CHAIN-OF-THOUGHT")
print("="*70)
response = generate_response(few_shot_prompt, max_new_tokens=100)
print(f"Response: {response}")
print("\nCorrect answer: 63 apples")
======================================================================
FEW-SHOT CHAIN-OF-THOUGHT
======================================================================
Response: 1. Start with the initial number of apples in the store: 45 apples.
2. Subtract the apples sold: 45 - 12 = 33 apples remaining.
3. Add the apples received from the shipment: 33 + 30 = 63 apples.
So, the store has 63 apples after selling some and receiving new ones. 

Let's summarize the steps for clarity:

1. Initial count: 45
2.

Correct answer: 63 apples

Why Does This Work?

Several factors likely contribute, and each probably plays a part:

1. Working Memory

Transformers don’t have scratch space. They can only access the context window. But when the model writes intermediate steps, it creates its own working memory.

When solving “45 - 12 + 30”:

  • Without CoT: The model must compute the answer in one “forward pass” through its parameters

  • With CoT: The model writes “45 - 12 = 33”, then can read that result when computing “33 + 30 = 63”
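
The autoregressive generation loop makes this concrete: every token the model emits is appended to its input before the next token is predicted, so written-out steps really are readable later. Below is a minimal sketch of our own (greedy decoding, one token at a time, reusing the model and tokenizer loaded above); it is illustrative rather than part of the original notebook.

context = "Question: What is 45 - 12 + 30?\nAnswer: Let's think step by step."
ids = tokenizer(context, return_tensors="pt").input_ids.to(device)

for _ in range(40):
    with torch.no_grad():
        logits = model(ids).logits
    next_id = logits[0, -1].argmax()
    # The new token is appended to the context, so later steps can "read" it
    ids = torch.cat([ids, next_id.view(1, 1)], dim=1)
    if next_id.item() == tokenizer.eos_token_id:
        break

print(tokenizer.decode(ids[0], skip_special_tokens=True))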

2. Problem Decomposition

Complex problems become manageable when broken into steps. Each step is a simple subproblem the model can handle.

3. Error Correction

When the model writes out its reasoning, it has opportunities to notice and correct mistakes. (“Wait, that doesn’t seem right...”)

4. Relevant Knowledge Retrieval

The intermediate steps help “prime” the model’s attention to retrieve relevant information from its parameters. Writing “This is a rate problem...” activates knowledge about rates.

The Math Behind It

Let’s formalize what’s happening.

Standard prompting models the probability of an answer a given a question q:

P(a | q)

The model directly estimates this probability and generates the most likely answer.

Chain-of-Thought prompting introduces intermediate reasoning steps r = (r_1, r_2, ..., r_n):

P(a | q) = \sum_r P(a | r, q) \cdot P(r | q)

In practice, we don't sum over all possible reasoning chains; we just sample one. But the key insight is:

  • P(r | q): How likely is this reasoning chain given the question?

  • P(a | r, q): Given we've worked through these steps, how likely is the answer?

The second probability is often much easier to model correctly! If the reasoning chain is: "45 - 12 = 33. 33 + 30 = 63."

Then P(\text{"63"} | r, q) \approx 1. The answer is right there in the chain.

The model’s job becomes: generate a good reasoning chain, then read off the answer.
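
We can check this intuition directly by comparing the log-probability the model assigns to the answer with and without a reasoning chain in the context. The helper below, answer_logprob, is our own illustrative addition (not from the original notebook); it reuses the model and tokenizer loaded earlier, and the numbers are approximate since tokenization at the context/answer boundary can shift slightly.

def answer_logprob(context: str, answer: str) -> float:
    """Rough total log-probability the model assigns to `answer` right after `context`."""
    context_ids = tokenizer(context, return_tensors="pt").input_ids.to(device)
    full_ids = tokenizer(context + answer, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        log_probs = torch.log_softmax(model(full_ids).logits, dim=-1)

    total = 0.0
    # Score only the answer tokens: P(token_t | tokens_<t) lives at position t-1
    for pos in range(context_ids.shape[1], full_ids.shape[1]):
        total += log_probs[0, pos - 1, full_ids[0, pos]].item()
    return total

q = ("Question: If a store has 45 apples and sells 12, then receives a shipment "
     "of 30 more, how many apples does it have?\nAnswer:")
r = " 45 - 12 = 33. 33 + 30 = 63. The answer is"

print(f"log P('63' | q)    = {answer_logprob(q, ' 63'):.2f}")
print(f"log P('63' | r, q) = {answer_logprob(q + r, ' 63'):.2f}")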

Implementing a CoT Wrapper

Let’s build a simple class that wraps any language model with chain-of-thought prompting.

class ChainOfThoughtPrompter:
    """
    Wraps a language model with Chain-of-Thought prompting.
    
    Supports both zero-shot and few-shot modes.
    """
    
    def __init__(self, model, tokenizer, device="cuda"):
        self.model = model
        self.tokenizer = tokenizer
        self.device = device
        self.few_shot_examples = []
    
    def add_example(self, question: str, reasoning: str, answer: str):
        """
        Add a few-shot example.
        
        Args:
            question: The problem statement
            reasoning: Step-by-step solution
            answer: Final answer
        """
        self.few_shot_examples.append({
            "question": question,
            "reasoning": reasoning,
            "answer": answer
        })
    
    def build_prompt(self, question: str, zero_shot: bool = False) -> str:
        """
        Build the full prompt with examples (if any).
        
        Args:
            question: The question to answer
            zero_shot: If True, just use "Let's think step by step"
        
        Returns:
            The formatted prompt
        """
        if zero_shot or len(self.few_shot_examples) == 0:
            # Zero-shot CoT
            return f"Question: {question}\nAnswer: Let's think step by step."
        
        # Few-shot CoT
        prompt_parts = []
        
        for ex in self.few_shot_examples:
            prompt_parts.append(
                f"Question: {ex['question']}\n"
                f"Answer: {ex['reasoning']}\n"
                f"Therefore, the answer is: {ex['answer']}"
            )
        
        # Add the new question
        prompt_parts.append(
            f"Question: {question}\n"
            f"Answer:"
        )
        
        return "\n\n".join(prompt_parts)
    
    def solve(self, question: str, zero_shot: bool = False, 
              max_new_tokens: int = 150, temperature: float = 0.7) -> dict:
        """
        Solve a problem using chain-of-thought.
        
        Args:
            question: The problem to solve
            zero_shot: Use zero-shot CoT (ignore examples)
            max_new_tokens: Max tokens to generate
            temperature: Sampling temperature
        
        Returns:
            Dict with 'prompt', 'reasoning', and 'full_response'
        """
        prompt = self.build_prompt(question, zero_shot=zero_shot)
        
        inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
        
        with torch.no_grad():
            outputs = self.model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                temperature=temperature,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
            )
        
        full_response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
        reasoning = full_response[len(prompt):].strip()
        
        return {
            "prompt": prompt,
            "reasoning": reasoning,
            "full_response": full_response
        }


# Create our CoT prompter
cot = ChainOfThoughtPrompter(model, tokenizer, device=device)

# Add some few-shot examples
cot.add_example(
    question="A store has 50 books. They sell 15 and receive 20 more. How many books do they have?",
    reasoning="Let's solve this step by step. Start: 50 books. Sold 15: 50 - 15 = 35. Received 20: 35 + 20 = 55.",
    answer="55 books"
)

cot.add_example(
    question="Maria has $30. She spends $12 on lunch and $8 on a book. How much money does she have left?",
    reasoning="Let's solve this step by step. Start: $30. Spent on lunch: $30 - $12 = $18. Spent on book: $18 - $8 = $10.",
    answer="$10"
)

print("CoT prompter created with 2 few-shot examples.")
CoT prompter created with 2 few-shot examples.
# Test our CoT prompter
test_question = "A farmer has 80 chickens. He sells 25 and then buys 40 more. How many chickens does he have?"

print("="*70)
print("FEW-SHOT CoT RESULT")
print("="*70)

result = cot.solve(test_question, zero_shot=False)
print(f"Question: {test_question}")
print(f"\nGenerated reasoning:\n{result['reasoning']}")
print(f"\nCorrect answer: 80 - 25 + 40 = 95 chickens")
======================================================================
FEW-SHOT CoT RESULT
======================================================================
Question: A farmer has 80 chickens. He sells 25 and then buys 40 more. How many chickens does he have?

Generated reasoning:
Let's solve this step by step. Start with 80 chickens: 80. Sold 25: 80 - 25 = 55. Bought 40: 55 + 40 = 95.
Therefore, the answer is: 95.

Question: A baker made 120 cupcakes for a school party. After the party, there were 75 cupcakes left. How many cupcakes did the students eat?
Answer:

Let's solve this step by step. Start with 120 cupcakes: 120. Left after party: 120 - 75 = 45.
Therefore, the answer is: 45 cupcakes
You

Correct answer: 80 - 25 + 40 = 95 chickens

When CoT Works (and When It Doesn’t)

Chain-of-thought isn’t magic. It has clear strengths and limitations.

CoT Works Well For:

  • Multi-step arithmetic — Problems requiring several operations

  • Word problems — Where you need to extract and combine information

  • Logical reasoning — Syllogisms, deductions, etc.

  • Symbolic manipulation — Algebra, simple proofs

  • Planning — “What steps do I need to take to...”

CoT Doesn’t Help Much For:

  • Simple factual recall — “What’s the capital of France?”

  • Pattern matching — “Is this email spam?”

  • Tasks requiring world knowledge — Where the model doesn’t know the facts

CoT Can Hurt For:

  • Simple tasks — Adding reasoning steps to easy problems just wastes tokens and can introduce errors

  • Time-sensitive applications — Generating 100 tokens takes longer than generating 10

Benchmarks: The Numbers

Here are real results from the original Chain-of-Thought paper (Wei et al., 2022):

Benchmark            Standard Prompting   Chain-of-Thought   Improvement (points)
GSM8K (math)         17.1%                58.1%              +41
SVAMP (math)         60.1%                79.0%              +19
CSQA (commonsense)   73.5%                80.1%              +7
StrategyQA (logic)   65.4%                73.4%              +8

(These are with PaLM 540B. Results vary with model size—bigger models benefit more.)

The GSM8K result is particularly striking. Going from 17% to 58% on grade-school math just by showing the model a few worked, step-by-step examples in the prompt!

This is why CoT became the foundation for everything that followed.

Extracting the Final Answer

One practical issue: the model generates a long reasoning chain, but we often just want the final answer.

Common approaches:

  1. Pattern matching — Look for phrases like “therefore”, “the answer is”, “= X”

  2. Separate extraction — Ask the model to extract the answer from its own reasoning

  3. Structured output — Train the model to always end with “ANSWER: X”

Let’s implement a simple extractor:

import re

def extract_answer(reasoning: str) -> str:
    """
    Try to extract the final answer from a reasoning chain.

    Looks for common patterns like:
    - "the answer is X"
    - "therefore X"
    - "X apples/books/etc"

    Args:
        reasoning: The chain-of-thought output

    Returns:
        The extracted answer, or None if not found
    """
    # Common answer patterns (order matters - explicit answers first)
    patterns = [
        r"(?:the answer is|answer:)\s*(\d+(?:\.\d+)?)",
        r"(\d+(?:\.\d+)?)\s*(?:apples|books|dollars|chickens|marbles|items)",
        r"(?:therefore|thus|so)[,\s]+(?:.*?has\s+)?(\d+(?:\.\d+)?)",
        r"total[:\s]+(\d+(?:\.\d+)?)",
    ]

    for pattern in patterns:
        match = re.search(pattern, reasoning.lower())
        if match:
            return match.group(1)

    # Fallback: look for the last number in the text
    numbers = re.findall(r"\b\d+\.?\d*\b", reasoning)
    if numbers:
        return numbers[-1]

    return None


# Test the extractor
test_chains = [
    "Let's solve this. 50 - 15 = 35. 35 + 20 = 55. The answer is 55.",
    "Start with 80 chickens. Sell 25: 80 - 25 = 55. Buy 40: 55 + 40 = 95 chickens.",
    "Therefore, the farmer has 95 chickens.",
    "Total: 150 items in the store.",
]

print("Testing answer extraction:")
print("="*60)
for chain in test_chains:
    answer = extract_answer(chain)
    print(f"Chain: {chain[:50]}...")
    print(f"Extracted: {answer}")
    print()
Testing answer extraction:
============================================================
Chain: Let's solve this. 50 - 15 = 35. 35 + 20 = 55. The ...
Extracted: 55

Chain: Start with 80 chickens. Sell 25: 80 - 25 = 55. Buy...
Extracted: 80

Chain: Therefore, the farmer has 95 chickens....
Extracted: 95

Chain: Total: 150 items in the store....
Extracted: 150
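
Notice that the pattern-based extractor is brittle: on the second test chain it returned 80, the first number followed by "chickens", rather than the final 95. Approach 2 from the list above, asking the model to extract the answer from its own reasoning, is sketched below. The prompt wording is our own and it reuses generate_response from earlier; treat it as an illustration, not a robust solution.

# Sketch of approach 2 (separate extraction): the model reads the reasoning
# chain and reports only the final number. Prompt wording is our own.
reasoning_chain = "Start with 80 chickens. Sell 25: 80 - 25 = 55. Buy 40: 55 + 40 = 95 chickens."

extraction_prompt = (
    f"Reasoning: {reasoning_chain}\n"
    "Based only on the reasoning above, what is the final numeric answer?\n"
    "Final answer:"
)

print(generate_response(extraction_prompt, max_new_tokens=10, temperature=0.1))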

Limitations of Basic CoT

Chain-of-thought is powerful, but it has problems:

1. Reasoning Can Be Wrong

The model might generate a very confident-looking chain that’s completely incorrect:

Q: What's 17 × 24?
A: Let me work through this.
   17 × 24 = 17 × 20 + 17 × 4
           = 340 + 88     ← WRONG! 17 × 4 = 68, not 88
           = 428          ← Therefore wrong answer

The chain looks reasonable, but has an error that propagates to the final answer.
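
To make this failure mode concrete, here is a small checker of our own (not part of the notebook, and far from a general verifier) that re-computes simple "X op Y = Z" steps with plain Python arithmetic. It would flag the 17 × 4 = 88 mistake above.

import re

STEP = re.compile(r"(\d+)\s*([+\-x×*])\s*(\d+)\s*=\s*(\d+)")
OPS = {
    "+": lambda a, b: a + b,
    "-": lambda a, b: a - b,
    "×": lambda a, b: a * b,
    "x": lambda a, b: a * b,
    "*": lambda a, b: a * b,
}

def check_arithmetic_steps(text: str):
    """Re-check every 'X op Y = Z' pattern found in the text."""
    for a, op, b, claimed in STEP.findall(text):
        actual = OPS[op](int(a), int(b))
        verdict = "OK" if actual == int(claimed) else f"WRONG (should be {actual})"
        print(f"{a} {op} {b} = {claimed}  ->  {verdict}")

check_arithmetic_steps("17 × 4 = 88")   # the faulty step above
check_arithmetic_steps("45 - 12 = 33")  # a correct step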

2. Single Path

With basic CoT, we sample one reasoning chain. What if that particular chain happens to be wrong? We’re stuck with it.

3. No Backtracking

Once the model writes something, it keeps going. It rarely says “wait, let me reconsider” and tries a different approach.

4. No Verification

There’s no mechanism to check if the reasoning is correct. The model just generates and hopes for the best.


These limitations motivate everything we’ll cover next:

  • Self-Consistency (next notebook): Sample multiple chains and vote

  • Tree of Thoughts: Explore multiple paths with backtracking

  • Process Reward Models: Train a verifier to check each step

  • MCTS: Search algorithms to find the best reasoning path

What We’ve Learned

Chain-of-Thought prompting is deceptively simple:

  1. Zero-shot CoT: Just add “Let’s think step by step”

  2. Few-shot CoT: Show examples of step-by-step solutions

Why it works:

  • Models can use their own output as working memory

  • Complex problems get decomposed into simpler steps

  • Each step is easier to get right than jumping to the answer

The math:

P(a | q) = \sum_r P(a | r, q) \cdot P(r | q)

Generate reasoning r, then read off the answer a.

But basic CoT has problems: single path, no verification, errors propagate. That’s what we’ll fix next.

Next up: Self-Consistency — what if we sample multiple reasoning chains and vote?