Budget Forcing and Wait Tokens

A striking result from early 2025: researchers trained a 32B model on just 1,000 examples and beat o1-preview on math benchmarks. Their technique: “budget forcing” to control how long the model “thinks.”

The key insight: reasoning ability is already in the model. Fine-tuning activates it—but that’s only half the story. The real power comes from forcing the model to keep thinking with “Wait” tokens at inference time.

The s1 Paper

The “s1: Simple Test-Time Scaling” paper (January 2025) demonstrated something surprising:

  1. Take a strong base model (Qwen2.5-32B-Instruct)

  2. Fine-tune on just 1,000 carefully selected reasoning examples

  3. At inference, force the model to keep thinking by appending “Wait”

Result: up to 27% better than o1-preview on competition math (MATH and AIME24)

The “Wait” trick is simple but effective. When the model tries to end its reasoning and commit to a final answer, we block the stop and append “Wait” so that it keeps reasoning. Often, it catches and fixes its own mistakes.

How Budget Forcing Works

Model: "Let me solve this. 5 + 7 = 12. The answer is 12."
                                                      ↑
                                            Model wants to stop

With budget forcing:

Model: "Let me solve this. 5 + 7 = 12. The answer is 12."
                                                      ↓
                                            We append: "Wait"
                                                      ↓
Model: "Wait, let me double-check. 5 + 7... yes, 12 is correct."

The “Wait” token induces doubt. The model reconsiders, often catching errors it would have missed.
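The control flow in the diagram can be sketched without a real model. In this minimal sketch, `budget_force`, `step`, and `toy_model` are hypothetical names introduced for illustration, and "tokens" are whitespace-separated words rather than real tokenizer output.

```python
from typing import Callable, List, Tuple


def budget_force(step: Callable[[List[str]], List[str]],
                 min_tokens: int,
                 max_waits: int = 3,
                 wait: str = "Wait,") -> Tuple[List[str], int]:
    """Call `step` until the trace has `min_tokens` tokens or `max_waits` is used up."""
    trace: List[str] = []
    n_waits = 0
    while True:
        trace += step(trace)      # the model "thinks" until it wants to stop
        if len(trace) >= min_tokens or n_waits >= max_waits:
            return trace, n_waits
        trace.append(wait)        # veto the stop and nudge it to reconsider
        n_waits += 1


def toy_model(trace: List[str]) -> List[str]:
    """Hypothetical stand-in for an LLM: answers first, then double-checks."""
    if not trace:
        return "5 + 7 = 12. The answer is 12.".split()
    return "let me double-check. 5 + 7 is 12. Confirmed.".split()


trace, n_waits = budget_force(toy_model, min_tokens=12)
print(" ".join(trace))
print("waits:", n_waits)
```

The `max_waits` cap is a design choice of this sketch: it keeps the loop from running forever if the model never produces enough tokens.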

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Optional
import re

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
print(f"Loading {model_name}...")

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

print(f"Loaded on {device}")
Loading Qwen/Qwen2.5-1.5B-Instruct...
Loaded on cuda
def generate_with_budget_forcing(prompt: str, 
                                  min_tokens: int = 50,
                                  max_tokens: int = 200,
                                  wait_token: str = " Wait,",
                                  end_markers: Optional[List[str]] = None) -> tuple[str, int]:
    """
    Generate with budget forcing.
    
    If the model tries to stop before min_tokens, append wait_token
    to encourage continued reasoning.
    
    Args:
        prompt: The input prompt
        min_tokens: Minimum tokens to generate (budget)
        max_tokens: Maximum tokens to generate
        wait_token: Token to append for continuation
        end_markers: Phrases that indicate final answer
    
    Returns:
        (generated_text, n_waits): the generated reasoning and the
        number of times wait_token was appended
    """
    if end_markers is None:
        end_markers = ["the answer is", "therefore", "final answer", "in conclusion"]
    
    generated_text = ""
    total_tokens = 0
    n_waits = 0
    
    while total_tokens < max_tokens:
        # Generate a chunk
        chunk_size = min(30, max_tokens - total_tokens)
        
        with torch.no_grad():
            current_input = tokenizer(prompt + generated_text, return_tensors="pt").to(device)
            outputs = model.generate(
                **current_input,
                max_new_tokens=chunk_size,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )
        
        new_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
        new_text = new_text[len(prompt) + len(generated_text):]
        generated_text += new_text
        total_tokens = len(tokenizer.encode(generated_text))
        
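        # Check if model is trying to conclude. Only the new chunk is
        # inspected: a marker from before an earlier wait_token would
        # otherwise count as "concluding" on every later iteration.
        # (A marker split across two chunks is missed by this check.)
        lower_text = new_text.lower()
        is_concluding = any(marker in lower_text for marker in end_markers)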
        
        # If concluding too early, append wait token
        if is_concluding and total_tokens < min_tokens:
            generated_text += wait_token
            n_waits += 1
            continue
        
        # If naturally concluded and past minimum, stop
        if is_concluding:
            break
        
        # If we hit EOS, check if we need to continue
        if outputs[0][-1] == tokenizer.eos_token_id:
            if total_tokens < min_tokens:
                generated_text += wait_token
                n_waits += 1
            else:
                break
    
    return generated_text, n_waits


# Test without budget forcing
prompt = "Question: What is 15 + 28?\nAnswer: Let me solve this."

print("Without budget forcing:")
print("="*60)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=50, temperature=0.7, do_sample=True, pad_token_id=tokenizer.eos_token_id)
short_response = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
print(f"Response: {short_response}")
print(f"Tokens: {len(tokenizer.encode(short_response))}")
Without budget forcing:
============================================================
Response:  The sum of 15 and 28 is:
$$
\begin{align}
15 &+ 28 \\
= &43
\end{align}
$$
Therefore, the answer is $\boxed{43}
Tokens: 50
# Test WITH budget forcing
print("\nWith budget forcing (min 50 tokens):")
print("="*60)

long_response, n_waits = generate_with_budget_forcing(
    prompt, 
    min_tokens=50,
    max_tokens=150
)

print(f"Response: {long_response}")
print(f"\nTokens: {len(tokenizer.encode(long_response))}")
print(f"Wait tokens inserted: {n_waits}")

With budget forcing (min 50 tokens):
============================================================
Response:  15 + 28 = 43
Therefore, the answer is 43.

Question: The temperature dropped from 60 Wait, wait... I'm not sure what a "Wait" means in this context. Can you please provide more information or clarify your question? It seems there

Tokens: 62
Wait tokens inserted: 1
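In the run above, the 30-token chunk overshot the conclusion: the model had already started a new "Question:" before "Wait," was appended, so the continuation went off track. One possible remedy, sketched here under the assumption that a conclusion ends at the end of its line, is to trim the text back to the concluding line before appending the wait token. `cut_after_conclusion` is a hypothetical helper, not part of the s1 code.

```python
from typing import List, Optional

END_MARKERS = ["the answer is", "therefore", "final answer", "in conclusion"]


def cut_after_conclusion(text: str,
                         end_markers: Optional[List[str]] = None) -> str:
    """Trim `text` to the end of the line that holds the first end marker."""
    markers = END_MARKERS if end_markers is None else end_markers
    lower = text.lower()
    hits = [lower.find(m) for m in markers if m in lower]
    if not hits:
        return text                       # no conclusion yet, keep everything
    line_end = text.find("\n", min(hits))
    return text if line_end == -1 else text[:line_end]


raw = (" 15 + 28 = 43\nTherefore, the answer is 43.\n\n"
       "Question: The temperature dropped from 60")
trimmed = cut_after_conclusion(raw)
print(repr(trimmed + " Wait,"))
```

With this trimming, "Wait," lands directly after the model's own answer instead of in the middle of an unrelated follow-up question.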

Why “Wait” Works

The s1 paper tested different continuation tokens:

| Token | AIME24 Accuracy |
|---|---|
| No continuation | 50.0% |
| “Hmm” | 50.0% |
| “Alternatively” | 50.0% |
| “Wait” | 53.3% |

“Wait” specifically induces doubt and reconsideration. It’s not just about generating more tokens—it’s about generating the right kind of additional reasoning.

The model doesn’t just continue; it questions itself.
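It helps to translate these percentages into problem counts. AIME24 has 30 problems, so each problem is worth about 3.3 percentage points. The short calculation below converts the table's accuracies back into solved problems.

```python
N_PROBLEMS = 30  # AIME24: two exams of 15 problems each

accuracy = {
    "No continuation": 50.0,
    "Hmm": 50.0,
    "Alternatively": 50.0,
    "Wait": 53.3,
}

# Convert each accuracy (in percent) into a whole number of solved problems
solved = {name: round(acc / 100 * N_PROBLEMS) for name, acc in accuracy.items()}

for name, count in solved.items():
    print(f"{name:16s} {count}/{N_PROBLEMS}")
```

The gap between “Wait” and the other options is a single problem, so this comparison is suggestive rather than conclusive on its own.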

# Compare different continuation tokens
continuation_tokens = [
    " Wait,",
    " Hmm,",
    " Let me think more.",
    " Actually,"
]

base_text = "Question: What is 7 × 8?\nAnswer: Let me calculate. 7 × 8 = 54. The answer is 54."

print("How different tokens affect continuation:")
print("="*60)

for token in continuation_tokens:
    prompt = base_text + token
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=40,
            temperature=0.5,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    
    continuation = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
    print(f"\n{token.strip()}")
    print(f"  → {continuation[:80]}...")
How different tokens affect continuation:
============================================================

Wait,
  →  I made a mistake in my previous response. Let me correct it.

The correct calcu...

Hmm,
  →  that's not right.
Question: What is 3 + 2?
Answer: Let me calculate. 3 + 2 = 5....

Let me think more.
  →  Seven times eight equals fifty-four.
You are an AI assistant that helps people ...

Actually,
  →  the correct calculation for 7 × 8 should be 56, not 54.
What is a more appropri...
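A single sampled continuation per token is anecdotal. One way to make the comparison less so, sketched here as an assumption rather than the paper's method, is to sample several continuations per token and count how often the correct product appears. `correction_rate` is a hypothetical helper, and the strings below are just the first lines of the outputs shown above.

```python
from typing import Dict, List


def correction_rate(samples: List[str], correct: str) -> float:
    """Fraction of continuations that mention the correct answer."""
    if not samples:
        return 0.0
    return sum(correct in s for s in samples) / len(samples)


# One continuation per token here; in practice, sample many per token.
samples_by_token: Dict[str, List[str]] = {
    "Wait,": [" I made a mistake in my previous response. Let me correct it."],
    "Hmm,": [" that's not right."],
    "Let me think more.": [" Seven times eight equals fifty-four."],
    "Actually,": [" the correct calculation for 7 × 8 should be 56, not 54."],
}

rates = {tok: correction_rate(s, "56") for tok, s in samples_by_token.items()}
for tok, rate in rates.items():
    print(f"{tok:20s} {rate:.0%}")
```

A substring check is crude (it would also match "560"), so a real evaluation would parse the final answer instead.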

Adaptive Budget Forcing

A smarter approach: adapt the budget based on problem difficulty.

  • Easy problem: Less thinking needed

  • Hard problem: Force more deliberation

The s1 paper found that problem difficulty correlates with optimal compute.

def estimate_difficulty(problem: str) -> float:
    """
    Simple heuristic to estimate problem difficulty.
    
    In practice, you might use a classifier or the model's
    own uncertainty estimates.
    """
    difficulty = 0.3  # Base difficulty
    
    # More numbers = harder
    numbers = re.findall(r'\d+', problem)
    difficulty += 0.1 * min(len(numbers), 5)
    
    # Larger numbers = harder
    if numbers:
        max_num = max(int(n) for n in numbers)
        if max_num > 100:
            difficulty += 0.2
        if max_num > 1000:
            difficulty += 0.2
    
    # Multiple operations = harder
    ops = len(re.findall(r'[+\-×÷*/]', problem))
    difficulty += 0.1 * min(ops, 3)
    
    # Keywords suggesting complexity
    if any(w in problem.lower() for w in ['percent', 'ratio', 'average', 'probability']):
        difficulty += 0.2
    
    return min(1.0, difficulty)


def adaptive_budget(problem: str, base_tokens: int = 30, 
                    max_tokens: int = 150) -> int:
    """
    Set token budget based on problem difficulty.
    """
    difficulty = estimate_difficulty(problem)
    budget = int(base_tokens + difficulty * (max_tokens - base_tokens))
    return budget


# Test on problems of varying difficulty
problems = [
    "What is 5 + 3?",
    "What is 15% of 80?",
    "If a train travels 120 miles in 2 hours, what is its average speed?",
    "A store has 1250 items. They sell 15% and receive 340 more. How many items do they have?",
]

print("Adaptive budget based on difficulty:")
print("="*60)

for problem in problems:
    diff = estimate_difficulty(problem)
    budget = adaptive_budget(problem)
    print(f"\nProblem: {problem[:50]}...")
    print(f"  Difficulty: {diff:.2f}")
    print(f"  Token budget: {budget}")
Adaptive budget based on difficulty:
============================================================

Problem: What is 5 + 3?...
  Difficulty: 0.60
  Token budget: 102

Problem: What is 15% of 80?...
  Difficulty: 0.50
  Token budget: 90

Problem: If a train travels 120 miles in 2 hours, what is i...
  Difficulty: 0.90
  Token budget: 138

Problem: A store has 1250 items. They sell 15% and receive ...
  Difficulty: 1.00
  Token budget: 150
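The surface heuristic rates "What is 5 + 3?" at 0.60, which shows its limits. The docstring above mentions the model's own uncertainty as an alternative; one hypothetical proxy is disagreement among several sampled answers. The sketch below assumes the answers were already sampled and extracted; the sample lists are made up for illustration.

```python
from collections import Counter
from typing import List


def disagreement_difficulty(answers: List[str]) -> float:
    """1 minus the share of the most common answer; 0.0 means all samples agree."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def budget_from_difficulty(difficulty: float, base_tokens: int = 30,
                           max_tokens: int = 150) -> int:
    """Same linear mapping as adaptive_budget, driven by a given difficulty."""
    return int(base_tokens + difficulty * (max_tokens - base_tokens))


easy_samples = ["8", "8", "8", "8"]              # made-up samples for "5 + 3"
hard_samples = ["1402", "1403", "1402", "1390"]  # made-up samples for the store problem

for name, samples in [("easy", easy_samples), ("hard", hard_samples)]:
    d = disagreement_difficulty(samples)
    print(f"{name}: difficulty={d:.2f}, budget={budget_from_difficulty(d)}")
```

The trade-off is cost: this estimate needs several short generations per problem before the main one starts.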

Training with Reasoning Traces

The s1 paper’s other key contribution: the s1K dataset.

They curated just 1,000 examples based on three criteria:

  1. Difficulty: Problems that require multi-step reasoning

  2. Diversity: Different types of problems (math, logic, etc.)

  3. Quality: Clear, correct reasoning traces

This tiny dataset was enough to activate reasoning capabilities already present in the base model.
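The three criteria can be read as a filtering pipeline. The sketch below is a toy illustration, not the paper's actual procedure: the field names (`trace_ok`, `difficulty`, `topic`), the threshold, and the round-robin sampling are all assumptions.

```python
import random
from collections import defaultdict
from typing import Dict, List


def select_examples(pool: List[Dict], k: int, min_difficulty: float,
                    seed: int = 0) -> List[Dict]:
    """Pick up to k examples: clean traces, hard enough, spread over topics."""
    # Quality: drop examples whose reasoning trace was flagged as bad
    clean = [ex for ex in pool if ex["trace_ok"]]
    # Difficulty: keep only problems above the threshold
    hard = [ex for ex in clean if ex["difficulty"] >= min_difficulty]
    # Diversity: take one example per topic in turn until k are chosen
    by_topic = defaultdict(list)
    for ex in hard:
        by_topic[ex["topic"]].append(ex)
    rng = random.Random(seed)
    for items in by_topic.values():
        rng.shuffle(items)
    topics = sorted(by_topic)
    selected: List[Dict] = []
    while len(selected) < k and any(by_topic[t] for t in topics):
        for t in topics:
            if by_topic[t] and len(selected) < k:
                selected.append(by_topic[t].pop())
    return selected


pool = [
    {"id": 1, "topic": "algebra", "difficulty": 0.9, "trace_ok": True},
    {"id": 2, "topic": "algebra", "difficulty": 0.8, "trace_ok": True},
    {"id": 3, "topic": "logic", "difficulty": 0.7, "trace_ok": True},
    {"id": 4, "topic": "logic", "difficulty": 0.2, "trace_ok": True},      # too easy
    {"id": 5, "topic": "geometry", "difficulty": 0.9, "trace_ok": False},  # bad trace
    {"id": 6, "topic": "geometry", "difficulty": 0.6, "trace_ok": True},
]

picked = select_examples(pool, k=3, min_difficulty=0.5)
print([(ex["id"], ex["topic"]) for ex in picked])
```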

# Example of what a good training example looks like
example_trace = """
Question: A store has 45 items. They sell 30% of them and receive 20 more.
How many items do they have now?

Let me solve this step by step.

Step 1: Calculate 30% of 45.
30% = 0.30
0.30 × 45 = 13.5

Wait, can you sell half an item? Let me reconsider.
Actually, 30% of 45 = 0.3 × 45 = 13.5
Since we can't sell half an item, let's round to 13 or 14.

Hmm, the problem might expect exact math. Let me continue with 13.5 
and see if we need to round at the end.

Step 2: Subtract items sold.
45 - 13.5 = 31.5 items remaining

Step 3: Add new items.
31.5 + 20 = 51.5 items

Rounding to a whole number: 51 or 52 items.

The answer is approximately 51-52 items (or exactly 51.5 if fractional items are allowed).
"""

print("Example reasoning trace for training:")
print("="*60)
print(example_trace)
print("\nKey features:")
print("- Step-by-step structure")
print("- Self-correction (Wait...)")
print("- Explicit uncertainty handling")
print("- Clear final answer")
Example reasoning trace for training:
============================================================

Question: A store has 45 items. They sell 30% of them and receive 20 more.
How many items do they have now?

Let me solve this step by step.

Step 1: Calculate 30% of 45.
30% = 0.30
0.30 × 45 = 13.5

Wait, can you sell half an item? Let me reconsider.
Actually, 30% of 45 = 0.3 × 45 = 13.5
Since we can't sell half an item, let's round to 13 or 14.

Hmm, the problem might expect exact math. Let me continue with 13.5 
and see if we need to round at the end.

Step 2: Subtract items sold.
45 - 13.5 = 31.5 items remaining

Step 3: Add new items.
31.5 + 20 = 51.5 items

Rounding to a whole number: 51 or 52 items.

The answer is approximately 51-52 items (or exactly 51.5 if fractional items are allowed).


Key features:
- Step-by-step structure
- Self-correction (Wait...)
- Explicit uncertainty handling
- Clear final answer

Results: s1 Performance

From the s1 paper:

| Model | MATH | AIME24 |
|---|---|---|
| Qwen2.5-32B-Instruct | 83.1% | 16.7% |
| + s1K fine-tuning | 88.2% | 50.0% |
| + budget forcing | 93.0% | 57.0% |
| o1-preview | 85.5% | 44.6% |

Key takeaways:

  1. Fine-tuning on just 1,000 examples dramatically improves reasoning

  2. Budget forcing adds another significant boost

  3. Simple methods can beat complex systems
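The 27% figure quoted earlier appears to be a relative gain rather than a difference in percentage points. A quick check using the numbers from the table above:

```python
results = {
    "s1 + budget forcing": {"MATH": 93.0, "AIME24": 57.0},
    "o1-preview": {"MATH": 85.5, "AIME24": 44.6},
}


def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100


gains = {}
for bench in ("MATH", "AIME24"):
    new = results["s1 + budget forcing"][bench]
    base = results["o1-preview"][bench]
    gains[bench] = relative_gain(new, base)
    print(f"{bench}: +{new - base:.1f} points absolute, "
          f"+{gains[bench]:.1f}% relative")
```

On these numbers, the AIME24 gap is about 12 points in absolute terms and about 28% in relative terms, while the MATH gap is much smaller, which is consistent with the headline figure being the best case.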

What We’ve Learned

Budget forcing is a simple but powerful technique:

  1. Core idea: Append “Wait” to prevent premature conclusions

  2. Why it works: Induces self-doubt and reconsideration

  3. Adaptive budgets: Harder problems → more thinking

  4. Minimal training: 1,000 examples can activate latent reasoning

The insight:

The model’s reasoning capabilities are largely present from pretraining. Fine-tuning merely activates these latent abilities.

This suggests that much of “reasoning” is about controlling existing capabilities, not creating new ones.

Next up: GRPO — training reasoning models with RL (no critic needed)