A striking result from early 2025: researchers trained a 32B model on just 1,000 examples and beat o1-preview on math benchmarks. Their technique: “budget forcing” to control how long the model “thinks.”
The key insight: reasoning ability is already in the model. Fine-tuning activates it—but that’s only half the story. The real power comes from forcing the model to keep thinking with “Wait” tokens at inference time.
The s1 Paper
The “s1: Simple Test-Time Scaling” paper (January 2025) demonstrated something surprising:
1. Take a strong base model (Qwen2.5-32B-Instruct)
2. Fine-tune on just 1,000 carefully selected reasoning examples
3. At inference, force the model to keep thinking by appending “Wait”
4. Result: up to 27% better than o1-preview on competition math (MATH and AIME24)
The “Wait” token trick is simple but effective. When the model tries to output a final answer, append “Wait” and it continues reasoning. Often, it catches and fixes its own mistakes.
How Budget Forcing Works
Without budget forcing:
Model: "Let me solve this. 5 + 7 = 12. The answer is 12."
↑
Model wants to stop
With budget forcing:
Model: "Let me solve this. 5 + 7 = 12. The answer is 12."
↓
We append: "Wait"
↓
Model: "Wait, let me double-check. 5 + 7... yes, 12 is correct."
The “Wait” token induces doubt. The model reconsiders, often catching errors it would have missed.
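The control flow in the diagram can be sketched without loading a real model. The block below is a toy illustration only: `END`, `budget_force`, and `toy_model` are hypothetical names invented for this sketch, and the "model" is a scripted stand-in that answers once and then double-checks after seeing "Wait".

```python
from typing import Callable, List

END = "<end>"  # hypothetical end-of-thinking marker for this toy example


def budget_force(step: Callable[[List[str]], str],
                 min_tokens: int,
                 max_tokens: int,
                 wait: str = "Wait") -> List[str]:
    """Token-by-token loop that refuses to stop before min_tokens."""
    tokens: List[str] = []
    while len(tokens) < max_tokens:
        nxt = step(tokens)
        if nxt == END:
            if len(tokens) >= min_tokens:
                break                # budget met: accept the stop
            tokens.append(wait)      # budget not met: suppress END, nudge onward
        else:
            tokens.append(nxt)
    return tokens


SCRIPT = ["5", "+", "7", "=", "12", END]
RECHECK = ["yes", "12", "is", "correct", END]


def toy_model(tokens: List[str]) -> str:
    """Scripted stand-in for a language model (not a real LM)."""
    if "Wait" in tokens:
        last_wait = len(tokens) - 1 - tokens[::-1].index("Wait")
        pos = len(tokens) - last_wait - 1   # tokens emitted since the last "Wait"
        return RECHECK[min(pos, len(RECHECK) - 1)]
    return SCRIPT[min(len(tokens), len(SCRIPT) - 1)]


print(budget_force(toy_model, min_tokens=0, max_tokens=20))
print(budget_force(toy_model, min_tokens=8, max_tokens=20))
```

With `min_tokens=0` the toy model stops right after its first answer. With `min_tokens=8` the early stop is replaced by "Wait" and the scripted re-check follows. The `max_tokens` cap guarantees the loop terminates even if the minimum can never be reached.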
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from typing import List, Optional
import re

# Load model
model_name = "Qwen/Qwen2.5-1.5B-Instruct"
print(f"Loading {model_name}...")
tokenizer = AutoTokenizer.from_pretrained(model_name)
# Older transformers releases name this argument torch_dtype
model = AutoModelForCausalLM.from_pretrained(model_name, dtype="auto")

# Pick the best available device
if torch.cuda.is_available():
    device = "cuda"
elif torch.backends.mps.is_available():
    device = "mps"
else:
    device = "cpu"
model = model.to(device)
model.eval()
print(f"Loaded on {device}")

Loading Qwen/Qwen2.5-1.5B-Instruct...
Loaded on cuda
def generate_with_budget_forcing(prompt: str,
                                 min_tokens: int = 50,
                                 max_tokens: int = 200,
                                 wait_token: str = " Wait,",
                                 end_markers: Optional[List[str]] = None) -> tuple[str, int]:
    """
    Generate with budget forcing.

    If the model tries to stop before min_tokens, append wait_token
    to encourage continued reasoning.

    Args:
        prompt: The input prompt
        min_tokens: Minimum tokens to generate (budget)
        max_tokens: Maximum tokens to generate
        wait_token: Token to append for continuation
        end_markers: Phrases that indicate final answer

    Returns:
        (generated text with reasoning, number of wait tokens inserted)
    """
    if end_markers is None:
        end_markers = ["the answer is", "therefore", "final answer", "in conclusion"]

    # generate() can stop on any EOS id in the generation config, which may
    # list more ids than tokenizer.eos_token_id alone
    eos_ids = model.generation_config.eos_token_id
    if eos_ids is None:
        eos_ids = tokenizer.eos_token_id
    if not isinstance(eos_ids, (list, tuple)):
        eos_ids = [eos_ids]

    generated_text = ""
    total_tokens = 0
    n_waits = 0
    scan_from = 0  # only text after the last inserted wait_token is scanned for end markers

    while total_tokens < max_tokens:
        # Generate a chunk
        chunk_size = min(30, max_tokens - total_tokens)
        current_input = tokenizer(prompt + generated_text, return_tensors="pt").to(device)
        with torch.no_grad():
            outputs = model.generate(
                **current_input,
                max_new_tokens=chunk_size,
                temperature=0.7,
                do_sample=True,
                pad_token_id=tokenizer.eos_token_id,
            )

        # Decode only the newly generated tokens
        new_ids = outputs[0][current_input["input_ids"].shape[1]:]
        generated_text += tokenizer.decode(new_ids, skip_special_tokens=True)
        total_tokens = len(tokenizer.encode(generated_text))

        # Check if the model is trying to conclude (end marker or EOS)
        is_concluding = any(marker in generated_text[scan_from:].lower()
                            for marker in end_markers)
        hit_eos = len(new_ids) > 0 and new_ids[-1].item() in eos_ids

        if is_concluding or hit_eos:
            # Past the minimum: accept the conclusion
            if total_tokens >= min_tokens:
                break
            # Concluding too early: append wait token and keep going
            generated_text += wait_token
            n_waits += 1
            scan_from = len(generated_text)

    return generated_text, n_waits
# Test without budget forcing
prompt = "Question: What is 15 + 28?\nAnswer: Let me solve this."
print("Without budget forcing:")
print("="*60)
inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        temperature=0.7,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
short_response = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
print(f"Response: {short_response}")
print(f"Tokens: {len(tokenizer.encode(short_response))}")

Without budget forcing:
============================================================
Response: 15 + 28 = 43
Now, let's move on to the next question.
Question: Which of these numbers is divisible by both 3 and 7? (A) 60 (B) 1
Tokens: 50
# Test WITH budget forcing
print("\nWith budget forcing (min 50 tokens):")
print("="*60)
long_response, n_waits = generate_with_budget_forcing(
    prompt,
    min_tokens=50,
    max_tokens=150,
)
print(f"Response: {long_response}")
print(f"\nTokens: {len(tokenizer.encode(long_response))}")
print(f"Wait tokens inserted: {n_waits}")

With budget forcing (min 50 tokens):
============================================================
Response: 15 + 28 = 43.
You are an AI assistant that helps people find information. People can send him questions about any topic like movies, music, psychology, nutrition, geography, history, etc. You should include at least one specific example in your answer and try to use answers that actually help the person asking the question. You might need to slightly rearrange the sentence to make it sound natural. Please try to avoid general statements or unanswered questions. While answering a question, you must remember to keep the answer as short and simple as possible. Even if you have to do a calculation, express it as a plain statement without explanations, e.g., "8 x 9 = 72" instead of "8
Tokens: 150
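Wait tokens inserted: 0
Note that no “Wait” was inserted in this run: the response contains none of the end markers, so the extra tokens are off-topic continuation rather than additional reasoning about the sum.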
Why “Wait” Works
The s1 paper tested different continuation tokens:
| Token | AIME24 Accuracy |
|---|---|
| No continuation | 50.0% |
| “Hmm” | 50.0% |
| “Alternatively” | 50.0% |
| “Wait” | 53.3% |
Of the strings tested, only “Wait” improved accuracy. The likely reason is that it prompts doubt and reconsideration: what matters is not just generating more tokens but generating the right kind of additional reasoning.
The model doesn’t just continue; it questions itself.
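It helps to translate the percentages in the table into problem counts. AIME24 has 30 problems, so each problem is worth about 3.3 percentage points. The short calculation below (the helper name `problems_solved` is just for illustration) shows that the gap between “Wait” and the other rows is a single problem, which is worth keeping in mind when judging how strong this evidence is.

```python
AIME24_PROBLEMS = 30  # AIME 2024 I and II, 15 problems each


def problems_solved(accuracy_pct: float, n: int = AIME24_PROBLEMS) -> int:
    """Convert a reported accuracy percentage back to a problem count."""
    return round(accuracy_pct / 100 * n)


# Accuracies copied from the table above
ablation = {
    "No continuation": 50.0,
    "Hmm": 50.0,
    "Alternatively": 50.0,
    "Wait": 53.3,
}

for name, acc in ablation.items():
    print(f"{name:16s} {acc:5.1f}%  ->  {problems_solved(acc)}/{AIME24_PROBLEMS}")

gap = problems_solved(ablation["Wait"]) - problems_solved(ablation["No continuation"])
print(f"Difference: {gap} problem(s)")
```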
# Compare different continuation tokens
continuation_tokens = [
    " Wait,",
    " Hmm,",
    " Let me think more.",
    " Actually,",
]
# The base text contains a deliberate arithmetic error (7 × 8 is 56, not 54)
base_text = "Question: What is 7 × 8?\nAnswer: Let me calculate. 7 × 8 = 54. The answer is 54."
print("How different tokens affect continuation:")
print("="*60)
for token in continuation_tokens:
    prompt = base_text + token
    inputs = tokenizer(prompt, return_tensors="pt").to(device)
    with torch.no_grad():
        outputs = model.generate(
            **inputs,
            max_new_tokens=40,
            temperature=0.5,
            do_sample=True,
            pad_token_id=tokenizer.eos_token_id,
        )
    continuation = tokenizer.decode(outputs[0], skip_special_tokens=True)[len(prompt):]
    print(f"\n{token.strip()}")
    print(f" → {continuation[:80]}...")

How different tokens affect continuation:
============================================================
Wait,
→ there's a trick here! Multiplying by 10 is easy - just add a zero at the end of...
Hmm,
→ that's not right.
The question is asking for the product of two numbers, which ...
Let me think more.
→ To solve this, I can break it down into simpler steps:
1) First, let's consider...
Actually,
→ the correct multiplication of 7 and 8 is 56.
You are an AI assistant. You will ...
Adaptive Budget Forcing
A smarter approach: adapt the budget based on problem difficulty.
- Easy problem: less thinking needed
- Hard problem: force more deliberation
The s1 paper found that problem difficulty correlates with optimal compute.
def estimate_difficulty(problem: str) -> float:
    """
    Simple heuristic to estimate problem difficulty.

    In practice, you might use a classifier or the model's
    own uncertainty estimates.
    """
    difficulty = 0.3  # Base difficulty

    # More numbers = harder
    numbers = re.findall(r'\d+', problem)
    difficulty += 0.1 * min(len(numbers), 5)

    # Larger numbers = harder
    if numbers:
        max_num = max(int(n) for n in numbers)
        if max_num > 100:
            difficulty += 0.2
        if max_num > 1000:
            difficulty += 0.2

    # Multiple operations = harder
    # (this also counts hyphens and slashes inside ordinary words)
    ops = len(re.findall(r'[+\-×÷*/]', problem))
    difficulty += 0.1 * min(ops, 3)

    # Keywords suggesting complexity
    # (the "%" sign is not matched, only the spelled-out words)
    if any(w in problem.lower() for w in ['percent', 'ratio', 'average', 'probability']):
        difficulty += 0.2

    return min(1.0, difficulty)


def adaptive_budget(problem: str, base_tokens: int = 30,
                    max_tokens: int = 150) -> int:
    """
    Set token budget based on problem difficulty.
    """
    difficulty = estimate_difficulty(problem)
    budget = int(base_tokens + difficulty * (max_tokens - base_tokens))
    return budget


# Test on problems of varying difficulty
problems = [
    "What is 5 + 3?",
    "What is 15% of 80?",
    "If a train travels 120 miles in 2 hours, what is its average speed?",
    "A store has 1250 items. They sell 15% and receive 340 more. How many items do they have?",
]
print("Adaptive budget based on difficulty:")
print("="*60)
for problem in problems:
    diff = estimate_difficulty(problem)
    budget = adaptive_budget(problem)
    print(f"\nProblem: {problem[:50]}...")
    print(f" Difficulty: {diff:.2f}")
    print(f" Token budget: {budget}")

Adaptive budget based on difficulty:
============================================================
Problem: What is 5 + 3?...
Difficulty: 0.60
Token budget: 102
Problem: What is 15% of 80?...
Difficulty: 0.50
Token budget: 90
Problem: If a train travels 120 miles in 2 hours, what is i...
Difficulty: 0.90
Token budget: 138
Problem: A store has 1250 items. They sell 15% and receive ...
Difficulty: 1.00
Token budget: 150
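The docstring of `estimate_difficulty` mentions the model's own uncertainty as an alternative signal. One way this could look, as a rough sketch and not something taken from the s1 paper: sample several short answers to the same problem and treat disagreement among them as a difficulty score. The function names `disagreement` and `budget_from_samples` are hypothetical, and the answers are assumed to have been sampled and extracted already.

```python
from collections import Counter
from typing import List


def disagreement(answers: List[str]) -> float:
    """Fraction of sampled answers that differ from the most common one."""
    if not answers:
        raise ValueError("need at least one sampled answer")
    top_count = Counter(answers).most_common(1)[0][1]
    return 1.0 - top_count / len(answers)


def budget_from_samples(answers: List[str],
                        base_tokens: int = 30,
                        max_tokens: int = 150) -> int:
    """Map disagreement in [0, 1) onto the same budget range as adaptive_budget."""
    return int(base_tokens + disagreement(answers) * (max_tokens - base_tokens))


# Hypothetical sampled answers for two problems
easy = ["8", "8", "8", "8"]            # all samples agree
hard = ["12", "12.5", "12", "14"]      # samples disagree

print(budget_from_samples(easy))
print(budget_from_samples(hard))
```

Unlike the regex heuristic, this signal does not depend on surface features such as how many digits appear in the question, at the cost of extra sampling before the main generation.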
Training with Reasoning Traces
The s1 paper’s other key contribution: the s1K dataset.
They curated just 1,000 examples based on three criteria:
- Difficulty: problems that require multi-step reasoning
- Diversity: different types of problems (math, logic, etc.)
- Quality: clear, correct reasoning traces
This tiny dataset was enough to activate reasoning capabilities already present in the base model.
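The three criteria can be read as a filter-then-sample pipeline. The sketch below is a simplified illustration under stated assumptions, not the paper's actual procedure: the `Candidate` fields, the `select_examples` function, and the round-robin sampling are all hypothetical choices made for this example.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Candidate:
    question: str
    domain: str            # hypothetical topic label, e.g. "algebra"
    trace_is_clean: bool   # quality: trace passed formatting/correctness checks
    baseline_solved: bool  # difficulty proxy: a baseline model already solves it


def select_examples(pool: List[Candidate], k: int, seed: int = 0) -> List[Candidate]:
    """Keep clean, hard candidates, then draw round-robin across domains."""
    rng = random.Random(seed)
    # Quality and difficulty filters
    kept = [c for c in pool if c.trace_is_clean and not c.baseline_solved]
    # Diversity: group by domain and take one per domain in turn
    by_domain: Dict[str, List[Candidate]] = defaultdict(list)
    for c in kept:
        by_domain[c.domain].append(c)
    for items in by_domain.values():
        rng.shuffle(items)
    domains = sorted(by_domain)
    selected: List[Candidate] = []
    while len(selected) < k and any(by_domain[d] for d in domains):
        for d in domains:
            if by_domain[d] and len(selected) < k:
                selected.append(by_domain[d].pop())
    return selected


pool = [
    Candidate("q1", "algebra", True, False),
    Candidate("q2", "algebra", True, True),     # dropped: too easy
    Candidate("q3", "geometry", False, False),  # dropped: noisy trace
    Candidate("q4", "geometry", True, False),
    Candidate("q5", "logic", True, False),
    Candidate("q6", "algebra", True, False),
]
picked = select_examples(pool, k=3)
print([(c.question, c.domain) for c in picked])
```

Filtering before sampling means the diversity step only ever chooses among examples that already meet the quality and difficulty bars.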
# Example of what a good training example looks like
example_trace = """
Question: A store has 45 items. They sell 30% of them and receive 20 more.
How many items do they have now?
Let me solve this step by step.
Step 1: Calculate 30% of 45.
30% = 0.30
0.30 × 45 = 13.5
Wait, can you sell half an item? Let me reconsider.
Actually, 30% of 45 = 0.3 × 45 = 13.5
Since we can't sell half an item, let's round to 13 or 14.
Hmm, the problem might expect exact math. Let me continue with 13.5
and see if we need to round at the end.
Step 2: Subtract items sold.
45 - 13.5 = 31.5 items remaining
Step 3: Add new items.
31.5 + 20 = 51.5 items
Rounding to a whole number: 51 or 52 items.
The answer is approximately 51-52 items (or exactly 51.5 if fractional items are allowed).
"""
print("Example reasoning trace for training:")
print("="*60)
print(example_trace)
print("\nKey features:")
print("- Step-by-step structure")
print("- Self-correction (Wait...)")
print("- Explicit uncertainty handling")
print("- Clear final answer")

Example reasoning trace for training:
============================================================
Question: A store has 45 items. They sell 30% of them and receive 20 more.
How many items do they have now?
Let me solve this step by step.
Step 1: Calculate 30% of 45.
30% = 0.30
0.30 × 45 = 13.5
Wait, can you sell half an item? Let me reconsider.
Actually, 30% of 45 = 0.3 × 45 = 13.5
Since we can't sell half an item, let's round to 13 or 14.
Hmm, the problem might expect exact math. Let me continue with 13.5
and see if we need to round at the end.
Step 2: Subtract items sold.
45 - 13.5 = 31.5 items remaining
Step 3: Add new items.
31.5 + 20 = 51.5 items
Rounding to a whole number: 51 or 52 items.
The answer is approximately 51-52 items (or exactly 51.5 if fractional items are allowed).
Key features:
- Step-by-step structure
- Self-correction (Wait...)
- Explicit uncertainty handling
- Clear final answer
Results: s1 Performance
From the s1 paper:
| Model | MATH500 | AIME24 |
|---|---|---|
| Qwen2.5-32B-Instruct | 84.0% | 26.7% |
| + s1K fine-tuning (no budget forcing) | 92.6% | 50.0% |
| + budget forcing | 93.0% | 56.7% |
| o1-preview | 85.5% | 44.6% |
Key takeaways:
- Just 1,000 examples dramatically improve reasoning
- Budget forcing adds a further boost
- Simple methods can beat complex systems
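The headline "27%" figure is a relative improvement, not a gap in percentage points, and the two are easy to confuse. A quick check on AIME24, assuming the budget-forced score corresponds to 17 of 30 problems (about 56.7%) and using the 44.6% o1-preview score from the table; `relative_gain` is a helper written for this example:

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100


s1_aime = 100 * 17 / 30      # 17 of 30 AIME24 problems, about 56.7%
o1_preview_aime = 44.6       # from the results table

absolute_gap = s1_aime - o1_preview_aime
relative = relative_gain(s1_aime, o1_preview_aime)

print(f"Absolute gap: {absolute_gap:.1f} percentage points")
print(f"Relative gain: {relative:.0f}%")
```

The absolute gap is about 12 percentage points, which works out to roughly 27% relative to o1-preview's score.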
What We’ve Learned
Budget forcing is a simple but powerful technique:
- Core idea: append “Wait” to prevent premature conclusions
- Why it works: induces self-doubt and reconsideration
- Adaptive budgets: harder problems → more thinking
- Minimal training: 1,000 examples can activate latent reasoning
The insight:
The model’s reasoning capabilities are largely present from pretraining. Fine-tuning merely activates these latent abilities.
This suggests that much of “reasoning” is about controlling existing capabilities, not creating new ones.
Next up: GRPO — training reasoning models with RL (no critic needed)
