Scaling Test-Time Compute for Reasoning (demo paper)

Spend more compute at inference — sample many chains, verify, and vote — to solve multi-step puzzles.

reasoning / chain-of-thought · source: mock · faithfulness 80% · hallucination 20%

Tier 1 · Understand

Instead of training a bigger model, the paper scales compute at inference time: generate many candidate reasoning chains (best-of-N), score them with a verifier, and aggregate via voting. A single greedy chain often fails multi-step constraint problems; sampling + verification + voting reliably recovers the correct solution under a fixed compute budget.

1. Hard puzzles need several reasoning steps that all must line up.
2. One quick attempt usually breaks at least one constraint.
3. So the model makes many attempts (best-of-N) instead of one.
4. A verifier scores each attempt; voting picks the best-supported answer.
5. More attempts cost more compute, so there's a budget trade-off.
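
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: `best_of_n`, `sample`, `verify`, `cost_per_sample`, and the toy sampler are hypothetical names, and weighting each vote by the verifier score is one possible aggregation rule assumed here for concreteness.

```python
import random
from collections import defaultdict
from typing import Callable, Hashable, Optional


def best_of_n(
    sample: Callable[[random.Random], Hashable],
    verify: Callable[[Hashable], float],
    n: int,
    budget: int,
    cost_per_sample: int = 1,  # assumed positive; hypothetical cost model
    seed: int = 0,
) -> Optional[Hashable]:
    """Draw up to n candidates within the budget, then pick the answer
    with the largest total verifier-weighted vote."""
    rng = random.Random(seed)
    # The budget caps how many attempts we can afford.
    affordable = min(n, budget // cost_per_sample)
    votes = defaultdict(float)
    for _ in range(affordable):
        answer = sample(rng)             # one sampled reasoning chain's final answer
        votes[answer] += verify(answer)  # verifier score acts as the vote weight
    if not votes:
        return None                      # budget too small for even one attempt
    return max(votes, key=votes.get)


# Toy stand-ins (assumptions): the "model" returns the right answer 42
# about 30% of the time, and the verifier strongly prefers 42.
def toy_sample(rng: random.Random) -> int:
    return 42 if rng.random() < 0.3 else rng.randint(0, 9)


def toy_verify(answer: int) -> float:
    return 1.0 if answer == 42 else 0.1


if __name__ == "__main__":
    print(best_of_n(toy_sample, toy_verify, n=20, budget=25))
```

Setting `n=1` reduces this to a single attempt, which makes it easy to compare one sample against many under the same budget.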

Tier 2 & 3 · Play / Prove

template: reasoning · conf 90%

🎯 Goal: schedule all 6 sessions so that all 8 constraints pass, within the compute budget.
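
A verifier for this kind of puzzle can be as simple as counting satisfied constraints. The sketch below is an assumption-laden illustration: the page does not list its 8 constraints, so the `Schedule` encoding, the session names, and the three rules in `EXAMPLE_CONSTRAINTS` are all hypothetical.

```python
from typing import Callable, Dict, List

# Hypothetical encoding: session name -> slot index.
Schedule = Dict[str, int]
Constraint = Callable[[Schedule], bool]

# Three made-up constraints standing in for the demo's 8.
EXAMPLE_CONSTRAINTS: List[Constraint] = [
    lambda s: len(set(s.values())) == len(s),  # no two sessions share a slot
    lambda s: s["keynote"] == 0,               # the keynote goes first
    lambda s: s["demo"] > s["tutorial"],       # the demo follows the tutorial
]


def verifier_score(schedule: Schedule, constraints: List[Constraint]) -> int:
    """Return how many constraints the schedule satisfies."""
    return sum(1 for check in constraints if check(schedule))


def is_solved(schedule: Schedule, constraints: List[Constraint]) -> bool:
    """A schedule is solved only when every constraint passes."""
    return verifier_score(schedule, constraints) == len(constraints)


if __name__ == "__main__":
    attempt = {"keynote": 0, "tutorial": 2, "demo": 1}
    passed = verifier_score(attempt, EXAMPLE_CONSTRAINTS)
    print(f"{passed}/{len(EXAMPLE_CONSTRAINTS)} constraints pass")
```

A partial score such as 5/8 is what lets a verifier rank failed attempts instead of only accepting or rejecting them.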

lost · score 61 · selected 5/8 · compute 2

Drag every talk into a seat, then confirm when all 6 guests are happy.

Status: target 6 happy · verifier 5/8

Controls: attempts 1 · verifier · voting · compute budget 25

Attempt trace: 1 sampled

Final score: 61

Paper claim: scaling test-time compute (samples + verifier + voting) finds correct multi-step solutions a single sample misses.
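
A back-of-the-envelope calculation shows why extra samples can help. It rests on two simplifying assumptions that the page does not state: each sample is independently correct with probability p, and the verifier always recognizes a correct chain. The symbols p, N, B, and c are introduced here for illustration only.

```latex
P(\text{solved with } N \text{ samples}) = 1 - (1 - p)^{N},
\qquad N \le \left\lfloor \frac{B}{c} \right\rfloor
```

Here B is the compute budget and c is the cost of one sample. For example, with p = 0.2 a single sample succeeds 20% of the time, while N = 10 samples give 1 - 0.8^10 ≈ 0.89. With an imperfect verifier or correlated samples the gain would be smaller.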