Neural Arcade · Cal Hacks · Ddoski's Lab

Scaling Test-Time Compute for Reasoning (demo paper)

Spend more compute at inference — sample many chains, verify, and vote — to solve multi-step puzzles.

reasoning / chain-of-thought · source: mock · faithfulness 80% · hallucination 20%

Tier 1 · Understand

Instead of training a bigger model, the paper scales compute at inference time: generate many candidate reasoning chains (best-of-N), score them with a verifier, and aggregate via voting. A single greedy chain often fails multi-step constraint problems; sampling + verification + voting reliably recovers the correct solution under a fixed compute budget.

  1. Hard puzzles need several reasoning steps that all must line up.
  2. One quick attempt usually breaks at least one constraint.
  3. So the model makes many attempts (best-of-N) instead of one.
  4. A verifier scores each attempt; voting picks the best-supported answer.
  5. More attempts cost more compute, so there's a budget trade-off.
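
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation: the toy four-session puzzle, its constraints, and the names `best_of_n`, `verifier_score`, and `random_sampler` are all hypothetical. It also assumes each sample costs one unit of compute, and that "verify, then vote" means keeping the candidates with the top verifier score and taking a majority vote among them.

```python
import random
from collections import Counter

# Hypothetical toy puzzle: place 4 sessions into 4 distinct slots (0..3).
SESSIONS = ("A", "B", "C", "D")

# Illustrative constraints; each maps a {session: slot} dict to True/False.
CONSTRAINTS = [
    lambda s: s["A"] < s["B"],            # A runs before B
    lambda s: s["C"] != 0,                # C is not in the first slot
    lambda s: abs(s["A"] - s["D"]) > 1,   # A and D are not adjacent
    lambda s: s["B"] == 3,                # B takes the last slot
]


def verifier_score(candidate):
    """Return the fraction of constraints a candidate satisfies (0.0 to 1.0)."""
    schedule = dict(zip(SESSIONS, candidate))
    return sum(check(schedule) for check in CONSTRAINTS) / len(CONSTRAINTS)


def random_sampler(seed=0):
    """Stand-in for sampling one reasoning chain: returns a random schedule."""
    rng = random.Random(seed)

    def sample():
        slots = list(range(len(SESSIONS)))
        rng.shuffle(slots)
        return tuple(slots)  # slots listed in SESSIONS order

    return sample


def best_of_n(sample, verify, n, budget):
    """Draw up to n candidates within the budget, verify them, then vote."""
    attempts = min(n, budget)  # assumption: one sample costs one compute unit
    if attempts < 1:
        raise ValueError("need at least one attempt")
    candidates = [sample() for _ in range(attempts)]
    scores = {c: verify(c) for c in set(candidates)}
    top = max(scores.values())
    # Vote only among the candidates the verifier rated highest.
    votes = Counter(c for c in candidates if scores[c] == top)
    answer, _ = votes.most_common(1)[0]
    return answer, top, attempts
```

Calling `best_of_n(random_sampler(), verifier_score, n=1, budget=25)` mimics a single attempt, while raising `n` toward the budget gives the verifier more candidates to choose from. A returned score of 1.0 means every constraint passed.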

Concept map

Multi-step Puzzle · Single Sample · Best-of-N Sampling · Verifier · Voting · Compute Budget · Correct Solution

Ask the paper

💬 Ask the paper — answers are grounded in its text, with sections cited

Tier 2 & 3 · Play / Prove

template: reasoning · conf 90%

🎯 Goal — Schedule all 6 sessions so all 8 constraints pass — under the compute budget.

lost · score 61 · selected 5/8 · compute 2

Drag every talk into a seat, then confirm when all 6 guests are happy.

Target 6 happy · Verifier 5/8 · Compute 2 · Score 61
Attempts 1 · Verifier · Voting · Compute budget 25
Attempt trace 1 sampled · Final score 61

Paper claim: scaling test-time compute (samples + verifier + voting) finds correct multi-step solutions a single sample misses.
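
A rough way to see why extra samples help, under simplifying assumptions that are not taken from the source: attempts are independent, each one is fully correct with probability $p$, and the verifier recognises any correct attempt. The chance that at least one of $N$ attempts is correct is then:

```latex
P(\text{solved within } N \text{ attempts}) = 1 - (1 - p)^{N}
% Illustrative values only: p = 0.1, N = 25
% 1 - 0.9^{25} \approx 1 - 0.072 = 0.928
```

With these illustrative numbers, a single attempt succeeds 10% of the time, while 25 attempts succeed about 93% of the time. Real samples are correlated and real verifiers are imperfect, so the gain in practice would be smaller.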