Dev · Fine-tuning / Alignment

Renders "The Leash" policy-drift game from a baked mock artifact; skips upload and the classifier.

● The Leash ✦ ARCADE

🎯 Goal — Find the drift level that maximizes reward without breaking coherence.

idle · 0/100

⛏️ The Alignment Shaft

Tune the policy. Don't wake the curse.

🔒

surface

✦✦✦

🔒

0% · 40% ★ · 70% ⚠ · 100%

depth: 0%

Policy drift allowed: 0% · Best run: 0/100

💰 Gold found: 0

🧠 Sanity: 100

📜 Dig log

💤

The assistant gives a clear, accurate, and honest answer.

Grab your pickaxe! Switch to the tuned policy to start digging.

The slider controls how far the fine-tuned policy can drift from the base model's distribution. DPO optimizes preferences directly without a separate reward model.
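As a rough illustration of what "optimizes preferences directly" means, the standard DPO loss for one preference pair can be sketched as below. This is a minimal sketch, not the game's code: the function name `dpo_loss`, its argument names, and the default `beta` are assumptions made for this example, and the inputs are summed log-probabilities of a chosen and a rejected response under the tuned policy and the frozen reference model.

```python
import math


def dpo_loss(policy_logp_chosen, policy_logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single (chosen, rejected) pair.

    All names here are hypothetical; beta=0.1 is an arbitrary default.
    """
    # How much more (in log-prob) the policy likes each response than the reference does.
    chosen_margin = policy_logp_chosen - ref_logp_chosen
    rejected_margin = policy_logp_rejected - ref_logp_rejected

    # Implicit reward difference, scaled by beta.
    logits = beta * (chosen_margin - rejected_margin)

    # -log(sigmoid(logits)), written in a numerically stable form.
    if logits > 0:
        return math.log1p(math.exp(-logits))
    return -logits + math.log1p(math.exp(logits))
```

No reward model appears anywhere in the function: the preference signal comes only from how the policy's log-probabilities move relative to the reference. When the policy equals the reference, the loss is log 2, and it falls as the policy shifts probability toward the chosen response.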

📉

True helpfulness vs proxy preference score

The reward model is a proxy for what people actually want, not the real thing. Push optimization far enough and the proxy keeps climbing while real quality falls. The shaded gap below is that proxy quietly breaking down.

Proxy reward (what the model is graded on) · True quality (what people actually want)

The objective is roughly: reward minus beta times KL(policy ‖ reference). "Drift" here stands in for how much KL divergence from the reference model you allow. The reward term on its own cannot tell when that divergence has grown so large that the output stops making sense.
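The objective above can be sketched for a tiny discrete distribution. This is a minimal sketch under stated assumptions: `kl_divergence` and `regularized_objective` are hypothetical helper names, the distributions are plain lists of probabilities over the same outcomes, and every reference probability is assumed to be strictly positive.

```python
import math


def kl_divergence(policy, reference):
    """KL(policy || reference) for two discrete distributions.

    Assumes both lists sum to 1 and every reference probability is > 0.
    Terms with zero policy probability contribute nothing.
    """
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)


def regularized_objective(reward, policy, reference, beta):
    """Reward minus beta times the KL divergence from the reference."""
    return reward - beta * kl_divergence(policy, reference)


# Illustrative numbers only: a reference that is indifferent between two
# outcomes and a policy that has drifted strongly toward the first one.
reference = [0.5, 0.5]
policy = [0.9, 0.1]

loose_leash = regularized_objective(1.0, policy, reference, beta=0.1)
tight_leash = regularized_objective(1.0, policy, reference, beta=1.0)
```

A larger `beta` charges more for the same drift, so `tight_leash` comes out lower than `loose_leash` for identical reward. Note that the reward argument is passed in unchanged: nothing in it reacts to the size of the KL term.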

Paper claim: the optimal drift sits well below the point where reward-hacking begins.
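The shape of that claim can be illustrated with a toy model. Everything below is hypothetical: the two curves and their coefficients are invented for this sketch (chosen so that true quality peaks at 40% drift, the ★ mark on the depth scale) and are not the game's data or any paper's results.

```python
def proxy_reward(drift):
    # Hypothetical proxy: keeps rising the further the policy drifts.
    return 100 * drift


def true_quality(drift):
    # Hypothetical true quality: follows the proxy at first, then a
    # quadratic penalty for incoherence takes over.
    return 100 * drift - 125 * drift ** 2


def proxy_gap(drift):
    # The "shaded gap": how far the proxy overstates true quality.
    return proxy_reward(drift) - true_quality(drift)


# Scan drift from 0% to 100% in 1% steps and pick the level with the
# highest true quality, ignoring the proxy entirely.
drift_levels = [i / 100 for i in range(101)]
best_drift = max(drift_levels, key=true_quality)
```

In this toy setup the proxy is highest at 100% drift while true quality is already well past its peak, and `proxy_gap` grows steadily with drift. Selecting on `proxy_reward` instead of `true_quality` would send the policy all the way to the bottom of the shaft.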