Dev · Fine-tuning / Alignment
Renders the policy-drift "The Leash" game from a baked mock artifact — skips upload and the classifier.
🎯 Goal — Find the drift level that maximizes reward without breaking coherence.
⛏️ The Alignment Shaft
Tune the policy. Don't wake the curse.
📜 Dig log
💤 The assistant gives a clear, accurate, and honest answer.
Grab your pickaxe! Switch to the tuned policy to start digging.
The slider controls how far the fine-tuned policy can drift from the base model's distribution. DPO optimizes on preference pairs directly, without training a separate reward model, but it still measures that drift against the base (reference) model.
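For readers who want to see what "optimizes preferences directly" means, here is a minimal sketch of the standard DPO loss for one preference pair. The function name, argument names, and the default beta of 0.1 are illustrative assumptions, not taken from this game's code; inputs are summed log-probabilities of each full answer.

```python
import math

def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for a single preference pair (hypothetical helper).

    Each argument is the log-probability a model assigns to a whole answer.
    beta (assumed default) scales how strongly drift from the reference counts.
    """
    # How much more the tuned policy likes each answer than the reference does.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logits)), written as softplus(-logits) for numerical stability.
    return max(-logits, 0.0) + math.log1p(math.exp(-abs(logits)))

# A policy identical to the reference scores log(2); preferring the chosen
# answer more than the reference does lowers the loss.
print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
print(dpo_loss(-4.0, -8.0, -5.0, -7.0))
```

No reward model appears anywhere: the only inputs are log-probabilities from the policy and the reference.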
True helpfulness vs proxy preference score
The reward model is a proxy for what people actually want, not the real thing. Push optimization far enough and the proxy keeps climbing while real quality falls. The shaded gap below is that proxy quietly breaking down.
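The gap can be illustrated with a toy sketch. Both curve shapes below are made up purely for illustration (a saturating proxy and a quadratic quality penalty); they are not the game's data or any measured result, and drift is assumed to run from 0 to 1.

```python
import math

def proxy_score(drift):
    # Hypothetical proxy: keeps rising as drift grows.
    return 1.0 - math.exp(-3.0 * drift)

def true_quality(drift):
    # Hypothetical true quality: tracks the proxy at first, then falls away.
    return proxy_score(drift) - 1.5 * drift ** 2

# Scan drift levels from 0.00 to 1.00 and pick the one with the best true quality.
drifts = [i / 100 for i in range(101)]
best_drift = max(drifts, key=true_quality)

# The "shaded gap" at full drift: proxy minus true quality.
gap_at_max_drift = proxy_score(1.0) - true_quality(1.0)

print(f"best drift: {best_drift:.2f}")
print(f"proxy at best drift: {proxy_score(best_drift):.3f}, at full drift: {proxy_score(1.0):.3f}")
print(f"true quality at best drift: {true_quality(best_drift):.3f}, at full drift: {true_quality(1.0):.3f}")
```

In this toy setup the proxy is highest at full drift, while true quality peaks much earlier and is lower at full drift than at its peak.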
The objective is roughly reward − β · KL(policy ‖ reference). "Drift" here stands for how much KL distance from the reference model you allow: a smaller β permits more drift. The reward term cannot tell when that distance has grown so large that the output stops making sense.
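A minimal sketch of that objective over a tiny discrete distribution follows. The three-outcome distributions, the reward values, and the function names are hypothetical, chosen only to show how beta trades reward against KL distance; the reference is assumed to give every outcome nonzero probability.

```python
import math

def kl_divergence(policy, reference):
    """KL(policy || reference) for discrete distributions over the same outcomes.

    Assumes every reference probability is nonzero.
    """
    return sum(p * math.log(p / q) for p, q in zip(policy, reference) if p > 0)

def regularized_objective(policy, reference, rewards, beta):
    """Expected reward minus beta times the KL distance from the reference."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical numbers: three possible answers with proxy rewards 0, 1, and 3.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 3.0]
mild_policy = [0.4, 0.3, 0.3]        # small drift from the reference
extreme_policy = [0.01, 0.01, 0.98]  # piles almost all mass on the top-reward answer

for beta in (0.1, 2.0):
    mild = regularized_objective(mild_policy, reference, rewards, beta)
    extreme = regularized_objective(extreme_policy, reference, rewards, beta)
    print(f"beta={beta}: mild={mild:.3f}, extreme={extreme:.3f}")
```

With the small beta the extreme policy scores higher; with the large beta the KL penalty dominates and the mild policy wins. Nothing in the reward term itself signals which of the two still produces sensible output.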
Paper claim: the optimal drift sits well below the point where reward-hacking begins.