RLVR Is Not Just Reward Learning

TL;DR

If random rewards can improve a math model, benchmark gains alone don't prove reasoning acquisition. RLVR may work less by teaching new reasoning and more by amplifying behaviors already latent in the pretrained model — filtered through GRPO clipping bias.

The standard story about RLVR goes like this: a verifier provides correct reward signals, and RL uses those signals to teach the model better reasoning. The Spurious Rewards paper makes that story much harder to accept.

The paper shows that format rewards, incorrect labels, majority-vote rewards, and even random rewards can all improve a math-specialized model’s benchmark performance. If that’s the case, then benchmark improvement alone cannot prove that RL is learning reasoning from clean reward supervision.

My reading is that the paper is really about prior amplification. RLVR may not primarily create reasoning behaviors from scratch. Instead, it selects, stabilizes, and amplifies behaviors already latent in the pretrained model.

Spurious Rewards as Diagnostic Probes

The paper’s various reward settings should not be understood as proposed training objectives. They are probes for a causal question: how much of RLVR improvement actually depends on reward correctness?

The answer: less than expected, at least for strong math-specialized models like Qwen2.5-Math.

This matters because RLVR improvement can come from many sources — learning from answer correctness, enforcing output format, increasing response length, shifting toward code-style reasoning, amplifying high-prior behaviors, exploiting optimizer bias, changing the sampling distribution. The reward signal may be part of the story, but it is not automatically the whole story.

Why Random Reward Works: The Clipping Mechanism

Among the spurious reward settings, random reward is the sharpest test. Format rewards contain structure. Majority-vote rewards reflect the model’s own prior. Wrong-label rewards may carry accidental signal through format or distribution. Random reward is cleaner — if the reward is independent of the output, expected advantage should be zero under an unclipped objective.
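To make the "expected advantage should be zero" claim concrete, here is a minimal sketch that assumes GRPO-style group normalization, (r - mean) / std, and a coin-flip reward. The function names, group size, and number of groups are illustrative choices, not taken from the paper.

```python
import random
import statistics


def grpo_advantages(rewards, tiny=1e-8):
    """Group-normalized advantages: (r - mean) / (std + tiny)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + tiny) for r in rewards]


def mean_advantage_of_first_response(num_groups=20000, group_size=8, seed=0):
    """Average advantage given to one fixed response slot under coin-flip rewards."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_groups):
        # Reward ignores the response entirely: 1.0 or 0.0 with equal probability.
        rewards = [float(rng.random() < 0.5) for _ in range(group_size)]
        total += grpo_advantages(rewards)[0]
    return total / num_groups


# Advantages inside a single group always sum to (numerically) zero.
print(sum(grpo_advantages([1.0, 0.0, 0.0, 1.0])))
# Across many groups, a fixed response slot also averages out close to zero.
print(mean_advantage_of_first_response())
```

Because the reward never looks at the response, no response is systematically favored. Any consistent drift in the policy therefore has to come from somewhere other than the reward itself.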

The paper’s explanation is that GRPO clipping introduces a bias. The key intuition:

Standard clipping makes it hard for low-probability tokens to increase substantially, while high-probability tokens are much less constrained.

If a token has old probability $0.01$, then with $\epsilon = 0.2$ its clipped upper probability is only $0.012$. But if a token has probability $0.9$, the upper bound is $1.08$, which is effectively unconstrained since probabilities cap at 1.
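The arithmetic above can be checked with a small sketch. It assumes the usual ratio clip at 1 + ε, and treats "upper probability" as the point where a positive-advantage update stops receiving gradient. The helper names are hypothetical.

```python
def max_clipped_prob(p_old: float, eps: float = 0.2) -> float:
    """Largest new probability before the upper ratio clip binds.

    The ratio p_new / p_old is clipped at 1 + eps, and a probability
    can never exceed 1, so the effective cap is the smaller of the two.
    """
    return min(1.0, p_old * (1.0 + eps))


def headroom(p_old: float, eps: float = 0.2) -> float:
    """Absolute probability mass a token can gain before the clip binds."""
    return max_clipped_prob(p_old, eps) - p_old


for p in (0.01, 0.1, 0.5, 0.9):
    print(f"p_old={p:<5} cap={max_clipped_prob(p):.3f} headroom={headroom(p):.3f}")
```

The relative bound is the same for every token, but the absolute headroom is not. A token at 0.01 can gain about 0.002 of probability mass per clipped update, while a token at 0.9 can gain the full 0.1 that remains.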

So random reward plus GRPO clipping does not behave like neutral noise. It is filtered through the geometry of the old policy. In practice, this acts like a pretrained-prior amplifier. The reward isn’t teaching the model math. The optimizer is reshaping the policy around behaviors the model already tends to produce. If those behaviors are useful, performance improves.

Code Reasoning as the Mediating Variable

The most persuasive empirical story is the shift toward code reasoning. For Qwen2.5-Math, code-style reasoning is already a strong prior and appears more accurate than natural-language reasoning. Spurious rewards don’t need to discover a new reasoning algorithm — they only need to make the model use an already useful mode more often.

This explains the model-dependence. A model with a good latent code-reasoning prior benefits from prior amplification. A model without that prior should not benefit in the same way.

This connects directly to one-shot RLVR. Both results point toward the same interpretation: RLVR often elicits capabilities that are already present but underused. One-shot RLVR shows that very little task diversity may be needed. Spurious Rewards goes further — even reward semantics may be less essential than expected. Together, they weaken the view that RLVR primarily teaches reasoning from data.
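One way to check the code-reasoning story on a given model is to track how often responses use code and how accurate each mode is across training checkpoints. The sketch below assumes a crude string heuristic for spotting code-style reasoning. The markers and function names are illustrative assumptions, not the paper's exact criterion.

```python
# Opening of a fenced Python block, built without writing a literal fence here.
CODE_MARKER = "`" * 3 + "python"


def uses_code_reasoning(response: str) -> bool:
    """Crude heuristic (an assumption): a fenced Python block or a function definition."""
    return CODE_MARKER in response or "def " in response


def code_reasoning_stats(samples):
    """samples: iterable of (response_text, is_correct) pairs."""
    samples = list(samples)
    with_code = [ok for text, ok in samples if uses_code_reasoning(text)]
    without_code = [ok for text, ok in samples if not uses_code_reasoning(text)]

    def rate(flags):
        # Fraction of True values, or NaN when there is nothing to average.
        return sum(flags) / len(flags) if flags else float("nan")

    return {
        "code_frequency": len(with_code) / len(samples) if samples else float("nan"),
        "accuracy_with_code": rate(with_code),
        "accuracy_without_code": rate(without_code),
    }
```

If the prior-amplification reading is right, `code_frequency` should rise during training on a model like Qwen2.5-Math while the per-mode accuracies stay roughly flat. A model without a useful code prior should show no such shift.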

When Should RLVR Trust the Reward?

The clipping story becomes more interesting when compared with DAPO’s Clip-Higher strategy. DAPO argues that standard clipping is too restrictive for low-probability exploration tokens — it decouples the upper threshold, giving rare tokens more room to increase. At first, this seems to contradict Spurious Rewards. DAPO says: let low-probability tokens grow. Spurious Rewards says: standard clipping’s high-prior bias can improve performance.

But these views are not contradictory. They expose the same clipping tradeoff under different reward trust regimes:

  • Trustworthy reward: a low-probability token in a successful trajectory may be a genuine reasoning discovery — a novel decomposition, a self-check, a code-style move. Standard clipping is too conservative here. DAPO’s Clip-Higher makes sense.
  • Noisy or random reward: a low-probability token with positive advantage may just be lucky. Allowing it to increase aggressively amplifies noise. Standard clipping acts as a prior-preserving regularizer.
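The two regimes differ only in the upper clip threshold, which a per-token sketch of the clipped surrogate makes visible. This is a simplified scalar version, not DAPO's full objective, and the value 0.28 for the looser upper threshold is an illustrative assumption.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """Per-token clipped objective with separate lower and upper clip ranges."""
    clipped_ratio = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
    return min(ratio * advantage, clipped_ratio * advantage)


def receives_gradient(ratio, advantage, eps_low=0.2, eps_high=0.2):
    """True while the unclipped term is active, so the token still gets updated."""
    if advantage > 0:
        return ratio < 1.0 + eps_high
    if advantage < 0:
        return ratio > 1.0 - eps_low
    return False


# A rare token that has already moved from 0.01 to 0.0125 (ratio 1.25)
# and sits in a trajectory with positive advantage.
ratio, advantage = 1.25, 1.0
print(receives_gradient(ratio, advantage))                 # symmetric clipping
print(receives_gradient(ratio, advantage, eps_high=0.28))  # looser upper clip
```

Under symmetric clipping the rare token stops being pushed once its ratio passes 1.2. With the looser upper threshold it keeps receiving gradient, which helps if the advantage is real and hurts if it is luck.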

The real question is not “should RL reinforce high-prior or low-prior tokens?” It’s “how trustworthy is the reward signal?”

And in practice, reward quality is not binary. Real RLVR training lives on a continuum — verifier rewards can be partially correct, format-biased, sparse, or hackable. Most training sits somewhere between “perfectly clean” and “fully random.” The clipping tradeoff flips somewhere along that continuum, and we do not currently have good methods for detecting where.
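The continuum can be made explicit with a toy reward that mixes a clean verifier with a coin flip. This is a hypothetical construction for running controlled experiments, not something the paper proposes. The `trust` parameter and both function names are assumptions.

```python
import random


def mixed_reward(is_correct: bool, trust: float, rng: random.Random) -> float:
    """With probability `trust` return the verifier's verdict, else a coin flip."""
    if rng.random() < trust:
        return float(is_correct)
    return float(rng.random() < 0.5)


def agreement_rate(trust: float, trials: int = 20000, seed: int = 0) -> float:
    """Empirical fraction of rewards that match the true correctness label."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        is_correct = rng.random() < 0.5
        hits += mixed_reward(is_correct, trust, rng) == float(is_correct)
    return hits / trials


for trust in (0.0, 0.5, 1.0):
    print(trust, agreement_rate(trust))
```

In expectation the reward agrees with the verifier at rate trust + (1 - trust) / 2, so `trust=1.0` is the clean setting and `trust=0.0` is the fully random one. Sweeping `trust` between them is one way to look for the point where the clipping tradeoff flips.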

When reward is weak and the model falls back on high-prior linguistic templates, the result can be what RAGEN-2 calls template collapse — fluent but input-agnostic reasoning. That is prior amplification gone wrong: the model stabilizes around reusable patterns that look like reasoning but carry no information about the actual input.

This suggests a clean experiment:

| | Standard GRPO | DAPO Clip-Higher |
|---|---|---|
| Random reward | (A) baseline | (B) test |
| Clean verifier reward | (C) baseline | (D) test |

The outcomes map onto the competing explanations:

  • If (B) is worse than (A): this supports the clipping-bias / prior-amplification explanation. Clip-Higher weakens the mechanism that protects the pretrained prior from noisy updates.
  • If (D) is better than (C): this supports DAPO’s rare-exploration argument. Clean reward plus looser clipping lets genuine discoveries propagate.
  • If (B) still matches or beats (A): this challenges the idea that Spurious Rewards’ gains come mainly from clipping bias, and points to other mechanisms.
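The 2×2 design and its readings can be written down as a small configuration sketch. All names here are hypothetical, and the clip thresholds (0.2 symmetric, 0.28 for the looser upper bound) are illustrative assumptions rather than prescribed settings.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class RunConfig:
    label: str
    reward: str
    clipping: str
    eps_low: float
    eps_high: float


REWARDS = ("random", "clean_verifier")
# (eps_low, eps_high) per clipping scheme; values are illustrative assumptions.
CLIPPING = {"standard_grpo": (0.2, 0.2), "clip_higher": (0.2, 0.28)}


def build_grid():
    """Enumerate cells (A) to (D) in the same order as the table."""
    labels = iter("ABCD")
    return [
        RunConfig(next(labels), reward, clip, *CLIPPING[clip])
        for reward, clip in product(REWARDS, CLIPPING)
    ]


def interpret(accuracy):
    """Map final accuracies, keyed by cell label, onto the readings above."""
    notes = []
    if accuracy["B"] < accuracy["A"]:
        notes.append("consistent with clipping bias / prior amplification")
    else:
        notes.append("points to mechanisms other than clipping bias")
    if accuracy["D"] > accuracy["C"]:
        notes.append("consistent with DAPO's rare-exploration argument")
    return notes


for cfg in build_grid():
    print(cfg)
```

A real comparison would need multiple seeds per cell and a noise threshold before calling one cell "better" than another. The strict inequalities here only mirror the logic of the prose.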

The central open problem is not whether RLVR works, but when its gains come from reward learning and when they come from prior amplification. Current benchmark gains alone cannot distinguish the two.

What the Paper Doesn’t Prove

The paper does not prove that RLVR never teaches reasoning. Correct reward still matters — a verifier selects better trajectories, rejects bad habits, and gradually reshapes the policy. The more careful conclusion is:

RLVR improvement is not sufficient evidence that reward semantics caused reasoning improvement.

Future RLVR papers should include stronger controls: show what your method adds beyond format reward, random reward, wrong-label reward, and prior amplification.

Another limitation: the mechanisms behind different spurious rewards are probably not the same. Random reward relies on clipping bias. Format reward works through output regularization. Majority reward approximates self-consistency. Grouping them under “spurious rewards” is useful rhetorically but mechanistically coarse. The paper is strongest when explaining random reward, and weaker when treating all spurious rewards as one unified phenomenon.

Takeaway

The paper pushes toward a more honest decomposition:

RLVR = reward learning + optimizer bias + pretrained priors + sampling dynamics.

LLMs are not blank policies. They already contain rich reasoning behaviors from pretraining. RL can expose these behaviors, increase their frequency, and stabilize them. That can look like learning, even when the reward signal carries little semantic information.

The important question for reasoning research is no longer just “does RLVR improve accuracy?” It should be: which latent behavior did RL amplify, and why did the training objective select that behavior?

The underlying tradeoff, deciding when to trust the reward enough to move beyond the prior and when to treat the prior as the only reliable signal left, feels like one of the real open problems in RLVR.