One-Shot RLVR Review: Policy Selection over Latent Reasoning

Here’s a result that should make you uncomfortable: train an LLM with RLVR (reinforcement learning with verifiable rewards) on a single math example, and it improves across a range of math benchmarks. Qwen2.5-Math-1.5B goes from ~36% to over 70% on MATH500, approaching what you’d get from training on a full dataset.

One example cannot possibly contain enough mathematical knowledge to explain this. So what’s actually happening?

The Interesting Interpretation

The obvious reading is “one example is enough to teach reasoning.” But that’s the wrong frame. A single training problem cannot carry broad mathematical content. The stronger interpretation is that RLVR is not primarily teaching new reasoning — it’s selecting over reasoning modes that already exist in the base model.

The base model already knows how to do chain-of-thought, how to set up equations, how to verify intermediate steps. It just doesn’t reliably choose to do these things. Its output distribution is spread across many behaviors: some good reasoning, some repetition loops, some formatting failures, some truncated attempts.

What one-shot RLVR does is provide a verifiable anchor: a single correctness constraint. Combined with stochastic sampling and an entropy term that keeps rollouts diverse, this creates enough reward contrast for policy gradient to shift probability mass toward the reasoning patterns that work. Not new knowledge. Redistribution of existing capability.

The Evidence

The training example isn’t hard. The chosen example ($\pi_1$) is one the base model can already partially solve. It often gets the reasoning steps right but then botches the final answer, breaks the formatting, or gets stuck in repetition. This weakens the “RL learns new reasoning from the example” story. The example is better understood as exposing instability in the model’s existing distribution.

Generalization is broad but suspicious. Performance improves across many math categories and even some non-math benchmarks. This makes more sense as a global behavioral shift — more structured output, fewer repetition loops, better answer formatting, longer sustained reasoning — than as mathematical learning transferred from one problem.

Post-saturation generalization. Training accuracy on the single example saturates within a few steps, but test accuracy keeps improving afterward. This is the paper’s most conceptually interesting finding. In ordinary supervised learning, once you’ve memorized the one training point, there’s no reason for generalization to improve. The authors argue that entropy keeps exploration alive: even at near-100% training accuracy, occasional wrong rollouts create reward variance, so GRPO still has non-zero gradient signal. That’s plausible for why gradients don’t vanish — but it doesn’t fully explain why those gradients improve general reasoning rather than merely overfitting.
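To make the variance argument concrete, here is a minimal sketch of a group-normalized advantage of the kind GRPO uses. The function name `group_advantages`, the group size of 8, the binary rewards, and the `eps` stabilizer are illustrative assumptions, not the paper's code.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantage: (r_i - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the rollout group (assumption)
    return [(r - mu) / (sigma + eps) for r in rewards]

saturated = [1.0] * 8        # every rollout on the single example is correct
almost = [1.0] * 7 + [0.0]   # one occasional wrong rollout

adv_saturated = group_advantages(saturated)
adv_almost = group_advantages(almost)

print(adv_saturated)                       # all zeros: no policy-gradient signal
print([round(a, 3) for a in adv_almost])   # non-zero: the wrong rollout is pushed down
```

With a saturated group every advantage is zero, so the policy-gradient term contributes nothing; a single wrong rollout restores a non-zero signal. The sketch says nothing about which direction generalization moves, which is exactly the part that remains unexplained.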

Format correction is a big confound. This is the most important caveat. The appendix shows that a format-only reward (no correctness checking) already produces a large improvement. Qwen2.5-Math models have substantial repetition and answer-extraction issues, so fixing \boxed{} formatting and cutting repeated tails account for a significant chunk of the benchmark gains. Outcome reward still outperforms format reward, so there’s a residual non-format effect, but the honest decomposition is:

1-shot RLVR = format correction + reasoning-mode selection + exploration effects.

Not pure reasoning emergence.
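The gap between a format-only reward and an outcome reward can be sketched in a few lines. This is a simplified illustration, not the paper's verifier: the function names, the sample completions, and the label "12.8" are made up, the regex ignores nested braces, and real verifiers check mathematical equivalence rather than string equality.

```python
import re

# Simplified pattern: matches a boxed answer without nested braces (assumption).
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")

def format_reward(completion: str) -> float:
    """1.0 if the completion contains any boxed answer, regardless of its value."""
    return 1.0 if BOXED.search(completion) else 0.0

def outcome_reward(completion: str, label: str) -> float:
    """1.0 only if the last boxed answer string-matches the label."""
    answers = BOXED.findall(completion)
    return 1.0 if answers and answers[-1].strip() == label.strip() else 0.0

# Hypothetical completions for a problem whose label is "12.8".
samples = {
    "right_and_boxed": r"So the total is \boxed{12.8}",
    "wrong_but_boxed": r"So the total is \boxed{4}",
    "right_unboxed": "So the total is 12.8",
}

for name, text in samples.items():
    print(name, format_reward(text), outcome_reward(text, "12.8"))
```

The format reward pays out for the wrong-but-boxed completion, while the outcome reward does not. That difference is the only place where correctness enters the training signal, which is why the format-only baseline is the right comparison for isolating the residual effect.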

Label correctness matters less than expected. Training with slightly wrong labels barely hurts. Completely unguessable wrong labels produce results similar to entropy-only training: the reward signal effectively disappears and you’re left with format/entropy effects. But guessable wrong labels can hurt, because the model may learn to produce the wrong answer. This suggests that the reward isn’t cleanly “teaching correctness.” It’s providing a selection pressure that happens to correlate with good reasoning.

The Mechanism

A compact picture:

The base model already contains many reasoning modes.
One verifiable example anchors correctness.
Sampling and entropy explore nearby reasoning trajectories.
Policy gradient upweights successful ones.
Global output behavior shifts toward more useful reasoning and format modes.

The single example is not a source of knowledge. It’s a selection pressure.
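The selection picture can be illustrated with a toy softmax policy over a handful of "reasoning modes". Everything here is hypothetical: the mode names, their success rates on the single example, the learning rate, and the use of the exact expected-reward gradient instead of sampled rollouts. It is a sketch of redistribution under a reward, not a model of the paper's training run.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical modes and made-up success rates on the one training example.
MODES = ["structured CoT", "sloppy CoT", "repetition loop", "truncated"]
SUCCESS = [0.9, 0.3, 0.0, 0.0]

def train(steps=200, lr=1.0):
    logits = [0.0] * len(MODES)  # base model: mass spread evenly across modes
    for _ in range(steps):
        probs = softmax(logits)
        baseline = sum(p * s for p, s in zip(probs, SUCCESS))  # expected reward
        # Exact gradient of expected reward for a softmax policy:
        # dJ/dlogit_k = p_k * (success_k - baseline)
        logits = [l + lr * p * (s - baseline)
                  for l, p, s in zip(logits, probs, SUCCESS)]
    return softmax(logits)

before = softmax([0.0] * len(MODES))
after = train()

for mode, b, a in zip(MODES, before, after):
    print(f"{mode:16s} {b:.2f} -> {a:.2f}")
```

No mode is added or removed during training. The only thing that changes is how much probability each pre-existing mode receives, which is the sense in which the example acts as a selection pressure.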

What I Think

The paper’s diagnostic contribution — showing that tiny-data RLVR produces surprisingly large gains and using this to probe what RLVR actually does — is strong. The mechanistic explanation is incomplete but provocative.

The main limitations:

Format correction explains a large fraction of the improvement, and the paper doesn’t cleanly separate it from reasoning gains.
Entropy-only training also produces gains for a few steps, suggesting that some of the effect is just perturbing the model out of bad output habits.
Post-saturation generalization is observed but not fully explained.
Results depend heavily on how much latent capability the base model already has. This is a story about Qwen2.5-Math — a model that was pre-trained on math. Running 1-shot RLVR on a model without strong math pretraining would likely show much smaller gains.

The most interesting open question is about sample selection: what makes a single example a good anchor for activating useful reasoning modes? The paper hints that good examples need enough CoT complexity and exploration space — simply training on the hardest substep performs worse. That suggests the best RLVR data may not be the hardest data, but the data that creates the richest useful variation under sampling.

This connects to a theme across recent RL-for-reasoning work: the training signal matters less for its content and more for its ability to create reward-discriminative sampling — enough contrast between successful and unsuccessful trajectories for policy gradient to do meaningful work. RAGEN-2’s reward-variance filtering is essentially the same insight from a different angle.
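One way to picture "reward-discriminative sampling" is to score candidate training examples by the variance of their rollout rewards. The sketch below is an illustration of that idea only; it is not the paper's selection procedure or RAGEN-2's filter, and the candidate names and reward lists are invented.

```python
from statistics import pvariance

def rank_by_reward_variance(rollout_rewards):
    """Sort example ids by the variance of their rollout rewards, highest first."""
    return sorted(
        rollout_rewards,
        key=lambda example_id: pvariance(rollout_rewards[example_id]),
        reverse=True,
    )

# Hypothetical 0/1 rewards from 8 sampled rollouts per candidate example.
candidates = {
    "always_solved": [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    "sometimes_solved": [1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 0.0, 1.0],
}

ranking = rank_by_reward_variance(candidates)
print(ranking)
```

Examples that are always solved or never solved have zero reward variance and give a group-relative method nothing to contrast, so they fall to the bottom. Variance alone does not capture whether the variation is useful, which is the harder part of the open question.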