Template Collapse in Agentic RL

TL;DR

An LLM agent can produce fluent, diverse reasoning that stops depending on the input. Entropy won't catch it — you need to measure input-dependence directly. The paper's diagnostic is more valuable than its fix.

RAGEN-2 introduces a failure mode called template collapse: an RL-trained LLM agent keeps producing fluent, diverse reasoning traces, but those traces stop depending on the input. Entropy stays high. Performance degrades. Standard monitoring misses it entirely.

The paper’s diagnostic contribution is genuinely useful. Its optimization story is less convincing. Here’s why.

The Core Idea: Entropy Is the Wrong Comfort Signal

When we train LLM agents with RL, we often monitor reasoning entropy. High entropy means the model is still exploring, still producing varied outputs. That sounds reassuring. RAGEN-2 argues it can be misleading.

The key decomposition is:

$$H(Z) = I(X; Z) + H(Z \mid X)$$

where $X$ is the input (prompt, environment state), $Z$ is the reasoning trace, $H(Z \mid X)$ is within-input diversity (how varied the reasoning is for the same input), and $I(X; Z)$ is cross-input distinguishability (whether the reasoning carries information about which input produced it).

The failure: entropy $H(Z)$ can remain high because $H(Z \mid X)$ remains high (the model still produces varied text) while $I(X; Z)$ collapses. The reasoning no longer reflects the input. It becomes a reusable template: fluent, diverse-looking, but input-agnostic.
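To make the decomposition concrete, here is a toy sketch of my own (not from the paper). It assumes a tiny discrete world with two inputs and four possible reasoning traces; the `decompose` helper and the `grounded` / `template` joint distributions are made up for illustration. Both policies have the same total entropy, but only one carries any information about the input.

```python
import math

def entropy(probs):
    """Shannon entropy in bits of a probability vector."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def decompose(joint):
    """joint[i][k] = p(X = x_i, Z = z_k).

    Returns (H(Z), I(X;Z), H(Z|X)) in bits, using I(X;Z) = H(Z) - H(Z|X).
    """
    p_x = [sum(row) for row in joint]
    p_z = [sum(col) for col in zip(*joint)]
    h_z = entropy(p_z)
    # H(Z|X) = sum_i p(x_i) * H(Z | X = x_i)
    h_z_given_x = sum(
        px * entropy([p / px for p in row])
        for px, row in zip(p_x, joint)
        if px > 0
    )
    return h_z, h_z - h_z_given_x, h_z_given_x

# Hypothetical grounded policy: each input uses its own pair of traces.
grounded = [[0.25, 0.25, 0.0, 0.0],
            [0.0, 0.0, 0.25, 0.25]]

# Hypothetical collapsed policy: every input spreads over the same four traces.
template = [[0.125, 0.125, 0.125, 0.125],
            [0.125, 0.125, 0.125, 0.125]]

h_g, i_g, c_g = decompose(grounded)   # H(Z) = 2 bits, I(X;Z) = 1 bit, H(Z|X) = 1 bit
h_t, i_t, c_t = decompose(template)   # H(Z) = 2 bits, I(X;Z) = 0 bits, H(Z|X) = 2 bits
```

An entropy monitor sees 2 bits in both cases. Only the mutual-information term separates them.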

This is the most valuable idea in the paper. It shifts the diagnostic question from “is the model still producing diverse text?” to “does the text still contain information about the environment state?”

How They Measure It

RAGEN-2 proposes a practical measurement: in-batch cross-scoring. Given $P$ prompts and $G$ reasoning samples per prompt, take each reasoning trace generated from prompt $X_i$ and score its likelihood under every other prompt $X_j$ in the batch.

This produces two scores:

  • Matched score: how likely is a reasoning trace under its original prompt? (Should be high.)
  • Marginal score: how likely is it across all prompts in the batch? (Should be lower, if the reasoning is input-specific.)

If the gap between matched and marginal is large, the reasoning is specific to its input. If the gap is small, the reasoning is generic — it works equally well under any prompt.
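A minimal sketch of the matched-minus-marginal gap, assuming you already have a matrix of cross-scores. The layout `cross_logp[t][j]` (log-likelihood of trace `t` under prompt `j`), the `source` index list, and the choice to average the marginal uniformly over all prompts in the batch (including the matched one) are my assumptions, not necessarily the paper's exact estimator.

```python
import math

def logmeanexp(values):
    """Numerically stable log of the mean of exp(values)."""
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values) / len(values))

def matched_minus_marginal(cross_logp, source):
    """Mean gap (in nats) between matched and marginal log-likelihood.

    cross_logp[t][j]: log-likelihood of trace t scored under prompt j.
    source[t]: index of the prompt that actually generated trace t.
    """
    gaps = []
    for row, src in zip(cross_logp, source):
        matched = row[src]            # score under the original prompt
        marginal = logmeanexp(row)    # score averaged over all batch prompts
        gaps.append(matched - marginal)
    return sum(gaps) / len(gaps)

# Hypothetical scores for P = 2 prompts, one trace each.
source = [0, 1]
grounded_scores = [[-10.0, -30.0],   # trace 0 is far more likely under prompt 0
                   [-28.0, -9.0]]    # trace 1 is far more likely under prompt 1
template_scores = [[-10.0, -10.0],   # trace 0 is equally likely under both prompts
                   [-9.0, -9.0]]

gap_grounded = matched_minus_marginal(grounded_scores, source)
gap_template = matched_minus_marginal(template_scores, source)
```

With a uniform marginal over $P$ prompts, the gap cannot exceed $\log P$ (here $\log 2 \approx 0.69$ nats), so it is worth reading the number relative to that ceiling rather than in isolation.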

They also define Retrieval-Acc: given a reasoning trace, can we retrieve the original prompt that produced it? This is the most intuitive metric. If reasoning is input-grounded, matching it back to the correct prompt should be easy. If reasoning is generic, retrieval drops to chance.
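Here is a rough sketch of how Retrieval-Acc could be computed from the same kind of cross-score matrix. I am assuming retrieval means "pick the prompt under which the trace is most likely"; the function name, the matrix layout, and the example numbers are all hypothetical.

```python
def retrieval_acc(cross_logp, source):
    """Fraction of traces whose highest-scoring prompt is their true source.

    cross_logp[t][j]: log-likelihood of trace t scored under prompt j.
    source[t]: index of the prompt that actually generated trace t.
    Ties are broken toward the lowest prompt index.
    """
    hits = 0
    for row, src in zip(cross_logp, source):
        predicted = max(range(len(row)), key=lambda j: row[j])
        hits += int(predicted == src)
    return hits / len(source)

# Hypothetical batch: P = 3 prompts, G = 2 traces per prompt.
source = [0, 0, 1, 1, 2, 2]

grounded_scores = [
    [-5.0, -9.0, -12.0],
    [-6.0, -11.0, -10.0],
    [-14.0, -4.0, -9.0],
    [-8.0, -7.0, -13.0],
    [-10.0, -12.0, -5.0],
    [-9.0, -6.0, -7.0],    # this trace is mis-retrieved as prompt 1
]

# Collapsed case: every trace scores the same under every prompt.
template_scores = [[-8.0, -8.0, -8.0] for _ in source]

acc_grounded = retrieval_acc(grounded_scores, source)   # 5 of 6 correct
acc_template = retrieval_acc(template_scores, source)   # 2 of 6 correct, i.e. chance (1/3)
```

Chance level is $1/P$, so the number to watch is how far accuracy sits above that, not the raw value.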

An important limitation: Retrieval-Acc measures input specificity, not correctness. A reasoning trace can be highly specific but wrong — for instance, a VLM that hallucinates a specific spatial configuration that happens to match a different board state. So low Retrieval-Acc signals template collapse, but high Retrieval-Acc doesn’t guarantee faithful reasoning.

Where I’m Less Convinced: SNR as Explanation

Section 3 of the paper tries to explain why template collapse happens. The proposed mechanism is signal-to-noise ratio: if reward variance within a prompt is low, the task gradient is weak. Regularization dominates. The model drifts toward reusable templates.

The proposed fix, SNR-Aware Filtering, follows directly: sample multiple rollouts per prompt, compute reward variance, and prioritize high-variance prompts for training.
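As a rough illustration of the idea (not the paper's implementation), reward-variance filtering can be sketched in a few lines. The top-fraction rule, the `keep_fraction` parameter, and the prompt names are my assumptions; the actual method may use a threshold or a sampling weight instead.

```python
import math
from statistics import pvariance

def filter_by_reward_variance(rewards_by_prompt, keep_fraction=0.5):
    """Return the prompt ids with the highest within-prompt reward variance.

    rewards_by_prompt: dict mapping prompt id -> list of rollout rewards.
    keep_fraction: hypothetical knob for how many prompts to keep.
    """
    variances = {pid: pvariance(rs) for pid, rs in rewards_by_prompt.items()}
    # Stable sort: prompts with equal variance keep their original order.
    ranked = sorted(variances, key=variances.get, reverse=True)
    n_keep = max(1, math.ceil(keep_fraction * len(ranked)))
    return ranked[:n_keep]

# Hypothetical binary rewards from four rollouts per prompt.
rewards_by_prompt = {
    "easy":   [1, 1, 1, 1],   # variance 0
    "hard":   [0, 0, 0, 0],   # variance 0
    "mixed":  [1, 0, 1, 0],   # variance 0.25
    "mostly": [1, 1, 1, 0],   # variance 0.1875
}

kept = filter_by_reward_variance(rewards_by_prompt, keep_fraction=0.5)
```

Note that `easy` and `hard` are dropped for the same numerical reason even though they are very different situations.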

This is the weakest part conceptually. The core claim is close to a tautology of policy-gradient methods: if rewards don’t distinguish trajectories, advantage estimates contain little learning signal. In PPO/GRPO-style training, if all sampled trajectories for a prompt receive the same reward, the relative advantage is near zero. Of course the task gradient is weak. Calling this an SNR mechanism is mathematically valid but not deeply explanatory.
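The near-tautology is easy to see in code. Below is a simplified sketch of a GRPO-style group-relative advantage (reward minus group mean, divided by group standard deviation); the `eps` stabilizer and the example rewards are assumptions for illustration.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# All rollouts get the same reward: every advantage is exactly zero,
# so this prompt contributes no task gradient at all.
adv_uniform = group_relative_advantages([1, 1, 1, 1])

# Mixed rewards: advantages are roughly +1 and -1, so there is signal.
adv_mixed = group_relative_advantages([1, 0, 1, 0])
```

Whatever still moves the policy on a zero-advantage prompt (KL penalties, entropy bonuses, format rewards) is by construction unrelated to the task reward.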

The deeper question isn’t “why does low reward variance produce weak gradients?” — that’s expected. The deeper question is: why does weak task signal in LLM agents produce fluent, reusable, input-agnostic reasoning templates instead of ordinary random failure? That more interesting question likely involves language priors, format rewards, CoT imitation, and cheap linguistic variation. The paper doesn’t really address it.

The curriculum learning connection

Reward variance is a local optimization signal, not a semantic measure of prompt usefulness. Low reward variance can mean at least three different things:

| Case | What happens | Correct treatment |
| --- | --- | --- |
| Too easy | all rollouts correct | downsample is fine |
| Too hard | all rollouts wrong | needs curriculum, decomposition, process reward |
| Bad reward design | different behaviors get same reward | fix reward / verifier |

SNR-Aware Filtering treats these cases similarly, but they require different responses. The method is better understood as: select prompts that currently produce reward-discriminative samples. This is very close to curriculum learning, self-paced learning, and hard-example mining — reasonable engineering, but not a new algorithmic principle.

The Deeper Issue: Information Flow Beyond X → Z

The most interesting framing isn’t just about SNR. A stronger unified view:

LLM agents can optimize cheap linguistic proxies while losing task-relevant information flow from input to reasoning to action.

Healthy agent reasoning should preserve the chain:

$$X \rightarrow Z \rightarrow A$$

Input/state → reasoning → action. The failure modes are:

| Failure | What stays high | What collapses |
| --- | --- | --- |
| Template collapse | surface reasoning diversity | $I(X; Z)$ |
| Entropy-gaming | token entropy | task-relevant strategy diversity |
| VLM grounding failure | fluent spatial descriptions | visual-state grounding |
| Action oscillation | repeated valid outputs | input-conditioned planning |

RAGEN-2 mainly focuses on $I(X; Z)$. It doesn’t fully address whether reasoning actually controls action, $I(Z; A)$, or whether action policies remain state-conditioned, $I(X; A)$.

For VLM agents specifically, the most interesting failure may be visually ungrounded template collapse: the agent preserves fluent chain-of-thought and high surface diversity, but loses mutual information with the visual state and collapses into repetitive or oscillatory action policies.

A more general research direction would be to diagnose and preserve task-relevant mutual information across the full agent pipeline: $I(X; Z)$, $I(Z; A)$, and $I(X; A)$.
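For discrete states and actions, a first-pass diagnostic for the action side could be as simple as a plug-in mutual-information estimate over logged (state, action) pairs. This is my own sketch, not something from the paper; the `plugin_mi` helper and the toy logs are hypothetical, and the plug-in estimator is biased upward on small samples, so treat it as a smoke test rather than a measurement.

```python
import math
from collections import Counter

def plugin_mi(pairs):
    """Plug-in estimate of mutual information (bits) from (u, v) samples."""
    n = len(pairs)
    joint = Counter(pairs)
    count_u = Counter(u for u, _ in pairs)
    count_v = Counter(v for _, v in pairs)
    # sum over observed (u, v) of p(u, v) * log2( p(u, v) / (p(u) * p(v)) )
    return sum(
        (c / n) * math.log2(c * n / (count_u[u] * count_v[v]))
        for (u, v), c in joint.items()
    )

# Hypothetical logs: the action depends on the state.
state_conditioned = [("s0", "left")] * 4 + [("s1", "right")] * 4

# Hypothetical logs: the agent oscillates left/right regardless of state.
oscillating = [("s0", "left"), ("s0", "right")] * 2 + \
              [("s1", "left"), ("s1", "right")] * 2

mi_conditioned = plugin_mi(state_conditioned)   # 1 bit
mi_oscillating = plugin_mi(oscillating)         # 0 bits
```

Both logs contain only valid actions and the same action entropy; only the first one shows any dependence on the state. The same function works for $I(Z; A)$ if reasoning traces are first bucketed into discrete categories, which is itself a nontrivial modeling choice.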

What I Take Away

Genuinely useful:

  1. The term template collapse is memorable and well-defined.
  2. The distinction between entropy and input-dependence is important for anyone doing agentic RL.
  3. Retrieval-Acc and matched-minus-marginal are practical diagnostics worth adopting.
  4. The experiments show this failure appears across multiple agentic RL settings.

Overpackaged:

  1. SNR as explanation largely restates known policy-gradient limitations.
  2. Reward-variance filtering resembles curriculum / self-paced learning.
  3. Filtering skips weak-signal prompts rather than making them learnable through process rewards, step-level verification, or better credit assignment.

The paper is valuable because it formalizes a hidden LLM-agent failure mode: reasoning can remain fluent and diverse while losing input-dependence. The SNR explanation and reward-variance filtering are much less novel. The deeper open question — why weak task gradients in LLM agents get absorbed by cheap linguistic templates rather than converted into grounded strategy learning — remains unanswered.