arxiv:2604.06268

RAGEN-2: Reasoning Collapse in Agentic RL

Published on Apr 7 · Submitted by Zihan Wang on Apr 9 · #2 Paper of the day
Abstract

Research identifies template collapse in multi-turn LLM agents as a hidden failure mode undetectable by entropy, proposing mutual information proxies and SNR-aware filtering to improve reasoning quality and task performance.

AI-generated summary

RL training of multi-turn LLM agents is inherently unstable, and reasoning quality directly determines task performance. Entropy is widely used to track reasoning stability. However, entropy only measures diversity within the same input, and cannot tell whether reasoning actually responds to different inputs. In RAGEN-2, we find that even with stable entropy, models can rely on fixed templates that look diverse but are input-agnostic. We call this template collapse, a failure mode invisible to entropy and all existing metrics. To diagnose this failure, we decompose reasoning quality into within-input diversity (Entropy) and cross-input distinguishability (Mutual Information, MI), and introduce a family of mutual information proxies for online diagnosis. Across diverse tasks, mutual information correlates with final performance much more strongly than entropy, making it a more reliable proxy for reasoning quality. We further explain template collapse with a signal-to-noise ratio (SNR) mechanism. Low reward variance weakens task gradients, letting regularization terms dominate and erase cross-input reasoning differences. To address this, we propose SNR-Aware Filtering to select high-signal prompts per iteration using reward variance as a lightweight proxy. Across planning, math reasoning, web navigation, and code execution, the method consistently improves both input dependence and task performance.
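The entropy-vs-MI distinction at the heart of the abstract can be illustrated with a toy calculation. This is a hedged sketch, not the paper's actual proxy: the names, data, and discrete-counting estimator below are my own simplifications (the paper's MI proxies operate online on model outputs), but the mechanism is the same — responses can have high entropy while carrying zero mutual information with the input, which is exactly the template-collapse signature entropy misses.

```python
# Hypothetical illustration: entropy measures within-input diversity,
# MI measures whether responses actually depend on the input.
import math
from collections import Counter

def entropy(samples):
    """Shannon entropy (nats) of the empirical distribution over samples."""
    counts = Counter(samples)
    n = len(samples)
    return -sum((c / n) * math.log(c / n) for c in counts.values())

def mutual_information(pairs):
    """Empirical MI I(X; R) between inputs X and responses R: H(R) - H(R|X)."""
    n = len(pairs)
    h_r = entropy([r for _, r in pairs])
    h_r_given_x = 0.0
    for x in set(x for x, _ in pairs):
        group = [r for xi, r in pairs if xi == x]
        h_r_given_x += (len(group) / n) * entropy(group)
    return h_r - h_r_given_x

# "Template collapse": the same two templates are reused for every input,
# so responses look diverse (nonzero entropy) but MI with the input is zero.
collapsed = [(x, t) for x in ["task_a", "task_b"] for t in ["plan-1", "plan-2"]]
# Input-conditioned reasoning: each input gets its own response.
conditioned = [("task_a", "plan-1"), ("task_a", "plan-1"),
               ("task_b", "plan-2"), ("task_b", "plan-2")]

print(entropy([r for _, r in collapsed]))   # high: looks "stable" to entropy
print(mutual_information(collapsed))        # ~0: responses ignore the input
print(mutual_information(conditioned))      # >0: responses track the input
```

Both sample sets have identical response entropy, so an entropy monitor cannot tell them apart; only the cross-input MI term separates them.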

Community


The move to diagnose reasoning collapse with cross-input mutual information is the right diagnostic shift beyond entropy. I like the online MI proxies and the SNR framing, since you can test input dependence without external models and still get a gradient signal to drive updates. One good question: how robust are those MI proxies to prompt distribution shifts or varying prompt lengths? Could a biased prompt set give a false sense of cross-input dependence? The arxivlens breakdown helped me parse the method details; the summary does a nice job unpacking the entropy-vs-MI distinction: https://arxivlens.com/PaperView/Details/ragen-2-reasoning-collapse-in-agentic-rl-5639-ec2d0d31. Overall, this clean separation plus simple SNR-aware filtering looks like a practical knob for improving input-conditioned planning across domains. Curious how it scales to longer horizons or sparser reward regimes.
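The SNR-aware filtering step the abstract describes — selecting high-signal prompts per iteration using reward variance as a lightweight proxy — could look roughly like the sketch below. The function name, the keep-ratio scheme, and the example rewards are all my assumptions for illustration; the paper's actual selection rule may differ.

```python
# Hypothetical sketch of SNR-aware prompt filtering: keep the prompts whose
# rollout reward variance is highest. Prompts the policy always solves (or
# never solves) have zero reward variance and thus a weak task gradient,
# letting regularization terms dominate -- the SNR mechanism in the abstract.
from statistics import pvariance

def filter_high_signal_prompts(prompt_rewards, keep_ratio=0.5):
    """prompt_rewards: dict mapping prompt -> list of rollout rewards.
    Returns the prompts with the highest reward variance (a cheap SNR proxy)."""
    scored = sorted(prompt_rewards.items(),
                    key=lambda kv: pvariance(kv[1]), reverse=True)
    k = max(1, int(len(scored) * keep_ratio))
    return [prompt for prompt, _ in scored[:k]]

batch = {
    "solve maze A": [0.0, 1.0, 1.0, 0.0],   # mixed outcomes: strong signal
    "solve maze B": [1.0, 1.0, 1.0, 1.0],   # always solved: no gradient
    "solve maze C": [0.0, 0.0, 0.0, 0.0],   # never solved: no gradient
    "solve maze D": [0.0, 0.0, 1.0, 0.0],
}
print(filter_high_signal_prompts(batch))    # ['solve maze A', 'solve maze D']
```

Using population variance over each prompt's rollout rewards keeps the filter cheap: it needs only the rewards already collected for the RL update, no external model or extra rollouts.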


