All Roads Lead to Rome: Incentivizing Divergent Thinking in Vision-Language Models
Abstract
Reinforcement Learning enhances Vision-Language Model reasoning but suffers from diversity collapse; a new Multi-Group Policy Optimization method is proposed to encourage diverse thinking patterns.
Recent studies have demonstrated that Reinforcement Learning (RL), notably Group Relative Policy Optimization (GRPO), can intrinsically elicit and enhance the reasoning capabilities of Vision-Language Models (VLMs). However, despite this promise, the mechanisms that drive the effectiveness of RL-trained models, as well as their limitations, remain underexplored. In this paper, we highlight a fundamental behavioral distinction between RL-trained and base models: the former engage in deeper yet narrower reasoning, while base models, though less refined along any individual path, exhibit broader and more diverse thinking patterns. Through further analysis of training dynamics, we show that GRPO is prone to diversity collapse: models prematurely converge to a limited subset of reasoning strategies and discard most potential alternatives, leading to local optima and poor scalability. To address this, we propose Multi-Group Policy Optimization (MUPO), a simple yet effective approach designed to incentivize divergent thinking across multiple solutions, and demonstrate its effectiveness on established benchmarks. Project page: https://xytian1008.github.io/MUPO/
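To make the diversity-collapse intuition concrete, below is a minimal sketch of the group-relative advantage used by GRPO, alongside a hypothetical multi-group variant in the spirit of MUPO. The strategy-cluster grouping and all function names here are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def grpo_advantages(rewards, eps=1e-8):
    # GRPO: sample G responses per prompt, then score each response
    # against the group mean, so advantages are zero-sum within a group.
    r = np.asarray(rewards, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def multi_group_advantages(rewards, strategy_ids, eps=1e-8):
    # Hypothetical multi-group variant: normalize rewards within each
    # solution-strategy cluster, so a less common strategy is not
    # penalized merely for trailing the dominant one (an assumption
    # about MUPO's design, not the paper's exact objective).
    r = np.asarray(rewards, dtype=np.float64)
    ids = np.asarray(strategy_ids)
    adv = np.empty_like(r)
    for s in np.unique(ids):
        mask = ids == s
        g = r[mask]
        adv[mask] = (g - g.mean()) / (g.std() + eps)
    return adv

# Example: two strategies. Plain GRPO gives the weaker cluster uniformly
# negative advantages, while per-strategy grouping preserves a gradient
# toward the best member of each cluster.
rewards = [1.0, 0.9, 0.8, 0.2, 0.1]
strategies = [0, 0, 0, 1, 1]
print(grpo_advantages(rewards))
print(multi_group_advantages(rewards, strategies))
```

Under plain GRPO, every response in the weaker strategy cluster receives a negative advantage, which is one mechanism by which alternative reasoning paths can be driven out of the policy.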
Community
We identify diversity collapse in GRPO-trained VLMs and propose MUPO, a simple drop-in replacement that incentivizes divergent reasoning across multiple solution strategies, achieving new state-of-the-art results (CVPR2026).
Project page👉: https://xytian1008.github.io/MUPO/
Github repo👉: https://github.com/xytian1008/MUPO
Librarian Bot: the following similar papers were recommended by the Semantic Scholar API.
- OPE: Overcoming Information Saturation in Parallel Thinking via Outline-Guided Path Exploration (2026)
- Recycling Failures: Salvaging Exploration in RLVR via Fine-Grained Off-Policy Guidance (2026)
- Does Your Reasoning Model Implicitly Know When to Stop Thinking? (2026)
- FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization (2026)
- Unveiling Implicit Advantage Symmetry: Why GRPO Struggles with Exploration and Difficulty Adaptation (2026)
- Rethinking Token-Level Policy Optimization for Multimodal Chain-of-Thought (2026)
- Thickening-to-Thinning: Reward Shaping via Human-Inspired Learning Dynamics for LLM Reasoning (2026)
Get this paper in your agent:
hf papers read 2604.00479
Don't have the latest CLI? Install it with:
curl -LsSf https://hf.co/cli/install.sh | bash