Paying Less Generalization Tax: A Cross-Domain Generalization Study of RL Training for LLM Agents Paper • 2601.18217 • Published 18 days ago • 11
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following Paper • 2511.21662 • Published Nov 26, 2025 • 11
When Visualizing is the First Step to Reasoning: MIRA, a Benchmark for Visual Chain-of-Thought Paper • 2511.02779 • Published Nov 4, 2025 • 59
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3, 2025 • 32
ROVER: Benchmarking Reciprocal Cross-Modal Reasoning for Omnimodal Generation Paper • 2511.01163 • Published Nov 3, 2025 • 32
SPICE: Self-Play In Corpus Environments Improves Reasoning Paper • 2510.24684 • Published Oct 28, 2025 • 18
PRISM-Bench: A Benchmark of Puzzle-Based Visual Tasks with CoT Error Detection Paper • 2510.23594 • Published Oct 27, 2025 • 6
Vision-Zero: Scalable VLM Self-Improvement via Strategic Gamified Self-Play Paper • 2509.25541 • Published Sep 29, 2025 • 140
Large Reasoning Models Learn Better Alignment from Flawed Thinking Paper • 2510.00938 • Published Oct 1, 2025 • 59