GBQA: A Game Benchmark for Evaluating LLMs as Quality Assurance Engineers Paper • 2604.02648 • Published 8 days ago • 41
Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents Paper • 2604.06132 • Published 4 days ago • 107
MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale Paper • 2604.04771 • Published 5 days ago • 109
FIPO: Eliciting Deep Reasoning with Future-KL Influenced Policy Optimization Paper • 2603.19835 • Published 21 days ago • 330
Query-focused and Memory-aware Reranker for Long Context Processing Paper • 2602.12192 • Published Feb 12 • 57
Mindscape-Aware Retrieval Augmented Generation for Improved Long Context Understanding Paper • 2512.17220 • Published Dec 19, 2025 • 114
PRELUDE: A Benchmark Designed to Require Global Comprehension and Reasoning over Long Contexts Paper • 2508.09848 • Published Aug 13, 2025 • 71