ClawBench: Can AI Agents Complete Everyday Online Tasks? Paper • 2604.08523 • Published 6 days ago • 253
QiMeng-PRepair: Precise Code Repair via Edit-Aware Reward Optimization Paper • 2604.05963 • Published 8 days ago • 8
SWE-Factory: Your Automated Factory for Issue Resolution Training Data and Evaluation Benchmarks Paper • 2506.10954 • Published Jun 12, 2025 • 54
PhyX: Does Your Model Have the "Wits" for Physical Reasoning? Paper • 2505.15929 • Published May 21, 2025 • 49