AIRS-Bench: a Suite of Tasks for Frontier AI Research Science Agents Paper • 2602.06855 • Published 6 days ago • 64
Judging What We Cannot Solve: A Consequence-Based Approach for Oracle-Free Evaluation of Research-Level Math Paper • 2602.06291 • Published 7 days ago • 22
MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents Paper • 2602.02474 • Published 10 days ago • 52
MARS: Modular Agent with Reflective Search for Automated AI Research Paper • 2602.02660 • Published 10 days ago • 61
AgentIF-OneDay: A Task-level Instruction-Following Benchmark for General AI Agents in Daily Scenarios Paper • 2601.20613 • Published 15 days ago • 10
PaperBanana: Automating Academic Illustration for AI Scientists Paper • 2601.23265 • Published 13 days ago • 175
OCRVerse: Towards Holistic OCR in End-to-End Vision-Language Models Paper • 2601.21639 • Published 14 days ago • 49
Idea2Story: An Automated Pipeline for Transforming Research Concepts into Complete Scientific Narratives Paper • 2601.20833 • Published 15 days ago • 175
daVinci-Dev: Agent-native Mid-training for Software Engineering Paper • 2601.18418 • Published 17 days ago • 124
Can LLMs Clean Up Your Mess? A Survey of Application-Ready Data Preparation with LLMs Paper • 2601.17058 • Published 21 days ago • 188
ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios Paper • 2601.08620 • Published about 1 month ago • 11
AgencyBench: Benchmarking the Frontiers of Autonomous Agents in 1M-Token Real-World Contexts Paper • 2601.11044 • Published 27 days ago • 34
Paper2Rebuttal: A Multi-Agent Framework for Transparent Author Response Assistance Paper • 2601.14171 • Published 23 days ago • 48
MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents Paper • 2601.12346 • Published 25 days ago • 49
FutureOmni: Evaluating Future Forecasting from Omni-Modal Context for Multimodal LLMs Paper • 2601.13836 • Published 23 days ago • 34