Open Agent Evaluation Laboratory

university

https://boxiyu.github.io/

boxi-yu-194b63279

AI & ML interests

Code Agent, Benchmark Augmentation

CWCY

updated 2 datasets 2 months ago

OpenAgentLab/SWE-bench_Pro-ABS

Viewer • Updated Mar 5 • 731 • 25

OpenAgentLab/SWE-Bench_Verified_ABS

Viewer • Updated Mar 5 • 500 • 107

CWCY

published a dataset 2 months ago

OpenAgentLab/SWE-bench_Pro-ABS

authored 2 papers 11 months ago

How Should I Build A Benchmark? Revisiting Code-Related Benchmarks For LLMs

Paper • 2501.10711 • Published Jan 18, 2025 • 1

UTBoost: Rigorous Evaluation of Coding Agents on SWE-Bench

Paper • 2506.09289 • Published Jun 10, 2025 • 2