Show ACP agent results in the leaderboard
Summary
The HF Space currently only loads results/{model}/ (default OpenHands runs).
The ACP runs (acp-claude, acp-codex, acp-gemini, openhands_subagents)
live in alternative_agents/{type}/{model}/ in the openhands-index-results
repo and never made it into the dataframe, so the website silently dropped
them. After OpenHands/openhands-index-results#820–#829 + #830, all the ACP
Claude Code data from the master table in OpenHands/benchmarks#576 is in
the canonical location, but the leaderboard still doesn't show it.
This PR teaches the loader to ingest alternative_agents/ and adds an
Agent column to the leaderboard so OpenHands vs Claude Code vs Codex
vs Gemini CLI are visible at a glance.
Changes
- setup_data.py: copy alternative_agents/ alongside results/ when fetching the index repo, so all submissions land in the data dir.
- simple_data_loader.py:
  - Factor per-directory loading into _records_from_agent_dir and have _load_from_agent_dirs walk both results/ and alternative_agents/{type}/{model}/.
  - Default agent_name per agent_type (Claude Code / Codex / Gemini CLI / OpenHands Sub-agents), matching the AGENT_NAME_BY_TYPE map in OpenHands/evaluation push_to_index_from_archive.py.
  - Include agent_name in agent_id (name_version_model) so an OpenHands run and a Claude Code run on the same SDK version + model don't collide into one row.
  - Surface agent_name on the transformed record.
- leaderboard_transformer.py:
  - Map agent_name → "Agent" in _pretty_column_name.
  - Insert "Agent" into base_cols between id and Language Model.
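The loader changes above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: iter_agent_dirs, default_agent_name, and make_agent_id are hypothetical helper names, the "openhands" type label for results/ is made up for the example, and the type-to-name mapping is an assumption inferred from the directory names and display names listed in this description.

```python
from pathlib import Path

# Assumed mapping. The PR says the real defaults mirror AGENT_NAME_BY_TYPE in
# push_to_index_from_archive.py; the exact keys there are not shown here.
DEFAULT_AGENT_NAME = {
    "acp-claude": "Claude Code",
    "acp-codex": "Codex",
    "acp-gemini": "Gemini CLI",
    "openhands_subagents": "OpenHands Sub-agents",
}


def default_agent_name(agent_type):
    """Fall back to "OpenHands" for the default results/ runs."""
    return DEFAULT_AGENT_NAME.get(agent_type, "OpenHands")


def iter_agent_dirs(data_dir):
    """Yield (agent_type, model_dir) for results/ and alternative_agents/."""
    root = Path(data_dir)

    # Default OpenHands runs: results/{model}/
    results = root / "results"
    if results.is_dir():
        for model_dir in sorted(p for p in results.iterdir() if p.is_dir()):
            yield "openhands", model_dir

    # Other harnesses: alternative_agents/{type}/{model}/
    alt = root / "alternative_agents"
    if alt.is_dir():
        for type_dir in sorted(p for p in alt.iterdir() if p.is_dir()):
            for model_dir in sorted(p for p in type_dir.iterdir() if p.is_dir()):
                yield type_dir.name, model_dir


def make_agent_id(agent_name, version, model):
    """Build a name_version_model key so different harnesses never share a row."""
    return f"{agent_name}_{version}_{model}"
```

Putting the agent name first in the key means two runs that share an SDK version and model still produce distinct ids as long as their harness differs.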
Local verification
Cloned the latest openhands-index-results and pointed the loader at it.
The loader now returns 29 rows: 24 OpenHands + 2 Claude Code + 1 Codex + 2
OpenHands Sub-agents. The new Claude Code rows match the master table in
OpenHands/benchmarks#576:
Claude Code / claude-opus-4-6: swebench 74.4 swtbench 66.7 gaia 66.1 commit0 50.0 swe-bench-multimodal 32.4
Claude Code / claude-sonnet-4-5: swebench 74.4 swtbench 69.3 gaia 63.0 commit0 31.2 swe-bench-multimodal 35.3
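A per-agent tally like the one quoted above can be reproduced with a few lines. This is only a sketch: rows_per_agent is a hypothetical helper, and it assumes the loader's output can be viewed as a list of dicts carrying an agent_name key, which may not match the real record shape.

```python
from collections import Counter


def rows_per_agent(records):
    """Count leaderboard rows per agent; missing names count as OpenHands."""
    return Counter(r.get("agent_name", "OpenHands") for r in records)
```

Summing the counter's values should give the total row count, which makes it easy to check that no existing OpenHands rows went missing.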
Test plan
- Reviewer: load the Space preview built from this PR and confirm the leaderboard table now has an Agent column and shows Claude Code / Codex / OpenHands Sub-agents rows.
- Confirm the existing OpenHands rows look unchanged (same scores, no missing entries).
Two follow-ups on top of the existing PR:
f00a617: Show harness in scatter-plot hover
The Alternative Agents scatter labels points with the Language Model's company logo but nothing else, so two points for the same LLM run under different harnesses (claude-sonnet-4-5 under Claude Code vs OpenHands Sub-agents) are visually indistinguishable on hover. Added a Harness: line to format_hover_text that only shows up when the row has an Agent value. The canonical OpenHands pages drop that column in DataTransformer.view(), so this is a no-op there.
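The conditional hover line could look something like the sketch below. It is not the actual format_hover_text: hover_text_with_harness is a hypothetical stand-in, the row is assumed to be a plain dict with "Language Model" and optional "Agent" keys, and "<br>" is assumed as the line separator because Plotly hover text accepts it.

```python
def hover_text_with_harness(row):
    """Build hover text, adding a Harness line only when Agent is present."""
    lines = [f"Language Model: {row['Language Model']}"]

    # Views that drop the Agent column (or leave it empty) get no extra line.
    agent = row.get("Agent")
    if isinstance(agent, str) and agent.strip():
        lines.append(f"Harness: {agent}")

    return "<br>".join(lines)
```

Checking for a non-empty string rather than plain truthiness keeps missing values such as NaN from showing up as a harness name.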
55dc87f: Screenshot of the page in action
Added docs/screenshots/alternative-agents.png so reviewers can see the live state after the ACP benchmark data restoration (openhands-index-results#832–#843): all seven harness × model rows render complete, the scatter plot shows the populated cost-vs-score frontier, and the "Show incomplete entries" toggle reports zero by default.
