Show ACP agent results in the leaderboard

#11

Summary

The HF Space currently only loads results/{model}/ (default OpenHands runs).
The ACP runs (acp-claude, acp-codex, acp-gemini, openhands_subagents)
live in alternative_agents/{type}/{model}/ in the openhands-index-results
repo and never made it into the dataframe, so the website silently dropped
them. After OpenHands/openhands-index-results#820–#829 + #830, all the ACP
Claude Code data from the master table in OpenHands/benchmarks#576 is in
the canonical location, but the leaderboard still doesn't show it.

This PR teaches the loader to ingest alternative_agents/ and adds an
Agent column to the leaderboard so OpenHands vs Claude Code vs Codex
vs Gemini CLI are visible at a glance.

Changes

  • setup_data.py: copy alternative_agents/ alongside results/ when fetching the index repo, so all submissions land in the data dir.
  • simple_data_loader.py:
    • Factor per-directory loading into _records_from_agent_dir and have _load_from_agent_dirs walk both results/ and alternative_agents/{type}/{model}/.
    • Default agent_name per agent_type (Claude Code / Codex / Gemini CLI / OpenHands Sub-agents), matching the AGENT_NAME_BY_TYPE map in OpenHands/evaluation push_to_index_from_archive.py.
    • Include agent_name in agent_id (name_version_model) so an OpenHands run and a Claude Code run on the same SDK version + model don't collide into one row.
    • Surface agent_name on the transformed record.
  • leaderboard_transformer.py:
    • Map agent_name → "Agent" in _pretty_column_name.
    • Insert "Agent" into base_cols between id and Language Model.

Local verification

Cloned the latest openhands-index-results and pointed the loader at it.
The loader now returns 29 rows: 24 OpenHands + 2 Claude Code + 1 Codex + 2
OpenHands Sub-agents. The new Claude Code rows match the master table in
OpenHands/benchmarks#576:

Claude Code / claude-opus-4-6:    swebench 74.4  swtbench 66.7  gaia 66.1  commit0 50.0  swe-bench-multimodal 32.4
Claude Code / claude-sonnet-4-5:  swebench 74.4  swtbench 69.3  gaia 63.0  commit0 31.2  swe-bench-multimodal 35.3
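
A per-agent tally like the one quoted above could be reproduced with a check along these lines. It is only a sketch: count_rows_by_agent is a hypothetical helper, and the rows here are placeholders built from the counts reported in this section rather than real loader output.

```python
from collections import Counter
from typing import Dict, Iterable, Mapping


def count_rows_by_agent(rows: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count leaderboard rows per value of the "Agent" column."""
    return dict(Counter(row["Agent"] for row in rows))


# Counts reported in this PR description.
expected = {
    "OpenHands": 24,
    "Claude Code": 2,
    "Codex": 1,
    "OpenHands Sub-agents": 2,
}

# Placeholder rows standing in for the loader's output.
rows = [{"Agent": name} for name, n in expected.items() for _ in range(n)]

counts = count_rows_by_agent(rows)
assert counts == expected
assert sum(counts.values()) == 29
```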

Test plan

  • Reviewer: load the Space preview built from this PR, confirm the leaderboard table now has an Agent column and shows Claude Code / Codex / OpenHands Sub-agents rows.
  • Confirm the existing OpenHands rows look unchanged (same scores, no missing entries).

Two follow-ups on top of the existing PR:

f00a617: Show harness in scatter-plot hover
The Alternative Agents scatter labels points with the Language Model's company logo but nothing else, so two points for the same LLM run under different harnesses (claude-sonnet-4-5 under Claude Code vs OpenHands Sub-agents) are visually indistinguishable on hover. Added a Harness: line to format_hover_text that only shows up when the row has an Agent value. The canonical OpenHands pages drop that column in DataTransformer.view(), so this is a no-op there.
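
The conditional hover line could look roughly like the sketch below. It is not the actual format_hover_text: the function name, the row shape, and the "<br>" line separator (the convention Plotly uses for hover text) are assumptions made for illustration.

```python
from typing import Mapping


def format_hover_text_sketch(row: Mapping[str, object]) -> str:
    """Build hover text, adding a Harness line only when Agent is present."""
    lines = [f"Language Model: {row.get('Language Model', 'unknown')}"]

    agent = row.get("Agent")
    # Skip missing columns, None, NaN (a float) and empty strings alike.
    if isinstance(agent, str) and agent.strip():
        lines.append(f"Harness: {agent}")

    return "<br>".join(lines)
```

Rows from pages that drop the Agent column pass through unchanged, which is what makes the extra line a no-op there.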

55dc87f: Screenshot of the page in action
Added docs/screenshots/alternative-agents.png so reviewers can see the live state after the ACP benchmark data restoration (openhands-index-results#832–#843): all seven harness × model rows render complete, the scatter plot shows the populated cost-vs-score frontier, and the "Show incomplete entries" toggle reports zero by default.

Alternative Agents page

gneubig changed pull request status to merged
