Show ACP agent results in the leaderboard

#11

by simonrosenberg1 - opened 9 days ago

base: refs/heads/main ← from: refs/pr/11


+533 -130

Show ACP agent results in the leaderboard bef7adee

simonrosenberg1

9 days ago

Summary

The HF Space currently only loads results/{model}/ (default OpenHands runs).
The ACP runs (acp-claude, acp-codex, acp-gemini, openhands_subagents)
live in alternative_agents/{type}/{model}/ in the openhands-index-results
repo and never made it into the dataframe, so the website silently dropped
them. After OpenHands/openhands-index-results#820–#829 + #830, all the ACP
Claude Code data from the master table in OpenHands/benchmarks#576 is in
the canonical location, but the leaderboard still doesn't show it.

This PR teaches the loader to ingest alternative_agents/ and adds an
Agent column to the leaderboard so OpenHands vs Claude Code vs Codex
vs Gemini CLI are visible at a glance.

Changes

setup_data.py — copy alternative_agents/ alongside results/ when fetching the index repo, so all submissions land in the data dir.
simple_data_loader.py:
- Factor per-directory loading into _records_from_agent_dir and have _load_from_agent_dirs walk both results/ and alternative_agents/{type}/{model}/.
- Default agent_name per agent_type (Claude Code / Codex / Gemini CLI / OpenHands Sub-agents), matching the AGENT_NAME_BY_TYPE map in OpenHands/evaluation push_to_index_from_archive.py.
- Include agent_name in agent_id (name_version_model) so an OpenHands run and a Claude Code run on the same SDK version + model don't collide into one row.
- Surface agent_name on the transformed record.
leaderboard_transformer.py:
- Map agent_name → "Agent" in _pretty_column_name.
- Insert "Agent" into base_cols between id and Language Model.
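
A rough sketch of the loader changes listed above, not the code from this PR: it walks both results/{model}/ and alternative_agents/{type}/{model}/, assigns a default agent name per agent type, and builds an agent_id that includes the agent name. The function names `iter_agent_dirs` and `make_agent_id`, the type-to-name pairing in `DEFAULT_AGENT_NAME`, and the "OpenHands" default for results/ are assumptions made for illustration.

```python
from pathlib import Path
from typing import Iterator, Tuple

# Assumed pairing of directory type -> display name. The PR says the defaults
# match AGENT_NAME_BY_TYPE in push_to_index_from_archive.py; the exact keys
# used there are not shown, so treat this mapping as illustrative.
DEFAULT_AGENT_NAME = {
    "acp-claude": "Claude Code",
    "acp-codex": "Codex",
    "acp-gemini": "Gemini CLI",
    "openhands_subagents": "OpenHands Sub-agents",
}


def _subdirs(path: Path) -> list:
    """Return the immediate subdirectories of path in sorted order."""
    return sorted(p for p in path.iterdir() if p.is_dir())


def iter_agent_dirs(data_dir: Path) -> Iterator[Tuple[str, Path]]:
    """Yield (agent_name, model_dir) for both directory layouts."""
    results = data_dir / "results"
    if results.is_dir():
        for model_dir in _subdirs(results):
            # Default OpenHands runs live directly under results/{model}/.
            yield "OpenHands", model_dir

    alternative = data_dir / "alternative_agents"
    if alternative.is_dir():
        for type_dir in _subdirs(alternative):
            # Fall back to the raw directory name for unknown agent types.
            name = DEFAULT_AGENT_NAME.get(type_dir.name, type_dir.name)
            for model_dir in _subdirs(type_dir):
                yield name, model_dir


def make_agent_id(agent_name: str, version: str, model: str) -> str:
    """Build a name_version_model id so different harnesses do not collide."""
    return f"{agent_name}_{version}_{model}"
```

With the agent name in the id, an OpenHands run and a Claude Code run on the same version and model produce two distinct ids instead of being merged into one row.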

Local verification

Cloned the latest openhands-index-results and pointed the loader at it.
The loader now returns 29 rows: 24 OpenHands + 2 Claude Code + 1 Codex + 2
OpenHands Sub-agents. The new Claude Code rows match the master table in
OpenHands/benchmarks#576:

Claude Code / claude-opus-4-6:    swebench 74.4  swtbench 66.7  gaia 66.1  commit0 50.0  swe-bench-multimodal 32.4
Claude Code / claude-sonnet-4-5:  swebench 74.4  swtbench 69.3  gaia 63.0  commit0 31.2  swe-bench-multimodal 35.3

Test plan

Reviewer: load the Space preview built from this PR, confirm the leaderboard table now has an Agent column and shows Claude Code / Codex / OpenHands Sub-agents rows.
Confirm the existing OpenHands rows look unchanged (same scores, no missing entries).

Move alternative agents to a dedicated page d7034b85

Show harness in scatter-plot hover on Alternative Agents page f00a617b

Add Alternative Agents page screenshot 55dc87fa

simonrosenberg1

8 days ago

Two follow-ups on top of the existing PR:

f00a617 — Show harness in scatter-plot hover
The Alternative Agents scatter labels each point with only the Language Model's company logo, so two points for the same LLM run under different harnesses (claude-sonnet-4-5 under Claude Code vs OpenHands Sub-agents) are indistinguishable on hover. Added a Harness: line to format_hover_text that appears only when the row has an Agent value. The canonical OpenHands pages drop that column in DataTransformer.view(), so this is a no-op there.
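
As an illustration of the conditional hover line, here is a minimal sketch. It is not the actual format_hover_text from the Space: the function name `hover_text_with_harness`, the dict-shaped row, and the `<br>`-joined output are assumptions. Only the "Agent" and "Language Model" column names come from this PR.

```python
def hover_text_with_harness(row: dict) -> str:
    """Build hover text, adding a Harness line only when Agent is set."""
    lines = [f"<b>{row['Language Model']}</b>"]

    agent = row.get("Agent")
    # Treat a missing key, None, NaN (a float), or a blank string as "no agent",
    # so pages whose view drops the Agent column get unchanged hover text.
    if isinstance(agent, str) and agent.strip():
        lines.append(f"Harness: {agent.strip()}")

    return "<br>".join(lines)
```

The `isinstance` check matters if rows come from a pandas dataframe, where a missing Agent would typically be NaN, which is truthy in Python.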

55dc87f — Screenshot of the page in action
Added docs/screenshots/alternative-agents.png so reviewers can see the live state after the ACP benchmark data restoration (openhands-index-results#832–#843): all seven harness × model rows render complete, the scatter plot shows the populated cost-vs-score frontier, and the "Show incomplete entries" toggle reports zero by default.

Add harness logo assets for Alternative Agents composite markers 6860223b

Stack model + harness logos on Alternative Agents scatter fb16d57a

Replace harness badge letters with generic pictograms 8a1c0edd

Move harness logos to assets/harnesses/ with graceful missing-file fallback 282557d8

Add harness logos for Claude Code, Codex CLI, Gemini CLI, OpenHands e6877f36

gneubig changed pull request status to merged 2 days ago
