arxiv:2604.06132

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Published on Apr 7 · Submitted by Lei Li on Apr 8
#2 Paper of the day
Abstract

Claw-Eval addresses limitations in agent benchmarks by providing comprehensive evaluation across multiple modalities with trajectory-aware grading and safety assessments.

AI-generated summary

Large language models are increasingly deployed as autonomous agents executing multi-step workflows in real-world software environments. However, existing agent benchmarks suffer from three critical limitations: (1) trajectory-opaque grading that checks only final outputs, (2) underspecified safety and robustness evaluation, and (3) narrow modality coverage and interaction paradigms. We introduce Claw-Eval, an end-to-end evaluation suite addressing all three gaps. It comprises 300 human-verified tasks spanning 9 categories across three groups (general service orchestration, multimodal perception and generation, and multi-turn professional dialogue). Every agent action is recorded through three independent evidence channels (execution traces, audit logs, and environment snapshots), enabling trajectory-aware grading over 2,159 fine-grained rubric items. The scoring protocol evaluates Completion, Safety, and Robustness, reporting Average Score, Pass@k, and Pass^k across three trials to distinguish genuine capability from lucky outcomes. Experiments on 14 frontier models reveal that: (1) trajectory-opaque evaluation is systematically unreliable, missing 44% of safety violations and 13% of robustness failures that our hybrid pipeline catches; (2) controlled error injection primarily degrades consistency rather than peak capability, with Pass^3 dropping up to 24% while Pass@3 remains stable; (3) multimodal performance varies sharply, with most models performing worse on video than on document or image inputs, and no single model dominating across all modalities. Beyond benchmarking, Claw-Eval highlights actionable directions for agent development, shedding light on what it takes to build agents that are not only capable but reliably deployable.

Community

Paper author · Paper submitter

Claw-Eval

The trajectory-aware evaluation in Claw-Eval is exactly the move we needed to separate genuine capability from lucky outputs. My question: how sensitive are Pass@k and Pass^k to the logging cadence, or to desynchronization among traces, audit logs, and environment snapshots? The arxivlens breakdown helped me parse the method details, especially how they use three evidence channels (https://arxivlens.com/PaperView/Details/claw-eval-toward-trustworthy-evaluation-of-autonomous-agents-5634-89016710). An ablation varying the snapshot frequency or removing one channel entirely could reveal which piece actually drives the robustness gains.

Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents

Claw-Eval is an end-to-end evaluation suite for autonomous agents that addresses a critical blind spot in current benchmarks: evaluating only final outputs misses dangerous intermediate behaviors. The suite covers 300 tasks across 9 categories and introduces trajectory-aware grading over 2,159 rubric items. By collecting three independent evidence channels and reporting Average Score, Pass@k, and Pass^k across 3 trials, Claw-Eval provides trustworthy measurements of Completion, Safety, and Robustness. Testing 14 frontier models reveals that trajectory-opaque evaluation misses 44% of safety violations and 13% of robustness failures.

Key Idea

Claw-Eval collects three independent evidence channels for every agent execution: execution traces (step-by-step actions), audit logs (system-level records), and environment snapshots (state of the world at checkpoints). By cross-referencing all three channels, the grader catches problems that any single channel would miss -- such as an agent that produces a correct final output but takes dangerous intermediate steps.
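The cross-referencing idea can be sketched as follows. This is a minimal illustration only: the record fields, the `forbidden` set, and the grading logic are assumptions for the example, not Claw-Eval's actual schema or pipeline.

```python
from dataclasses import dataclass

# Hypothetical evidence record; field names are illustrative, not Claw-Eval's schema.
@dataclass
class Evidence:
    trace_actions: list   # step-by-step actions from the execution trace
    audit_events: list    # system-level records (e.g. file or network access)
    snapshots: list       # environment state (as sets of keys) at checkpoints

def cross_reference(ev: Evidence, forbidden: set) -> list:
    """Flag violations visible in ANY channel, even when the final output is correct."""
    violations = []
    for action in ev.trace_actions:
        if action in forbidden:
            violations.append(("trace", action))
    for event in ev.audit_events:
        if event in forbidden:
            violations.append(("audit", event))
    # Diffing consecutive snapshots can reveal side effects neither log recorded.
    for before, after in zip(ev.snapshots, ev.snapshots[1:]):
        for change in (after - before) & forbidden:
            violations.append(("snapshot", change))
    return violations
```

A violation that appears only in the audit log or only as a snapshot diff would be invisible to a grader reading the execution trace alone, which is the failure mode the three-channel design guards against.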

(Figure: the three independent evidence channels.)

Method / Approach

The key methodological insight is the difference between trajectory-opaque and trajectory-aware evaluation. Trajectory-opaque evaluation only examines the final output, while trajectory-aware evaluation inspects the entire sequence of agent actions. The paper demonstrates that opaque evaluation systematically underestimates risk: it misses 44% of safety violations (e.g., agents accessing unauthorized resources before producing correct results) and 13% of robustness failures (e.g., agents that succeed through brittle retry loops).
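The contrast can be made concrete with a toy pair of graders. The data shapes, the `unsafe_actions` set, and the retry threshold are assumptions for illustration, not the paper's actual rubric.

```python
def grade_opaque(final_output: str, expected: str) -> bool:
    # Trajectory-opaque: only the final output is checked.
    return final_output == expected

def grade_aware(trajectory: list, final_output: str, expected: str,
                unsafe_actions: set, max_retries: int = 3) -> dict:
    # Trajectory-aware: every intermediate action is inspected as well.
    safety_ok = not any(a in unsafe_actions for a in trajectory)
    # Brittle retry loops count against robustness even when the task succeeds.
    retries = sum(1 for a in trajectory if a == "retry")
    robust_ok = retries <= max_retries
    return {"completion": final_output == expected,
            "safety": safety_ok,
            "robustness": robust_ok}
```

An agent that reaches the expected output after accessing an unauthorized resource passes the opaque check but fails the aware safety check, which is exactly the class of violation the paper reports opaque grading missing.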

(Figure: trajectory-opaque vs. trajectory-aware grading.)

Results

Claw-Eval scores agents on three dimensions -- Completion (did it finish the task?), Safety (did it avoid harmful behaviors?), and Robustness (does it succeed consistently?) -- reporting Average Score, Pass@k, and Pass^k across 3 independent trials per task to account for variance. Across 14 evaluated frontier models, no model achieves strong scores on all three dimensions simultaneously, revealing fundamental tradeoffs in current agent architectures. The 2,159 rubric items ensure fine-grained coverage across the 300 tasks and 9 categories.
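The three per-task statistics can be sketched from per-trial pass/fail results. The definitions below follow the common usage -- Pass@k means at least one of k trials passes, Pass^k means all k trials pass -- which matches the paper's framing of Pass^k as a consistency measure; treat them as an assumption about the exact formulas.

```python
from statistics import mean

def score_task(trial_results: list) -> dict:
    """Aggregate booleans from k independent trials (k = 3 in Claw-Eval)
    into the three reported statistics."""
    return {
        "average": mean(1.0 if r else 0.0 for r in trial_results),
        "pass_at_k": any(trial_results),   # at least one trial passed
        "pass_hat_k": all(trial_results),  # every trial passed
    }
```

This also shows why controlled error injection can drop Pass^3 while leaving Pass@3 stable: turning one of three passing trials into a failure flips `pass_hat_k` to False without touching `pass_at_k`.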

(Figure: the Completion / Safety / Robustness scoring triad.)


Get this paper in your agent:

hf papers read 2604.06132
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash
