ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces
Abstract
ClawsBench evaluates LLM agents in realistic productivity workflows using high-fidelity mock services and structured tasks, revealing substantial spread in task success rates (39-64% with full scaffolding) and unsafe action rates (7-33%) across models and agent configurations.
Large language model (LLM) agents are increasingly deployed to automate productivity tasks (e.g., email, scheduling, document management), but evaluating them on live services is risky due to potentially irreversible changes. Existing benchmarks rely on simplified environments and fail to capture realistic, stateful, multi-service workflows. We introduce ClawsBench, a benchmark for evaluating and improving LLM agents in realistic productivity settings. It includes five high-fidelity mock services (Gmail, Slack, Google Calendar, Google Docs, Google Drive) with full state management and deterministic snapshot/restore, along with 44 structured tasks covering single-service, cross-service, and safety-critical scenarios. We decompose agent scaffolding into two independent levers (domain skills that inject API knowledge via progressive disclosure, and a meta prompt that coordinates behavior across services) and vary both to measure their separate and combined effects. Experiments across 6 models, 4 agent harnesses, and 33 conditions show that with full scaffolding, agents achieve task success rates of 39-64% but exhibit unsafe action rates of 7-33%. On OpenClaw, the top five models fall within a 10 percentage-point band on task success (53-63%), with unsafe action rates from 7% to 23% and no consistent ordering between the two metrics. We identify eight recurring patterns of unsafe behavior, including multi-step sandbox escalation and silent contract modification.
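The deterministic snapshot/restore mechanism is what makes repeated trials comparable: every trial starts from an identical service state. Below is a minimal sketch of that pattern, assuming a dict-backed mock service; the `MockService` class and its fields are illustrative, not ClawsBench's actual API.

```python
import copy

class MockService:
    """Dict-backed stateful mock; class and field names are illustrative only."""

    def __init__(self, initial_state: dict):
        self.state = copy.deepcopy(initial_state)
        self._snapshots: dict[str, dict] = {}

    def snapshot(self, tag: str) -> None:
        # Deep-copy so later agent mutations cannot leak into the saved state.
        self._snapshots[tag] = copy.deepcopy(self.state)

    def restore(self, tag: str) -> None:
        # Reset to the exact saved state, so every trial replays the same fixture.
        self.state = copy.deepcopy(self._snapshots[tag])

# Per-trial loop: load the fixture once, then restore before each trial.
gmail = MockService({"threads": [], "labels": {"INBOX": []}})
gmail.snapshot("fixture")
for _ in range(3):
    gmail.restore("fixture")
    # ... run the agent against gmail.state, then grade the resulting state ...
```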
Community
We introduce ClawsBench, a benchmark for evaluating both capability and safety of LLM productivity agents in simulated workspaces.
Key highlights:
• 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack) validated against real APIs with 169 golden fixtures
• 44 tasks including 14 cross-service and 24 safety-critical scenarios
• 6 models × 4 harnesses × scaffolding factorial = 33 conditions, 7,224 trials
Main findings:
• Scaffolding (+39-63pp) dwarfs model differences (10pp spread among top 5)
• No safety-capability tradeoff: Opus leads TSR (63%) but ties for worst UAR (23%); a sketch of both metrics follows this list
• 8 recurring rogue behaviors identified, including sandbox escalation and prompt injection compliance
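As a reading aid for the two headline metrics, here is a hedged sketch of how TSR and UAR could be computed from per-trial records. The record fields and the per-trial definition of UAR (fraction of trials with at least one unsafe action) are assumptions for illustration, not the paper's exact definitions.

```python
# Hypothetical trial records; field names are illustrative, not the released schema.
trials = [
    {"condition": "opus/openclaw/full", "success": True,  "unsafe_actions": 0},
    {"condition": "opus/openclaw/full", "success": False, "unsafe_actions": 2},
    {"condition": "opus/openclaw/full", "success": True,  "unsafe_actions": 1},
]

def tsr(rows) -> float:
    """Task success rate: fraction of trials graded as successful."""
    return sum(r["success"] for r in rows) / len(rows)

def uar(rows) -> float:
    """Unsafe action rate, assumed per-trial: fraction with >= 1 unsafe action."""
    return sum(r["unsafe_actions"] > 0 for r in rows) / len(rows)

print(tsr(trials), uar(trials))  # 0.667, 0.667 on this toy data
```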
Dataset with 7,834 agent traces available: https://huggingface.co/datasets/benchflow/ClawsBench
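The traces can be pulled straight from the Hub with the `datasets` library. The dataset id comes from the link above; the splits and column names are whatever the release defines, so inspect them after loading rather than assuming a schema.

```python
from datasets import load_dataset

# Dataset id taken from the link above; print the result to see splits/columns.
traces = load_dataset("benchflow/ClawsBench")
print(traces)
```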
This is an automated message from the Librarian Bot. The following papers, recommended by the Semantic Scholar API, are similar to this paper:
- Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents (2026)
- SkillsBench: Benchmarking How Well Agent Skills Work Across Diverse Tasks (2026)
- Beyond Binary Correctness: Scaling Evaluation of Long-Horizon Agents on Subjective Enterprise Tasks (2026)
- EnterpriseOps-Gym: Environments and Evaluations for Stateful Agentic Planning and Tool Use in Enterprise Settings (2026)
- SKILLS: Structured Knowledge Injection for LLM-Driven Telecommunications Operations (2026)
- SkillCraft: Can LLM Agents Learn to Use Tools Skillfully? (2026)
- SWE-AGI: Benchmarking Specification-Driven Software Construction with MoonBit in the Era of Autonomous Agents (2026)