Agentic-MME: What Does Agentic Capability Really Bring to Multimodal Intelligence?
Abstract
A new benchmark evaluates multimodal agentic capabilities by verifying tool usage and process efficiency rather than just final answers, revealing significant challenges in real-world multimodal problem-solving.
Multimodal Large Language Models (MLLMs) are evolving from passive observers into active agents, solving problems through Visual Expansion (invoking visual tools) and Knowledge Expansion (open-web search). However, existing evaluations fall short: they lack flexible tool integration, test visual and search tools separately, and grade primarily by final answers. Consequently, they cannot verify whether tools were actually invoked, applied correctly, or used efficiently. To address this, we introduce Agentic-MME, a process-verified benchmark for Multimodal Agentic Capabilities. It contains 418 real-world tasks across 6 domains and 3 difficulty levels to evaluate capability synergy, featuring over 2,000 stepwise checkpoints that average 10+ person-hours of manual annotation per task. All tasks run in a unified evaluation framework supporting sandboxed code and APIs, and each task comes with a human reference trajectory annotated with stepwise checkpoints along a dual axis: the S-axis and the V-axis. To enable true process-level verification, we audit fine-grained intermediate states rather than just final answers, and we quantify efficiency via an overthinking metric computed relative to the human trajectories. Experimental results show that the best model, Gemini3-pro, achieves 56.3% overall accuracy, which drops sharply to 23.0% on Level-3 tasks, underscoring the difficulty of real-world multimodal agentic problem solving.
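The abstract does not spell out how checkpoint auditing or the overthinking metric are computed. The following is a minimal Python sketch of one plausible formulation; all names (`Checkpoint`, `Trajectory`, `score_trajectory`, `overthinking`) and both formulas are illustrative assumptions, not the authors' implementation.

```python
from dataclasses import dataclass

# Hypothetical sketch in the spirit of Agentic-MME's process-level verification.
# The checkpoint schema, verifier interface, and metric formulas below are
# assumptions; the paper defines its own protocol.

@dataclass
class Checkpoint:
    axis: str         # "S" or "V" (the dual-axis annotation from the abstract)
    description: str  # what the intermediate state must contain
    passed: bool = False

@dataclass
class Trajectory:
    steps: list[str]              # intermediate states / tool calls emitted by the agent
    checkpoints: list[Checkpoint] # stepwise checkpoints from the human reference

def score_trajectory(traj: Trajectory, verifier) -> float:
    """Fraction of stepwise checkpoints satisfied by any intermediate state.

    `verifier(step, checkpoint) -> bool` is an assumed black-box judge
    (a rule, a sandboxed check, or an LLM grader). Assumes at least one
    checkpoint per task.
    """
    for cp in traj.checkpoints:
        cp.passed = any(verifier(step, cp) for step in traj.steps)
    return sum(cp.passed for cp in traj.checkpoints) / len(traj.checkpoints)

def overthinking(model_steps: int, human_steps: int) -> float:
    """One plausible efficiency penalty: excess steps relative to the
    human reference trajectory (0.0 = as efficient as the human)."""
    return max(0.0, model_steps / human_steps - 1.0)
```

Under this formulation, an agent that takes 15 tool calls on a task whose human reference trajectory used 10 would receive an overthinking score of 0.5, regardless of whether its final answer is correct.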
Community
Insightful Work!
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- AgentVista: Evaluating Multimodal Agents in Ultra-Challenging Realistic Visual Scenarios (2026)
- VTC-Bench: Evaluating Agentic Multimodal Models via Compositional Visual Tool Chaining (2026)
- BrowseComp-V3: A Visual, Vertical, and Verifiable Benchmark for Multimodal Browsing Agents (2026)
- OmniGAIA: Towards Native Omni-Modal AI Agents (2026)
- ATP-Bench: Towards Agentic Tool Planning for MLLM Interleaved Generation (2026)
- MC-Search: Evaluating and Enhancing Multimodal Agentic Search with Structured Long Reasoning Chains (2026)
- VisBrowse-Bench: Benchmarking Visual-Native Search for Multimodal Browsing Agents (2026)
I appreciate that Agentic-MME upgrades evaluation by auditing intermediate states instead of only final answers. My concern is the heavy dependency on human-labeled step trajectories, which could bias what counts as a good strategy and hurt scalability to new domains. Would you share how robust the process verification is to annotation noise, and whether the overthinking metric stays meaningful when tool outputs are uncertain or noisy? The arXivLens breakdown helped me parse the method details, and it's a nice touch that it lines up with your AST-based tracer for normalizing diverse coding styles.
Get this paper in your agent:
hf papers read 2604.03016
Don't have the latest CLI?
curl -LsSf https://hf.co/cli/install.sh | bash