Structural Problems in AI Benchmarking and the Case for a Unified Evaluation Framework
1. Limitations of the Current AI Benchmark Ecosystem
As of March 2026, the AI model evaluation landscape suffers from several structural issues.
Benchmark Saturation. MMLU scores climbed from 70% in 2022 to over 90% by 2025, effectively eliminating discriminative power among frontier models. The top 10 models now fall within a 2% spread — statistically indistinguishable. The field has responded by migrating to harder benchmarks (GPQA Diamond, HLE, ARC-AGI-2), but each operates independently, fragmenting the picture of a model's "overall intelligence" across disconnected leaderboards.
Source Opacity. Most leaderboards publish provider self-reported scores without independent verification. During our cross-verification process, we uncovered significant discrepancies:
- Claude Opus 4.6 ARC-AGI-2: listed as 37.6% on some leaderboards → verified value 68.8% (confirmed by Vellum, NxCode, DataCamp — three independent sources)
- Gemini 3.1 Pro ARC-AGI-2: listed as 88.1% → actual 77.1% per Google DeepMind model card and arcprize.org
- GPT-5.3 Codex SWE-Pro: listed as 78.2% → confusion with SWE-Verified score. Actual SWE-Pro is 57.0%
These are not typos. They stem from structural issues: benchmark name confusion (SWE-bench Verified vs. SWE-bench Pro), version conflation, and missing attribution.
SWE-bench Verified Is Defective. OpenAI's own audit found that 59.4% of SWE-bench Verified tasks are flawed, with incomplete tests or tests that pass incorrect code. Despite this, most leaderboards still treat it as a primary coding metric, so a high SWE-Verified score does not reliably indicate coding ability. ALL Bench excludes SWE-Verified from composite scoring and substitutes LiveCodeBench (LCB), which is continuously updated and designed to be resistant to data contamination.
Modality Silos. LLMs, VLMs, image generators, and video generators are evaluated on entirely separate leaderboards. When an organization asks "which AI should we adopt?", there is no single reference point that compares text, vision, coding, and generation capabilities side by side.
2. Composite Score Design: 5-Axis Framework
Single-benchmark rankings reflect only one facet of a model. The GPQA leader may perform poorly at coding; the SWE-bench leader may struggle with abstract reasoning. As of March 2026, no single model ranks first on all axes simultaneously.
To address this, we designed a 5-Axis Intelligence Framework:
| Axis | Representative Benchmarks | What It Measures |
|---|---|---|
| Knowledge | MMLU-Pro (57K questions) | Graduate-level interdisciplinary knowledge |
| Expert Reasoning | GPQA Diamond · AIME · HLE | PhD-level scientific reasoning · Math olympiad · Expert-sourced final exam |
| Abstract Reasoning | ARC-AGI-2 | Novel pattern recognition absent from training data |
| Metacognition | FINAL Bench | Self-error recognition and correction ability |
| Execution | SWE-Pro · BFCL · IFEval · LCB | Code generation · Function calling · Instruction following · Live coding |
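The table above can be captured as a small configuration mapping; the grouping is taken directly from the table, while the variable names are ours for illustration.

```python
# 5-Axis Intelligence Framework: axis -> representative benchmarks (from the table).
AXES = {
    "Knowledge":          ["MMLU-Pro"],
    "Expert Reasoning":   ["GPQA Diamond", "AIME", "HLE"],
    "Abstract Reasoning": ["ARC-AGI-2"],
    "Metacognition":      ["FINAL Bench"],
    "Execution":          ["SWE-Pro", "BFCL", "IFEval", "LCB"],
}

# The ten benchmarks here are what feed the N/10 coverage term below.
ALL_BENCHMARKS = [b for benches in AXES.values() for b in benches]
```

Note that the five axes contribute ten benchmarks in total, which is where the denominator in the √(N/10) coverage term comes from.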
Composite score formula:
Score = Avg(verified benchmark scores) × √(N/10)
The √(N/10) factor is a coverage penalty, where N is the number of the 10 tracked benchmarks with verified scores. A model with only 3 of 10 benchmarks confirmed receives a √(3/10) ≈ 0.55× discount, preventing overvaluation of models that score high on a narrow set; a model with all 10 confirmed receives the full 1.0× multiplier.
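As a worked sketch, the coverage-penalized composite can be written in a few lines; the score lists below are illustrative values, not actual leaderboard numbers.

```python
import math

def composite_score(verified_scores, total_benchmarks=10):
    """Average of verified benchmark scores, discounted by coverage:
    Avg x sqrt(N / total_benchmarks), where N = len(verified_scores)."""
    n = len(verified_scores)
    if n == 0:
        return 0.0
    avg = sum(verified_scores) / n
    return avg * math.sqrt(n / total_benchmarks)

# Full coverage: all 10 benchmarks verified, so the multiplier is 1.0.
broad = composite_score([80.0] * 10)
# Narrow coverage: only 3 of 10 verified, so a sqrt(3/10) ~0.55x discount
# applies even though the raw average is high.
narrow = composite_score([90.0, 88.0, 92.0])
```

The penalty means a model averaging 90 on three benchmarks scores lower overall than a model averaging 80 across all ten, which is exactly the overvaluation the formula is designed to prevent.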
3. Confidence System: 3-Tier Cross-Verification
To address the source opacity problem, we introduced a 3-tier confidence system:
| Level | Criteria | Example |
|---|---|---|
| ✓✓ Cross-verified | 2+ independent sources | "Anthropic official + Vellum independent eval + DataCamp verification" |
| ✓ Single-source | One official or third-party source | "Google DeepMind model card" |
| ~ Self-reported | Provider claim only, no independent verification | "Self-reported benchmark" |
We tracked sources across all 42 LLMs × 12 benchmark columns. Key finding: flagship models (GPT-5.2, Claude Opus 4.6, Gemini 3.1 Pro) achieved ✓✓ status on most major scores, while several Korean Sovereign AI models remained at ~ status, highlighting the need for independent evaluation of emerging national AI projects.
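The tier criteria above reduce to a simple rule. This sketch assumes the input is a list of independent sources plus a flag for provider-only claims; that decomposition is our inference from the table, not a published algorithm.

```python
def confidence_level(independent_sources, provider_only=False):
    """Assign a confidence tier per the 3-tier cross-verification system."""
    if provider_only or not independent_sources:
        return "~"   # self-reported: provider claim only, no independent check
    if len(independent_sources) >= 2:
        return "✓✓"  # cross-verified: 2+ independent sources
    return "✓"       # single-source: one official or third-party source

tier = confidence_level(["Anthropic official", "Vellum", "DataCamp"])
```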
4. Metacognitive Benchmarking: A Blind Spot in Current Evaluation
The largest gap in existing benchmark frameworks is the absence of metacognition evaluation. A model "knowing the answer" and a model "recognizing it was wrong and correcting itself" are fundamentally different capabilities.
FINAL Bench was designed to measure this, evaluating self-correction ability across 100 tasks on 9 frontier models. Key findings:
- Error Recovery (ER) explains 94.8% of metacognitive performance variance. The core of metacognition is "correcting wrong answers," not "maintaining correct ones."
- Of 42 models in ALL Bench, only 9 have FINAL Bench scores. Metacognition is simply not measured in most evaluation frameworks.
- The spread between rank 1 Kimi K2.5 (68.71) and rank 9 Gemini (59.5) is 9.2 points — roughly 3× the spread among top models on GPQA (1–3 points), indicating far greater discriminative power.
This suggests metacognition may be the single largest differentiating factor among current frontier models.
5. Asymmetries Discovered in VLM Evaluation
Cross-verifying 11 flagship VLMs in v2.1 revealed several noteworthy asymmetries:
MMMU vs. MMMU-Pro Rank Reversal. Claude Opus 4.6 has no officially reported MMMU score, yet ranks #1 on MMMU-Pro at 85.1%. Gemini 3 Flash leads MMMU at 87.6% but drops to 80.0% on MMMU-Pro (rank ~4). MMMU-Pro filters out questions solvable by text-only models and expands choices from 4 to 10 — a hardened version. Models whose rankings diverge between the two benchmarks may differ in their reliance on visual information versus text-based reasoning shortcuts.
Flash > Pro Reversal. Gemini 3 Flash (87.6) narrowly outperforms Gemini 3 Pro (87.5) on MMMU. Flash models are generally assumed to be Pro's lightweight counterparts, but this assumption may not hold for multimodal understanding.
Open-Source Convergence. InternVL3-78B achieves MMMU 72.2, OCRBench 906/1000, and MMBench 89.0 — GPT-4o-class performance with open weights. Kimi-VL-A3B-Thinking achieves MathVista 80.1 with only 3B active parameters, a notable result in terms of parameter efficiency.
6. Data Availability and Reproducibility
All data is publicly released:
- HuggingFace Dataset — 6 configs (llm / vlm / agent / image / video / music) as JSONL + unified JSON (75KB)
- GitHub — Full source code + data + CITATION.cff
```python
from datasets import load_dataset

# Load the LLM and flagship-VLM configs from the Hugging Face dataset.
llm = load_dataset("FINAL-Bench/ALL-Bench-Leaderboard", "llm", split="train")
vlm = load_dataset("FINAL-Bench/ALL-Bench-Leaderboard", "vlm_flagship", split="train")
```
Confidence metadata is included in the unified JSON as per-model, per-benchmark {level, source} objects, enabling programmatic source tracing for any individual score.
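Programmatic source tracing over that structure might look like the following; the exact field names and nesting here are assumptions for illustration, not the documented schema of the released JSON (the example values are the verified Claude Opus 4.6 ARC-AGI-2 figures from Section 1).

```python
import json

# Hypothetical excerpt mirroring the described per-model, per-benchmark
# {level, source} objects; not the real released file.
unified = json.loads("""
{
  "llm": {
    "Claude Opus 4.6": {
      "ARC-AGI-2": {"score": 68.8, "level": "cross-verified",
                    "source": "Vellum + NxCode + DataCamp"}
    }
  }
}
""")

def trace(data, config, model, benchmark):
    """Return (score, confidence level, source) for a single leaderboard cell."""
    cell = data[config][model][benchmark]
    return cell["score"], cell["level"], cell["source"]

score, level, source = trace(unified, "llm", "Claude Opus 4.6", "ARC-AGI-2")
```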
7. Limitations and Future Work
Non-standardized evaluation conditions. The same benchmark can yield different scores depending on prompt format, token limits, and thinking mode settings. ALL Bench currently adopts the most widely reported configuration for each score, but full standardization has not been achieved.
Quantification gap in generative AI. Image, video, and music generation models are currently rated on an S/A/B/C scale. As quantitative benchmarks (GenAI-Bench, VBench) publish numerical scores, we plan to transition from qualitative to quantitative evaluation.
Coverage bias. English-language benchmarks dominate. Multilingual evaluation relies on MMMLU (57 languages) alone. Fine-grained assessment of individual languages — Korean, Japanese, Chinese, etc. — remains future work.
Links
- 🏆 ALL Bench Leaderboard
- 📦 ALL Bench Dataset
- 🧬 FINAL Bench Dataset · FINAL Bench Leaderboard
- ⚡ GitHub
#AIBenchmark #LLMEvaluation #FINALBench #Metacognition #VLM #ARC-AGI #SWE-bench #BenchmarkMethodology #OpenData #ReproducibleAI