Commit 12b944d (parent: 40ab31f): Update FINDINGS_PAPER.md with latest benchmarks and standardize DeepSeek-V3 ID

FINDINGS_PAPER.md (changed, +18 -12)
The hard task now spans three files (`crypto_service.py`, `config_loader.py`, `a…`).
| Model | Parameters | Specialization |
|-------|-----------|---------------|
| `deepseek-ai/DeepSeek-V3` | MoE | Code-specialized |
| `qwen/qwen-2.5-72b-instruct` | 72B | General + Code |
| `openai/gpt-4o-mini` | Small | Fast / General |
| `meta-llama/llama-3.3-70b-instruct` | 70B | General |
| `mistralai/mistral-small-3.1-24b` | 24B | General + Code |

All five models were evaluated on April 11, 2026 via the OpenRouter API using identical system prompts and temperature settings (temperature=0.2). Each model completed all three tasks (easy, medium, hard) in sequential runs, and the Telemetric Confidence score was recorded for each response.
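The evaluation setup above can be sketched as follows. This is a minimal illustration, not the paper's harness: the system-prompt text and the `build_request` helper are assumptions; only the model IDs, the three task tiers, and temperature=0.2 come from the text. The resulting payloads would be POSTed to an OpenRouter-compatible chat-completions endpoint.

```python
# Model IDs from the table above (OpenRouter naming).
MODELS = [
    "deepseek-ai/DeepSeek-V3",
    "qwen/qwen-2.5-72b-instruct",
    "openai/gpt-4o-mini",
    "meta-llama/llama-3.3-70b-instruct",
    "mistralai/mistral-small-3.1-24b",
]

# Placeholder: the actual system prompt is not given in this excerpt.
SYSTEM_PROMPT = "You are a bug-finding agent. Respond with JSON."

def build_request(model: str, task_prompt: str) -> dict:
    """Build one chat-completions payload; every model gets the
    identical system prompt and temperature=0.2, as stated above."""
    return {
        "model": model,
        "temperature": 0.2,
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": task_prompt},
        ],
    }

# Sequential runs: easy, medium, hard for each model.
payloads = [
    build_request(m, task)
    for m in MODELS
    for task in ("easy", "medium", "hard")
]
print(len(payloads))  # 5 models x 3 tasks = 15
```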
## 4. Results

### 4.1 Submission Benchmarks (HuggingFace)
| Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | **HuggingFace** | 0.667 | **0.999** | 0.564 | **0.743** | 97% |
| `Qwen/Qwen2.5-72B-Instruct` | **HuggingFace** | 0.200 | 0.588 | 0.286 | **0.358** | 95% |
| `meta-llama/Meta-Llama-3-8B-Instruct` | **HuggingFace** | 0.429 | 0.001 | 0.001 | **0.144** | 96% |

### 4.2 Post-Submission OpenRouter Benchmarks

| Native Model Identifier | Environment | Easy F1 | Medium F1 | Hard F1 | **Avg F1** | Avg Conf. |
| :---------------------- | :---------- | :------ | :-------- | :------ | :--------- | :-------- |
| `deepseek-ai/DeepSeek-V3` | **OpenRouter** | 0.750 | 0.667 | 0.720 | **0.712** | 92% |
| `openai/gpt-4o-mini` | **OpenRouter** | 0.833 | 0.667 | 0.581 | **0.694** | 90% |
| `meta-llama/llama-3.3-70b-instruct` | **OpenRouter** | 0.500 | 0.833 | 0.545 | **0.626** | 94% |
| `qwen/qwen-2.5-72b-instruct` | **OpenRouter** | 0.800 | 0.556 | 0.500 | **0.619** | 97% |
| `mistralai/mistral-small-3.1-24b` | **OpenRouter** | 0.001 | 0.001 | 0.999 | **0.334** | 100% |
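The Avg F1 column appears to be the unweighted mean of the three task scores, rounded to three decimals; a quick consistency check over the OpenRouter rows confirms this reading:

```python
# Each tuple: (model, easy, medium, hard, reported_avg) copied from the
# OpenRouter table above.
rows = [
    ("deepseek-ai/DeepSeek-V3", 0.750, 0.667, 0.720, 0.712),
    ("openai/gpt-4o-mini", 0.833, 0.667, 0.581, 0.694),
    ("meta-llama/llama-3.3-70b-instruct", 0.500, 0.833, 0.545, 0.626),
    ("qwen/qwen-2.5-72b-instruct", 0.800, 0.556, 0.500, 0.619),
    ("mistralai/mistral-small-3.1-24b", 0.001, 0.001, 0.999, 0.334),
]

for model, easy, medium, hard, reported in rows:
    avg = round((easy + medium + hard) / 3, 3)
    # Every reported Avg F1 matches the unweighted three-task mean.
    assert avg == reported, (model, avg, reported)
print("all rows consistent")
```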
### 4.3 Key Findings
**Finding 1: LLM "self-awareness" varies drastically (the Confidence Telemetry test).**

By enforcing a `confidence` field in the returned JSON, per the OpenEnv specification, we found that Llama-3.3-70B and Mistral-Small are dangerously overconfident. Llama-3.3 generated 19 "High-Confidence Wrong" bugs and suffered severe F1 penalties. DeepSeek-V3, by contrast, produced 8 High-Confidence Correct answers against only 1 High-Confidence Wrong.
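One way to compute the High-Confidence Correct/Wrong buckets described above is sketched below. The JSON schema, the 0.8 threshold, and the helper name are illustrative assumptions; this excerpt does not specify the environment's actual implementation.

```python
import json

def bucket_findings(raw_response: str, ground_truth: set, threshold: float = 0.8):
    """Count high-confidence correct vs. wrong bug reports.

    Assumes each model reply is JSON of the form
    {"bugs": [{"id": "...", "confidence": 0.0-1.0}, ...]};
    the schema and the 0.8 threshold are illustrative.
    """
    findings = json.loads(raw_response)["bugs"]
    hc_correct = hc_wrong = 0
    for bug in findings:
        if bug["confidence"] >= threshold:
            if bug["id"] in ground_truth:
                hc_correct += 1
            else:
                hc_wrong += 1
    return hc_correct, hc_wrong

# Hypothetical reply; bug IDs reuse the file names mentioned in the hard task.
reply = json.dumps({"bugs": [
    {"id": "crypto_service.py:42", "confidence": 0.95},
    {"id": "config_loader.py:7", "confidence": 0.91},
    {"id": "app.py:1", "confidence": 0.99},   # confidently wrong
    {"id": "app.py:2", "confidence": 0.30},   # low confidence, ignored
]})
truth = {"crypto_service.py:42", "config_loader.py:7"}
print(bucket_findings(reply, truth))  # (2, 1)
```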
**Finding 2: The hard task produces meaningful score variance.**

Hard-task scores range from 0.001 (Mistral) to 0.720 (DeepSeek), demonstrating that the environment genuinely differentiates model capability on complex multi-file, multi-domain bugs. No model achieves ceiling performance, confirming the task is appropriately challenging for frontier models.
**Finding 3: The false-positive penalty is highly impactful.**

Qwen-2.5-72B posted a strong easy-task score (0.800 on OpenRouter) but collapsed to 0.500 on hard. Analysis of the step logs shows that Qwen generated many false positives, diluting precision. The weighted F1 grader correctly penalized its score via the Telemetric Calibration Modifier.
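The exact Telemetric Calibration Modifier is not defined in this excerpt; the sketch below only shows the general shape of the grader described above: a precision/recall-based F1, so false positives dilute precision, scaled by a confidence-calibration factor. The modifier formula here (1 minus the gap between stated confidence and actual accuracy) is an illustrative stand-in.

```python
def weighted_f1(true_positives: int, false_positives: int, false_negatives: int,
                mean_confidence: float, accuracy: float) -> float:
    """F1 scaled by a calibration modifier.

    The modifier used here is an illustrative stand-in for the paper's
    Telemetric Calibration Modifier, which this excerpt does not define:
    the further a model's stated confidence sits from its actual
    accuracy, the more its F1 is discounted.
    """
    if true_positives == 0:
        return 0.0
    precision = true_positives / (true_positives + false_positives)
    recall = true_positives / (true_positives + false_negatives)
    f1 = 2 * precision * recall / (precision + recall)
    calibration = 1.0 - abs(mean_confidence - accuracy)  # overconfidence hurts
    return round(f1 * calibration, 3)

# Perfectly calibrated, no false positives:
print(weighted_f1(5, 0, 0, mean_confidence=0.9, accuracy=0.9))   # 1.0
# Many false positives dilute precision, and overconfidence compounds it:
print(weighted_f1(5, 5, 0, mean_confidence=0.97, accuracy=0.5))  # 0.353
```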