
code-review-env

Core environment package for Code Review OpenEnv.

Structure

env/
├── environment.py      # Reset / step loop
├── models.py           # Pydantic schemas
├── reward_engine.py    # Dense reward computation
├── state_manager.py    # Observation tracking
├── graders/            # Per-task deterministic graders
└── tasks/              # Task definitions (easy, medium, hard)
server.py               # FastAPI endpoints
inference.py            # Implementation engine (run from the repo root via ../inference.py)
tests/                  # Pytest suite (70 tests)

Endpoints

| Method | Path | Purpose |
|--------|------|---------|
| GET | /health | Health check |
| POST | /reset | Start a task (body: `{"task_id": "easy"}`) |
| POST | /step | Submit an action, get observation + reward |
| GET | /state | Debug current state |
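
As a sketch of the request flow (assuming the server from server.py is running locally on port 8000; only the `task_id` field comes from the table above, and the `/step` action payload is illustrative):

```python
import json
import urllib.request

BASE = "http://localhost:8000"  # assumed local server address

def reset_body(task_id: str) -> dict:
    """Request body for POST /reset, per the endpoint table."""
    return {"task_id": task_id}

def post(path: str, body: dict) -> dict:
    """POST a JSON body and decode the JSON response."""
    req = urllib.request.Request(
        BASE + path,
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# With a running server:
# obs = post("/reset", reset_body("easy"))
# result = post("/step", {"action": ...})  # the action schema lives in models.py
```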

Inference Modes

| Mode | Env Var | LLM Needed | Deterministic |
|------|---------|------------|---------------|
| Benchmark | REVIEW_STRATEGY=benchmark | No | Yes |
| LLM | REVIEW_STRATEGY=llm | Yes | No |
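
The mode is selected purely through the environment variable, e.g. from Python before starting the server (variable name and values taken from the table above):

```python
import os

# Deterministic benchmark mode: no LLM provider needed, reproducible output.
os.environ["REVIEW_STRATEGY"] = "benchmark"

# Switch to provider-backed review (non-deterministic, requires an LLM):
# os.environ["REVIEW_STRATEGY"] = "llm"
```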

Features: schema normalization, line clamping, early stop on complete findings, deterministic fallback on provider errors, confidence-calibration telemetry, red-herring traps, and adversarial-injection hooks.
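
Line clamping, for example, can be sketched as follows (a hypothetical helper, not the actual reward_engine.py code): any line number a model reports is forced into the valid range of the file under review.

```python
def clamp_line(line: int, num_lines: int) -> int:
    """Clamp a model-reported line number into the valid range [1, num_lines]."""
    return max(1, min(line, num_lines))

# Out-of-range findings are snapped to the nearest real line:
# clamp_line(500, num_lines=42) -> 42
```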

Benchmark Results (Latest)

For a complete breakdown, refer to benchmark_comparison.md in the repository root.

Hugging Face Native (Production Phase):

| Model | Environment | Avg F1 | Avg Conf |
|-------|-------------|--------|----------|
| deepseek-ai/DeepSeek-V3 | ✨ HuggingFace | 0.743 | 97% |
| Qwen/Qwen2.5-72B-Instruct | ✨ HuggingFace | 0.358 | 95% |
| meta-llama/Meta-Llama-3-8B-Instruct | ✨ HuggingFace | 0.144 | 96% |

OpenRouter (Final Validation):

| Model | Environment | Avg F1 | Avg Conf |
|-------|-------------|--------|----------|
| deepseek-ai/DeepSeek-V3 | 🚀 OpenRouter | 0.712 | 92% |
| openai/gpt-4o-mini | 🚀 OpenRouter | 0.694 | 90% |
| meta-llama/llama-3.3-70b-instruct | 🚀 OpenRouter | 0.626 | 94% |
| qwen/qwen-2.5-72b-instruct | 🚀 OpenRouter | 0.619 | 97% |

🧠 Performance Analysis: Why Models Succeed or Fail

  • DeepSeek-V3: Excels thanks to strong confidence calibration; it ignores adversarial red-herring traps and accurately links vulnerabilities across multiple files.
  • Qwen-2.5-72B: Strong at localized syntax checking, but weak at long-context, cross-file reasoning such as tracking variables between modules.
  • Llama-3.3-70B: Heavily penalized by the F1 grader for "overconfidence syndrome": reporting false positives at ~95% stated confidence.
  • Small models: Fail primarily due to JSON parsing collapse or hitting timeout limits while analyzing large source files.
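
The F1 column above is what makes overconfidence costly: every false positive lowers precision, and precision caps F1. A minimal sketch of a finding-level F1 (illustrative only, not the actual graders in graders/):

```python
def finding_f1(predicted: set[str], gold: set[str]) -> float:
    """Harmonic mean of precision and recall over exact-match findings."""
    if not predicted or not gold:
        return 0.0
    tp = len(predicted & gold)  # true positives: findings matching gold
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Two correct findings alone score 1.0; adding two confident false
# positives drops precision to 0.5 and F1 to ~0.67.
```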

Tests

python -m pytest tests -v   # 70 passed