# code-review-env

Core environment package for Code Review OpenEnv.
## Structure

```
env/
├── environment.py    # Reset / step loop
├── models.py         # Pydantic schemas
├── reward_engine.py  # Dense reward computation
├── state_manager.py  # Observation tracking
├── graders/          # Per-task deterministic graders
└── tasks/            # Task definitions (easy, medium, hard)
server.py             # FastAPI endpoints
inference.py          # Implementation engine (execute from ROOT via ../inference.py)
tests/                # Pytest suite (70 tests)
```
## Endpoints

| Method | Path | Purpose |
|---|---|---|
| GET | `/health` | Health check |
| POST | `/reset` | Start a task (body: `{"task_id": "easy"}`) |
| POST | `/step` | Submit an action, get observation + reward |
| GET | `/state` | Debug current state |
## Inference Modes

| Mode | Env Var | LLM Needed | Deterministic |
|---|---|---|---|
| Benchmark | `REVIEW_STRATEGY=benchmark` | No | Yes |
| LLM | `REVIEW_STRATEGY=llm` | Yes | No |
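Mode selection via the environment variable might be sketched as follows. This is an illustrative helper, not the project's actual selection logic (which lives in `inference.py`); the default of `benchmark` for unset or unrecognized values is an assumption.

```python
import os


def pick_strategy() -> str:
    """Read REVIEW_STRATEGY and return 'benchmark' or 'llm'.

    Assumption: unset or unrecognized values fall back to the
    deterministic benchmark mode.
    """
    strategy = os.environ.get("REVIEW_STRATEGY", "benchmark").strip().lower()
    return strategy if strategy in {"benchmark", "llm"} else "benchmark"
```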
Features:

- Schema normalization and line clamping
- Early stop on complete findings
- Deterministic fallback on provider errors
- Confidence calibration tracking (telemetry)
- Red herring traps and adversarial injection hooks
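Schema normalization and line clamping could look roughly like this. The field names (`file`, `line`, `severity`, `message`) are hypothetical; the real schemas are the Pydantic models in `env/models.py`.

```python
def normalize_finding(raw: dict, file_line_count: int) -> dict:
    """Coerce a model-emitted finding into a grader-friendly shape.

    Sketch only: field names are assumptions, not the project's schema.
    """
    try:
        line = int(raw.get("line", 1))
    except (TypeError, ValueError):
        line = 1
    return {
        "file": str(raw.get("file", "")),
        # Line clamping: keep the reported line inside the file's range.
        "line": max(1, min(line, file_line_count)),
        "severity": str(raw.get("severity", "info")).lower(),
        "message": str(raw.get("message", "")).strip(),
    }
```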
## Benchmark Results (Latest)

For a complete breakdown, refer to `benchmark_comparison.md` in the repository root.

**Hugging Face Native (Production Phase):**

| Model | Environment | Avg F1 | Avg Conf |
|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | ✨ HuggingFace | 0.743 | 97% |
| Qwen/Qwen2.5-72B-Instruct | ✨ HuggingFace | 0.358 | 95% |
| meta-llama/Meta-Llama-3-8B-Instruct | ✨ HuggingFace | 0.144 | 96% |
**OpenRouter (Final Validation):**

| Model | Environment | Avg F1 | Avg Conf |
|---|---|---|---|
| deepseek-ai/DeepSeek-V3 | OpenRouter | 0.712 | 92% |
| openai/gpt-4o-mini | OpenRouter | 0.694 | 90% |
| meta-llama/llama-3.3-70b-instruct | OpenRouter | 0.626 | 94% |
| qwen/qwen-2.5-72b-instruct | OpenRouter | 0.619 | 97% |
## 🧠 Performance Analysis: Why Models Succeed or Fail
- DeepSeek-V3: Excels due to near-perfect confidence calibration. It ignores adversarial red herring traps and accurately links vulnerabilities across multiple files.
- Qwen-2.5-72B: Strong at localized syntax checking but weak at long-context, cross-file reasoning, such as tracking variables between modules.
- Llama-3.3-70B: Severely penalized by the F1 grader for "overconfidence syndrome": guessing wildly at false positives with 95% certainty.
- Small models: Primarily fail due to JSON parsing collapse or timeout limits while analyzing heavy source files.
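The F1 scores above reward precision and recall jointly, which is why overconfident false positives are punished. A minimal sketch of how an F1 grader might compare predicted findings against ground truth, assuming findings are keyed by `(file, line)` pairs (the project's actual graders live in `env/graders/`):

```python
def f1_score(predicted: set, expected: set) -> float:
    """F1 over finding keys, e.g. (file, line) tuples.

    Sketch under assumptions; not the project's grading code.
    """
    if not predicted and not expected:
        return 1.0  # nothing to find, nothing reported
    tp = len(predicted & expected)  # true positives
    if tp == 0:
        return 0.0
    precision = tp / len(predicted)
    recall = tp / len(expected)
    return 2 * precision * recall / (precision + recall)
```

With this metric, a model that pads its answer with confident false positives drives precision (and thus F1) down even if recall is high.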
## Tests

```bash
python -m pytest tests -v  # 70 passed
```