# code-review-env Core environment package for Code Review OpenEnv. ## Structure ``` env/ ├── environment.py # Reset / step loop ├── models.py # Pydantic schemas ├── reward_engine.py # Dense reward computation ├── state_manager.py # Observation tracking ├── graders/ # Per-task deterministic graders └── tasks/ # Task definitions (easy, medium, hard) server.py # FastAPI endpoints inference.py # Implementation engine (Execute from ROOT via ../inference.py) tests/ # Pytest suite (70 tests) ``` ## Endpoints | Method | Path | Purpose | |--------|------|---------| | `GET` | `/health` | Health check | | `POST` | `/reset` | Start task (body: `{"task_id": "easy"}`) | | `POST` | `/step` | Submit action, get observation + reward | | `GET` | `/state` | Debug current state | ## Inference Modes | Mode | Env Var | LLM Needed | Deterministic | |------|---------|:---:|:---:| | Benchmark | `REVIEW_STRATEGY=benchmark` | No | Yes | | LLM | `REVIEW_STRATEGY=llm` | Yes | No | Features: schema normalization, line clamping, early-stop on complete findings, deterministic fallback on provider errors, telemetric confidence calibration tracking, red herring traps, adversarial injection hooks. ## Benchmark Results (Latest) For a complete breakdown, refer to `benchmark_comparison.md` in the repository root. **Hugging Face Native (Production Phase):** | Model | Environment | Avg F1 | Avg Conf | |---|---|---|---| | `deepseek-ai/DeepSeek-V3` | ✨ HuggingFace | **0.743** | 97% | | `Qwen/Qwen2.5-72B-Instruct` | ✨ HuggingFace | **0.358** | 95% | | `meta-llama/Meta-Llama-3-8B-Instruct` | ✨ HuggingFace | **0.144** | 96% | **OpenRouter (Final Validation):** | Model | Environment | Avg F1 | Avg Conf | |---|---|---|---| | `deepseek-ai/DeepSeek-V3` | 🚀 OpenRouter | **0.712** | 92% | | `openai/gpt-4o-mini` | 🚀 OpenRouter | **0.694** | 90% | | `meta-llama/llama-3.3-70b-instruct` | 🚀 OpenRouter | **0.626** | 94% | | `qwen/qwen-2.5-72b-instruct` | 🚀 OpenRouter | **0.619** | 97% | ### 🧠 Performance Analysis: Why Models Succeed or Fail - **DeepSeek-V3:** Excels due to perfect **confidence calibration**. It ignores adversarial red herring traps and accurately links vulnerabilities across multiple files. - **Qwen-2.5-72B:** Strong at localized syntax checking but weak at long-context, cross-file reasoning tracking variables between modules. - **Llama-3.3-70B:** Severely punished by the F1 grader for "overconfidence syndrome"—guessing wildly at false-positives with `95%` certainty. - **Small Models:** Primarily fail due to JSON parsing collapse or timeout limits while analyzing heavy source files. ## Tests ```bash python -m pytest tests -v # 70 passed ```