Spaces:

DeepParmar
/

code-review

Sleeping

App Files Files Community

code-review / ARCHITECTURE_BLUEPRINT.md

DeepParmar

Add detailed model performance reasoning across all benchmark documentation

40ab31f 5 days ago

preview code

raw

history blame contribute delete

14.6 kB

Code Review OpenEnv: Architecture Blueprint & Technical Documentation

This document serves as the exhaustive architectural reference, logic flow mapping, and operational blueprint for the Code Review OpenEnv system. It details the internal engine design, component-level workflows, robust fault-tolerance handling, strict mathematical boundary checks, and the testing validation infrastructure.

1. System Architecture Overview

The Code Review OpenEnv is designed as a highly cohesive but loosely coupled client-server architecture mimicking real-world software engineering environments.

Core Components

Component	File	Responsibility
FastAPI Server	`server.py`	Authoritative state machine. Exposes `POST /reset`, `POST /step`, `GET /state`
Environment Engine	`env/environment.py`	Central routing hub passing commands through evaluation
Reward Engine	`env/reward_engine.py`	The "heart" — precision/recall + semantic keyword scoring
State Manager	`env/state_manager.py`	Transactional memory: cumulative rewards, comments, step history
Graders	`env/graders/`	Per-task weighted F1 calculators with semantic keyword gates
Task Definitions	`env/tasks/`	Ground-truth bug definitions with `required_keywords`
Inference Client	`inference.py`	LLM orchestration, JSON extraction, token routing
Benchmark Runner	`benchmark_models.py`	Multi-model evaluation orchestrator
Data Models	`env/models.py`	Pydantic schemas for actions, observations, rewards, bugs

Directory Structure

code-reviewer/
├── server.py                    # FastAPI application entry point
├── inference.py                 # LLM inference runner
├── benchmark_models.py          # Multi-model benchmarking orchestrator
├── openenv.yaml                 # OpenEnv specification manifest
├── Dockerfile                   # Container build definition
├── FINDINGS_PAPER.md            # Academic findings paper
├── ARCHITECTURE_BLUEPRINT.md    # This file
├── code-review-env/
│   ├── env/
│   │   ├── environment.py       # Core environment engine
│   │   ├── reward_engine.py     # Shaped reward computation
│   │   ├── state_manager.py     # Episode state tracking
│   │   ├── models.py            # Pydantic data schemas
│   │   ├── graders/
│   │   │   ├── base_grader.py   # F1 math with semantic gates
│   │   │   ├── grader_easy.py   # Easy task grader
│   │   │   ├── grader_medium.py # Medium task grader
│   │   │   └── grader_hard.py   # Hard task grader
│   │   └── tasks/
│   │       ├── task_easy.py     # 3 runtime logic bugs
│   │       ├── task_medium.py   # 4 security vulnerabilities
│   │       └── task_hard.py     # 6 crypto/async bugs across 3 files + 1 red herring + 2 adversarial injections
│   └── tests/
│       ├── test_environment.py
│       ├── test_rewards.py
│       ├── test_graders.py
│       ├── test_advanced_cases.py
│       ├── test_comprehensive.py
│       ├── test_api.py
│       └── test_inference_helpers.py

2. Logic Flows & The Execution Lifecycle

The evaluation pipeline follows a deterministic state machine structure:

sequenceDiagram
    participant Client as Inference Client
    participant API as FastAPI Server
    participant Reward as Reward Engine
    participant State as State Manager
    participant Grader as Grader (F1)

    Client->>API: POST /reset {task_id: "hard"}
    API->>State: Initialize (running_score: 0.01)
    API-->>Client: Observation (code_diff, full_file, bugs metadata)

    loop Per Step (until done or max_steps)
        Client->>Client: LLM generates JSON action
        Client->>API: POST /step {operation: "add_comment", confidence: 95, ...}
        API->>Reward: compute(action, ground_truth)
        Reward->>Reward: Match bug proximity (±5 lines)
        Reward->>Reward: Check severity + category bonuses
        Reward->>Reward: Evaluate semantic keywords ("Why" metric)
        Reward->>State: Update cumulative score, bugs_found, false_positives
        API-->>Client: {reward: 0.25, done: false, observation: {...}}
    end

    Client->>API: POST /step {operation: "done"}
    API->>Grader: compute_weighted_f1(comments, ground_truth)
    Grader->>Grader: Check required_keywords per bug match
    Grader-->>API: Final F1 score (clamped 0.001–0.999)
    API-->>Client: {reward: final_score, done: true}

Step-by-Step Reward Computation

Line Matching: Agent's line_number is compared to all ground-truth bugs. Closest match within ±5 lines wins.
Red Herring Check: If the matched bug has is_red_herring=True, return -0.20 immediately.
Duplicate Check: If the bug line was already credited, return -0.05.
Base Reward: +0.15 for a correct proximity match.
Severity Bonus: +0.05 if agent's severity matches ground truth.
Category Bonus: +0.05 if agent's category matches ground truth.
Semantic "Why" Check: If the bug has explanation_tiers (hard task), evaluate against tier1/tier2/tier3. If required_keywords only, scan the agent's message for any keyword match. If none found, apply -0.10 penalty and do NOT register the bug as fully identified.
Confidence Calibration (Upgrade 1): If agent provides confidence (0-100), apply bonus/penalty based on calibration accuracy.
Explanation Tiering (Upgrade 2): Tier 3 match → +0.05 bonus, Tier 2 → no modifier, Tier 1 → -0.05 penalty, No match → -0.10 and not registered.

3. The Semantic "Why" Metric (Novel Contribution)

Traditional code review environments evaluate only what an agent flags. Our environment introduces a novel dimension: evaluating whether the agent understands why something is a bug.

How It Works

Each GroundTruthBug can optionally include a required_keywords list:

GroundTruthBug(
    line_number=27,
    severity="critical",
    category="security",
    description="Use of insecure ECB mode for AES encryption.",
    required_keywords=["ecb", "mode", "insecure", "cbc", "iv", "gcm"]
)

When an agent comments on this line, the reward engine scans the agent's message text for any of these keywords (case-insensitive). If the agent says "This line has a bug" without mentioning ECB, CBC, or any cipher-mode terminology, it receives only partial credit and the bug is not registered as found for final F1 scoring.

Impact on Scoring

Scenario	Step Reward	Bug Registered?
Correct line + correct severity + has keyword	+0.25	✅ Yes
Correct line + correct severity + missing keyword	+0.15	❌ No
Correct line + wrong severity + has keyword	+0.20	✅ Yes

This creates a meaningful capability gap between models that truly understand software engineering concepts and models that merely pattern-match line numbers.

4. Task Design Philosophy

Easy: List Processing (3 bugs)

Classic Python logic errors that any competent developer should catch. Tests basic code comprehension.

Medium: Web Handler Security (4 bugs)

Real-world OWASP-style vulnerabilities. Tests security awareness depth.

Hard: Async Cryptographic Service (6 bugs across 3 files + 1 red herring + 2 adversarial injections)

File 1 — crypto_service.py (main service, 4 bugs):

Parses YAML configs (Bug: yaml.load → yaml.safe_load)
Decrypts AES tokens (Bug: ECB mode instead of CBC/GCM)
Streams audit data (Bug: AsyncGenerator not closed)
Caches to global dict (Bug: Race condition without asyncio.Lock)
Retries network calls (Red Herring: except: pass inside a retry-backoff is intentional)
Adversarial Injection 1: # security-reviewed: AES encryption verified by security team, approved 2024 (above ECB bug)
Adversarial Injection 2: # thread-safe: uses atomic operations, no lock needed (above race condition bug)

File 2 — config_loader.py (1 bug):

Hardcoded fallback SECRET_KEY = "dev-secret-key-hardcoded" when env var missing

File 3 — audit_logger.py (1 bug):

Synchronous file write inside async function blocks event loop

The hard task is specifically designed so that even frontier 70B+ models score in the 0.056–0.084 range, revealing meaningful capability differences.

5. Strict Mathematical Boundary Compliance

OpenEnv validators demand all scores strictly between 0 and 1 (exclusive). Our defense-in-depth approach:

Layer	Mechanism	Bounds
F1 Graders	`max(0.001, min(0.999, round(f1, 4)))`	(0.001, 0.999)
Environment Step	`float(round(min(max(reward, 0.01), 0.99), 3))`	(0.01, 0.99)
State API (`/state`)	`max(0.001, min(0.999, cumulative_reward))`	(0.001, 0.999)
Inference Logs	`max(1e-6, min(score, 1 - 1e-6))` with `.3f` format	Never "0.000" or "1.000"
Empty State Init	`running_score: 0.01`	Never 0.0

6. Fault Handling & Error Resilience

HTTP 402 API Depletion

When the HF Router returns credit depletion mid-episode:

Exception is caught in inference.py
Agent auto-submits {"operation": "done"} gracefully
Episode completes with a valid, bounded score
No crash, no timeout, no validator failure

Malformed LLM Output

When the LLM generates conversational text instead of JSON:

Regex extractors locate {...} JSON clusters within the response
Markdown code fences are stripped automatically
Missing fields trigger -0.05 penalty (not a server crash)

Division-by-Zero Protection

Both F1 functions (compute_f1, compute_weighted_f1) handle:

Zero comments submitted → returns 0.001 (not 0.0)
Zero bugs found → returns 0.001 (not 0.0)

7. Multi-Model Benchmarking Infrastructure

The baseline inference script (inference.py) enables head-to-head comparisons:

# Primary evaluated models (via HuggingFace Router or OpenAI-compatible API)
MODELS = [
    "deepseek/deepseek-chat",            # DeepSeek-V3 (Highest Confidence calibration)
    "qwen/qwen-2.5-72b-instruct",        # Qwen 2.5 72B
    "openai/gpt-4o-mini",                # GPT-4o-Mini
    "meta-llama/llama-3.3-70b-instruct", # Llama 3.3 70B (Dangerously overconfident)
    "mistralai/mistral-small-3.1-24b-instruct" # Mistral Small
]

Features:

Progressive saving: Results written to benchmark_results.json after each model
Skip completed: Already-benchmarked models are skipped on re-run
Rate limit cooling: 15-second pause between models to respect API quotas
Timeout protection: 300-second subprocess timeout per model run

🏆 Benchmark Results Validation (Latest)

Hugging Face Native (Serverless Production)

Model	Environment	Fast F1	Env F1	Hard F1	Avg F1	Avg Conf.
`deepseek-ai/DeepSeek-V3`	✨ HuggingFace	0.667	0.999	0.564	0.743	97%
`Qwen/Qwen2.5-72B-Instruct`	✨ HuggingFace	0.200	0.588	0.286	0.358	95%
`meta-llama/Meta-Llama-3-8B-Instruct`	✨ HuggingFace	0.429	0.001	0.001	0.144	96%

OpenRouter (Stress Test Verification)

Model	Environment	Fast F1	Env F1	Hard F1	Avg F1	Avg Conf.
`deepseek-ai/DeepSeek-V3`	🚀 OpenRouter	0.750	0.667	0.720	0.712	92%
`openai/gpt-4o-mini`	🚀 OpenRouter	0.833	0.667	0.581	0.694	90%
`meta-llama/llama-3.3-70b-instruct`	🚀 OpenRouter	0.500	0.833	0.545	0.626	94%
`qwen/qwen-2.5-72b-instruct`	🚀 OpenRouter	0.800	0.556	0.500	0.619	97%

🧠 Performance Analysis: Why Models Succeed or Fail

Our deterministic grading environment captures architectural strengths and weaknesses not visible in standard multiple-choice tests:

🥇 DeepSeek-V3: Dominated because of superior confidence calibration and semantic reasoning. When faced with the adversarial "Red Herring" (try...except: pass inside a backoff loop), its confidence correctly evaluates below 80%, allowing it to bypass the trap without severe penalty. It correctly uses multi-step logic to deduce why code is conceptually flawed (Semantic 'Why' Metric), ensuring it gets full F1 credit.
🥈 Qwen-2.5-72B: Highly capable at identifying localized syntax/security errors in the Easy and Medium environments. However, it suffered in the Hard task due to limitations in long-context, cross-file repository reasoning. It failed to accurately trace _KEY_MATERIAL usage across distinct interdependent python files.
🥉 Llama-3.3-70B: Suffered mathematically due to overconfidence syndrome. The environment heavily penalizes false positives submitted with >80% confidence. Llama consistently flagged secure, valid code lines as "Critical Vulnerabilities" with 95%+ confidence, plummeting its F1 score mathematically. It often fell for the adversarial comment injections.
📉 Smaller/Local Models: Failed primarily due to JSON schema decomposition (outputting conversational text instead of strict operations) or reaching token boundaries during extraction.

8. Testing Infrastructure

66+ automated tests across 9 test files:

Test File	Coverage
`test_environment.py`	End-to-end episode lifecycle, state transitions
`test_rewards.py`	Positive/negative reward bounds, efficiency bonuses
`test_graders.py`	F1 computation, weighted scoring, boundary clamping
`test_advanced_cases.py`	Red herring penalties, semantic validation, API edge cases
`test_comprehensive.py`	Full multi-task episode simulations
`test_api.py`	FastAPI endpoint response codes, malformed input handling
`test_inference_helpers.py`	JSON extraction, format parsing
`test_performance_quality.py`	Latency budgets, endpoint stability, reward signal variance
`test_upgrades.py`	Confidence calibration, explanation tiering, injection resistance, multi-file review

All tests enforce the strict (0.01, 0.99) reward boundary, guaranteeing OpenEnv Phase 2 compliance regardless of agent behavior.