# Research Summary

**Project:** SQLEnv
**Change:** F003 — Dense Reward System (3-layer reward architecture)
**Date:** 2026-03-27
**Status:** Draft

---

## 1. Change Overview

### What We're Changing

Implement the 3-layer reward architecture in `server/reward.py`:

- **Layer 1 (Operational):** exec_ok +0.02, new_info +0.01 (capped at 0.10), repeat -0.01, step_cost -0.005
- **Layer 2 (Progress):** weighted average of cardinality matching + value overlap + numeric range proximity, binned to 5 levels, improvement-only, ×0.15
- **Layer 3 (Terminal):** +1.0 correct, 0.0 incorrect/timeout

Wire this into `step()` so non-terminal steps return meaningful reward signals.

### Why We're Changing It

Currently all non-terminal steps return `reward=None`, so agents get no learning signal until ANSWER. Dense per-step rewards give GRPO training the intermediate signal it needs to converge.

### Success Criteria

- Reward varies meaningfully: random exploration ~0.1, targeted queries ~0.3, correct answer ~1.3
- Anti-gaming: agents can't farm rewards by describing everything or repeating queries
- Progress signal coarsened (5 bins) to prevent reward hill-climbing
- Total step rewards clamped to [-0.2, +0.5]

---

## 2. System Context

### Current Behavior

- `server/reward.py` is a docstring-only stub — all reward logic needs to be built from scratch
- `step()` returns `reward=None` for DESCRIBE/SAMPLE/QUERY actions
- `_handle_answer()` returns 1.0 or 0.0 — the only reward signal
- `EpisodeContext` tracks `described_tables` (set) and `action_log` (list) but no reward accumulators

### Architecture Context

```
step(action)
├── DESCRIBE → _handle_describe() → result string
├── SAMPLE   → _handle_sample()   → result string
├── QUERY    → _handle_query()    → result string
└── ANSWER   → _handle_answer()   → (bool, reward)

After action execution (NEW):
reward.compute_step_reward(episode_ctx, action_type, query_rows, error)
├── Layer 1: operational signals
├── Layer 2: progress-to-target (QUERY only)
└── clamp to [-0.2, +0.5] running total
```

### Entry Points

| Entry Point | Trigger | Current Flow |
|-------------|---------|--------------|
| `step()` | Every agent action | Action dispatch → observation (reward=None) |
| `compute_step_reward()` | **To be created** — called from `step()` | Per-step reward from layers 1+2 |

### Data Flow

| Data | Source | Shape/Type | Destination |
|------|--------|------------|-------------|
| Action type + result | `step()` dispatch | `str`, `list[tuple]` | Layer 1 |
| Query result rows | `_execute_sql()` | `list[tuple]` | Layer 2 progress |
| Gold result rows | `_execute_gold_sql()` at reset | `list[tuple]` | Layer 2 reference — **must store in EpisodeContext** |
| Described tables | `EpisodeContext.described_tables` | `set[str]` | Layer 1 new_info |
| Query hashes | **Need to add** to EpisodeContext | `set[str]` | Layer 1 repeat detection |
| Best progress | **Need to add** to EpisodeContext | `float` | Layer 2 improvement tracking |
| Cumulative reward | **Need to add** to EpisodeContext | `float` | Clamping |

**Critical gap:** `EpisodeContext` stores `gold_answer` as a formatted string only, but Layer 2 needs the raw `list[tuple]` gold rows. Must add a `gold_rows: list[tuple]` field and populate it at `reset()`; a sketch of the extended context follows below.
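A minimal sketch of the extended context, assuming `EpisodeContext` is a dataclass. The existing fields are simplified for illustration; the added field names (`gold_rows`, `query_hashes`, `best_progress`, `new_info_total`, `cumulative_reward`) are this document's proposals, not existing code, and `new_info_total` is an extra accumulator implied by the 0.10 new_info cap:

```python
from dataclasses import dataclass, field


@dataclass
class EpisodeContext:
    # Existing fields (models.py:135), simplified for illustration
    described_tables: set[str] = field(default_factory=set)
    action_log: list[str] = field(default_factory=list)
    gold_answer: str = ""  # formatted string only; unusable for Layer 2

    # Proposed F003 additions
    gold_rows: list[tuple] = field(default_factory=list)  # raw rows, cached at reset()
    query_hashes: set[str] = field(default_factory=set)   # Layer 1 repeat detection
    best_progress: float = 0.0       # Layer 2 improvement-only tracking
    new_info_total: float = 0.0      # enforces the 0.10 new_info cap
    cumulative_reward: float = 0.0   # for [-0.2, +0.5] clamping
```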
---

## 3. Dependencies

### Code We Depend On

| Dependency | What We Use | Risk if Changed |
|------------|-------------|-----------------|
| `models.py:EpisodeContext` | Episode state — needs new fields | Must add reward tracking fields |
| `sql_environment.py:_execute_sql()` | Returns `list[tuple]` for QUERY | Need raw rows passed to reward |
| `sql_environment.py:_execute_gold_sql()` | Returns `list[tuple]` at reset | Already returns raw rows — just store them |
| F002 (verifier.py) | Terminal correctness | Being built in parallel — Layer 3 can use a naive check initially |

### Code That Depends On Us

| Dependent | How They Use Us | Impact of Our Change |
|-----------|-----------------|----------------------|
| `sql_environment.py:step()` | Calls `compute_step_reward()` | Must integrate into step flow |
| F006 (GRPO Training) | `reward_funcs` for TRL trainer | Components exposed as separate functions |
| `tests/test_smoke.py` | Asserts `reward=None` for non-ANSWER | **Will break** — tests need updating |

---

## 4. Risks & Edge Cases

### Identified Risks

| Risk | Likelihood | Impact | Mitigation |
|------|------------|--------|------------|
| Reward hacking via progress signal | Medium | Agent exploits shaping | Coarsen to 5 bins, cap step rewards, small magnitudes |
| Test breakage | High | 25 existing tests | Update test assertions for non-None rewards |
| Gold rows unavailable | Low | Layer 2 can't compute | Fallback: Layer 1 only |

### Edge Cases to Handle

| Edge Case | Current Behavior | Required Behavior |
|-----------|------------------|-------------------|
| QUERY returns empty result | reward=None | Layer 1: exec_ok (+0.02); Layer 2: cardinality=0 |
| QUERY fails with SQL error | reward=None, error set | Layer 1: step_cost only (-0.005) |
| DESCRIBE same table twice | reward=None | Repeat penalty (-0.01), no new_info |
| Gold answer is empty | reward=None | Skip Layer 2, Layer 1 only |
| Budget exhausted without ANSWER | reward=0.0 | Terminal: 0.0 + clamped step rewards |

### Invariants to Preserve

- [ ] Terminal correctness always dominates — correct answer ≥ 1.0
- [ ] Step rewards clamped to [-0.2, +0.5] total
- [ ] Reward is deterministic given the same episode state

---

## 4b. Code Shape & Design Target

### Existing Vocabulary

| Concept | Existing Name | Location |
|---------|---------------|----------|
| Episode state | `EpisodeContext` | `models.py:135` |
| Described tables | `described_tables: set[str]` | `models.py:143` |
| Action log | `action_log: list[str]` | `models.py:144` |

### Target Shape

| Component | Purpose | Why This Boundary |
|-----------|---------|-------------------|
| `compute_step_reward(ctx, action_type, rows, error)` | Main entry | Single public entry for `step()` |
| `_layer1_operational(ctx, action_type, sql, rows, error)` | Operational signals | Stateless except episode tracking |
| `_layer2_progress(ctx, rows)` | Progress-to-target (QUERY only) | Needs gold_rows comparison |
| `_cardinality_score(pred_rows, gold_rows)` | Row count comparison | Tier 1 metric |
| `_value_overlap_score(pred_rows, gold_rows)` | Jaccard set overlap | Tier 1 metric |
| `_numeric_range_score(pred_rows, gold_rows)` | Log-distance for numbers | Tier 1 metric |
| `_bin_progress(raw_score)` | Bin to {0, 0.25, 0.5, 0.75, 1.0} | Anti-gaming |

### Abstraction Level

- **Current level:** Flat — server modules with plain functions
- **Recommendation:** Match flat style. `server/reward.py` with plain functions; a sketch of the Layer 2 helpers follows below.
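To make the target shape concrete, here is a minimal sketch of the Layer 2 helpers, reusing the `gold_rows` and `best_progress` fields proposed above. The signatures, the 0.25/0.50/0.25 weights, and the 5-level binning come from this document; the exact distance formulas and the mapping of weights to metrics are illustrative assumptions pending `reward_design.md`:

```python
import math


def _cardinality_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Row-count similarity: 1.0 on exact match, shrinking with the gap."""
    p, g = len(pred_rows), len(gold_rows)
    if g == 0:
        return 0.0  # empty gold is handled upstream (Layer 1 only)
    return min(p, g) / max(p, g)


def _value_overlap_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Jaccard overlap between the sets of cell values."""
    pred = {v for row in pred_rows for v in row}
    gold = {v for row in gold_rows for v in row}
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def _numeric_range_score(pred_rows: list[tuple], gold_rows: list[tuple]) -> float:
    """Proximity of numeric cell means on a log scale (illustrative formula)."""
    def mean_of_numbers(rows: list[tuple]) -> float | None:
        nums = [v for row in rows for v in row
                if isinstance(v, (int, float)) and not isinstance(v, bool)]
        return sum(nums) / len(nums) if nums else None

    p, g = mean_of_numbers(pred_rows), mean_of_numbers(gold_rows)
    if p is None or g is None:
        return 0.0
    return 1.0 / (1.0 + abs(math.log1p(abs(p)) - math.log1p(abs(g))))


def _bin_progress(raw_score: float) -> float:
    """Coarsen to {0, 0.25, 0.5, 0.75, 1.0} so agents can't hill-climb."""
    return round(min(max(raw_score, 0.0), 1.0) * 4) / 4


def _layer2_progress(ctx, rows: list[tuple]) -> float:
    """Improvement-only progress reward (Method 1 weights 0.25/0.50/0.25)."""
    if not ctx.gold_rows:
        return 0.0  # no gold reference: Layer 1 only
    raw = (0.25 * _cardinality_score(rows, ctx.gold_rows)
           + 0.50 * _value_overlap_score(rows, ctx.gold_rows)
           + 0.25 * _numeric_range_score(rows, ctx.gold_rows))
    binned = _bin_progress(raw)
    if binned <= ctx.best_progress:
        return 0.0  # reward only a new best, never re-reward the same level
    delta = binned - ctx.best_progress
    ctx.best_progress = binned
    return 0.15 * delta
```

Scaling ×0.15 by the improvement delta, rather than by the absolute bin level, is one reading of "improvement-only"; `reward_design.md` should settle which is intended.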
### Anti-Patterns to Avoid

- Don't create reward strategy classes
- Don't add row-wise best matching initially (add it only if training shows the need)
- Don't import numpy/scipy — pure Python only
- Don't re-execute the gold SQL per step — cache the rows at `reset()`

---

## 5. Constraints

### Technical Constraints

| Constraint | Requirement | Notes |
|------------|-------------|-------|
| Performance | < 5 ms per reward computation | ~15 calls per episode |
| No heavy deps | Pure Python | No numpy/scipy |
| Deterministic | Same inputs → same reward | Required for reproducible training |

A sketch of a clamped, deterministic entry point that fits these constraints appears at the end of this summary.

### Testing Constraints

| Test Suite | Coverage Area | Notes |
|------------|---------------|-------|
| `tests/test_smoke.py` | 25 tests, some assert `reward=None` | Must update for non-None step rewards |

---

## 6. Open Questions

| Question | Why It Matters | Who Can Answer |
|----------|----------------|----------------|
| Layer 2 combination: weighted average (0.25/0.50/0.25) or adaptive? | Affects reward quality | Default: Method 1 per `reward_design.md` |
| Store `gold_rows` in EpisodeContext or a separate cache? | Design coupling | Recommend an EpisodeContext field |

---

## 7. Context Sources

| Source | Type | Notes |
|--------|------|-------|
| `server/reward.py` | Code (stub) | Docstring describes the 3-layer architecture |
| `server/sql_environment.py` | Code | `step()` flow, `_execute_sql()` |
| `models.py:EpisodeContext` | Code | Needs new reward-tracking fields |
| `docs_draft/SQLEnv_Concept_v1.md` Section 3 | Doc | Complete reward spec |
| `docs_draft/reward_design.md` | Doc | Distance metrics, combination methods |
| `docs_draft/reward-research_gpt-5-2.md` | Doc | Reward research |
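Finally, a hedged sketch of the public entry point, using the signature from the Section 2 architecture diagram, the context fields proposed earlier, and `_layer2_progress` from the Section 4b sketch. The repeat-detection key (a hash of the latest `action_log` entry) and the new_info cap accounting are illustrative assumptions; ANSWER is assumed to stay in `_handle_answer()` (Layer 3):

```python
import hashlib

# Layer 1 constants from Section 1 of this summary
EXEC_OK, NEW_INFO, REPEAT, STEP_COST = 0.02, 0.01, -0.01, -0.005
NEW_INFO_CAP = 0.10
CLAMP_LO, CLAMP_HI = -0.2, 0.5


def compute_step_reward(ctx, action_type: str,
                        query_rows: list[tuple] | None,
                        error: str | None) -> float:
    """Per-step reward: Layer 1 for every action, Layer 2 for successful QUERY."""
    reward = STEP_COST
    if error is not None:
        return _clamp(ctx, reward)  # failed action: step cost only
    reward += EXEC_OK
    # Repeat detection keyed on the latest action text (assumes step()
    # appends the full action to ctx.action_log before calling us).
    key = hashlib.sha256(ctx.action_log[-1].encode()).hexdigest()
    if key in ctx.query_hashes:
        reward += REPEAT            # exact repeat: penalty, no new_info
    else:
        ctx.query_hashes.add(key)
        if ctx.new_info_total + NEW_INFO <= NEW_INFO_CAP:
            reward += NEW_INFO
            ctx.new_info_total += NEW_INFO
    if action_type == "QUERY":
        reward += _layer2_progress(ctx, query_rows or [])
    return _clamp(ctx, reward)


def _clamp(ctx, reward: float) -> float:
    """Clamp the episode's running step-reward total to [-0.2, +0.5]."""
    total = max(CLAMP_LO, min(CLAMP_HI, ctx.cumulative_reward + reward))
    delta = total - ctx.cumulative_reward
    ctx.cumulative_reward = total
    return delta
```

Returning the clamped delta preserves the invariant that total step rewards stay within [-0.2, +0.5], and each step's reward remains deterministic given the same episode state.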