# System Behavior: sql-environment

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F003 on 2026-03-28

---

## Added

### Environment accepts structured actions
<!-- since: F001 -->

The environment accepts four structured action types via `POST /step`: DESCRIBE, SAMPLE, QUERY, and ANSWER. Each action carries an `argument` field holding the table name, SQL string, or answer value. The environment executes the action directly, without calling an external LLM.
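A request body for each action type might look like the following sketch. Only the `argument` field name is stated in this spec; the `action_type` key and the example values are illustrative assumptions:

```python
import json

# Hypothetical request bodies for the four action types.
# The `action_type` key name and the sample values are assumptions;
# only the `argument` field is confirmed by the spec.
describe = {"action_type": "DESCRIBE", "argument": "singer"}
sample = {"action_type": "SAMPLE", "argument": "singer"}
query = {"action_type": "QUERY", "argument": "SELECT COUNT(*) FROM singer"}
answer = {"action_type": "ANSWER", "argument": "42"}

payload = json.dumps(query)  # what a client might POST to /step
```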

### DESCRIBE returns column schema from live database
<!-- since: F001 -->

When an agent sends DESCRIBE with a table name, the environment returns column names, types, and row count queried from the actual SQLite database. If the table does not exist, the environment returns an error listing all available tables.
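A minimal sketch of this handler, assuming plain `sqlite3` access (the function name `describe_table` and the return shape are hypothetical):

```python
import sqlite3

def describe_table(conn, table):
    """Return column names/types and row count, or an error listing tables."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    if table not in tables:
        return {"error": f"no such table {table!r}; available: {tables}"}
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk)
    cols = [(r[1], r[2]) for r in conn.execute(f"PRAGMA table_info({table})")]
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"columns": cols, "row_count": count}

# Throwaway in-memory demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
conn.execute("INSERT INTO singer VALUES ('A', 30)")
info = describe_table(conn, "singer")
```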

### SAMPLE returns rows from live database
<!-- since: F001 -->

When an agent sends SAMPLE with a table name, the environment executes `SELECT * FROM table LIMIT 5` against the SQLite database and returns formatted rows. If the table does not exist, the environment returns an error listing available tables.
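The SAMPLE handler is nearly identical; a sketch (function name and return shape are assumptions):

```python
import sqlite3

def sample_table(conn, table, limit=5):
    """Return up to `limit` rows, or an error listing available tables."""
    tables = {r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")}
    if table not in tables:
        return {"error": f"no such table {table!r}; available: {sorted(tables)}"}
    rows = conn.execute(f"SELECT * FROM {table} LIMIT {limit}").fetchall()
    return {"rows": rows}

# Throwaway in-memory demo
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])
out = sample_table(conn, "t")
```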

### QUERY executes SQL against live database
<!-- since: F001 -->

When an agent sends QUERY with a SQL string, the environment validates that the query is a SELECT statement, executes it against the read-only SQLite database with a 5-second timeout, and returns formatted results truncated to 20 rows. Non-SELECT queries produce a clear rejection message. Syntax errors and timeouts produce descriptive error messages.

### ANSWER compares agent response to gold answer
<!-- since: F001 -->

When an agent sends ANSWER with a value, the environment compares it to the pre-computed gold answer, sets the episode as done, and returns a reward of 1.0 (correct) or 0.0 (incorrect).

### Reset produces a random question with hidden schema
<!-- since: F001 -->

Calling `POST /reset` selects a random question from the Spider dataset, opens a read-only SQLite database, and returns an observation containing the question text and table names only. Column details are hidden until the agent DESCRIBEs individual tables.

### Environment enforces a 15-step budget
<!-- since: F001 -->

Each DESCRIBE, SAMPLE, or QUERY action decrements the step budget. When the budget reaches zero, the episode ends with `done=True` and `reward=0.0`. ANSWER actions do not consume budget.

### Observations carry rich structured fields
<!-- since: F001 -->

Every observation returned by reset or step includes: `question`, `schema_info`, `result`, `error`, `step_count`, `budget_remaining`, and `action_history`. These replace the previous messages-only format.
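The observation shape could be sketched as a dataclass; the field names come from this spec, but the types and defaults are assumptions:

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class SQLObservation:
    """Sketch of the rich observation; types/defaults are assumptions."""
    question: str
    schema_info: Optional[str] = None
    result: Optional[str] = None
    error: Optional[str] = None
    step_count: int = 0
    budget_remaining: int = 15  # matches the 15-step budget above
    action_history: list = field(default_factory=list)

obs = SQLObservation(question="How many singers are there?")
```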

### Type-aware answer verification
<!-- since: F002 | test: tests/test_verifier.py::test_verify_answer_integer -->

The environment accepts agent answers that match the gold answer after type-aware comparison. Integer answers are coerced (`"42"` matches `42`), float answers allow 1% relative tolerance (`95000.1` matches `95000`), and list answers are compared order-insensitively (`"A, B"` matches `"B, A"`).
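A minimal sketch of this dispatch, including the string fallback for unknown types (function name and exact normalization are assumptions; the spec's `gold_rows` handling for structured lists is omitted here):

```python
import math

def verify_answer(agent, gold, answer_type=None):
    """Type-aware comparison with a normalized-string fallback."""
    try:
        if answer_type == "integer":
            return int(str(agent).strip()) == int(gold)
        if answer_type == "float":
            # 1% relative tolerance
            return math.isclose(float(str(agent).strip()), float(gold),
                                rel_tol=0.01)
        if answer_type == "list":
            # order-insensitive comparison of comma-separated items
            split = lambda v: sorted(p.strip().lower()
                                     for p in str(v).split(","))
            return split(agent) == split(gold)
    except ValueError:
        return False  # e.g. non-numeric text for an integer question
    # Unknown/missing type: case-insensitive, whitespace-normalized
    norm = lambda v: " ".join(str(v).lower().split())
    return norm(agent) == norm(gold)
```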

### Fallback string comparison for unknown answer types
<!-- since: F002 | test: tests/test_verifier.py::test_verify_answer_fallback -->

When question metadata has no `answer_type` (or an unknown type), answer verification falls back to case-insensitive, whitespace-normalized string comparison for backward compatibility.

### Dense step rewards for exploration actions
<!-- since: F003 | test: tests/test_reward.py::test_compute_step_reward_query -->

The environment returns a numeric reward on every non-terminal step (DESCRIBE, SAMPLE, QUERY). Previously these steps returned no reward signal. Reward reflects operational quality (successful execution, new schema discovery) and, for QUERY actions, progress toward the gold answer.

### Repeat query penalty
<!-- since: F003 | test: tests/test_reward.py::test_repeat_penalty -->

The environment penalizes an agent that submits the same SQL query more than once within an episode. The penalty is small (-0.01) but discourages reward farming through repetition.
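The penalty could be tracked with a per-episode set of normalized queries; a sketch (names and the whitespace normalization are assumptions, the -0.01 value is from the spec):

```python
REPEAT_PENALTY = -0.01

def repeat_penalty(sql, seen):
    """Return -0.01 for a resubmitted query, 0.0 for a new one."""
    key = " ".join(sql.split()).lower()  # normalize whitespace and case
    if key in seen:
        return REPEAT_PENALTY
    seen.add(key)
    return 0.0

seen = set()  # reset at the start of each episode
first = repeat_penalty("SELECT * FROM singer", seen)
second = repeat_penalty("select *  from singer", seen)  # same query, reworded
```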

### New-info discovery reward
<!-- since: F003 | test: tests/test_reward.py::test_new_info_reward -->

The environment rewards an agent for discovering new schema information. The cumulative new-info reward is capped at 0.10 per episode to prevent farming.

### Progress-to-answer signal for QUERY actions
<!-- since: F003 | test: tests/test_reward.py::test_layer2_progress -->

When an agent issues a QUERY, the environment compares result rows against the gold answer using cardinality, value overlap, and numeric proximity metrics. The agent receives a reward proportional to improvement over its previous best progress, coarsened to 5 bins to prevent hill-climbing.
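The improvement-over-best accounting with 5-bin coarsening might look like this sketch (the bin mapping and function names are assumptions; the overlap/proximity metrics themselves are not shown):

```python
def coarsen(progress, bins=5):
    """Map a raw progress score in [0, 1] to the floor of `bins` levels."""
    progress = min(max(progress, 0.0), 1.0)
    return int(progress * bins) / bins  # e.g. 0.47 -> 0.4 with 5 bins

def progress_reward(raw, best_so_far):
    """Reward only improvement over the episode's best binned progress."""
    binned = coarsen(raw)
    reward = max(0.0, binned - best_so_far)
    return reward, max(binned, best_so_far)

# A query that improves binned progress from 0.2 to 0.4 earns 0.2;
# a tiny wiggle within the same bin earns nothing (anti-hill-climbing).
r1, best = progress_reward(0.47, 0.2)
r2, best = progress_reward(0.49, best)
```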

### Cumulative step reward clamping
<!-- since: F003 | test: tests/test_reward.py::test_clamping -->

The environment clamps cumulative step rewards to the range [-0.2, +0.5]. This ensures terminal correctness (Layer 3: +1.0 or 0.0) always dominates total episode reward.
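The clamp itself is simple; a sketch (the bounds are from the spec, the names are assumptions):

```python
CLAMP_MIN, CLAMP_MAX = -0.2, 0.5

def clamp_cumulative(total):
    """Keep cumulative step reward inside [-0.2, +0.5] so the terminal
    correctness reward (+1.0 or 0.0) always dominates the episode total."""
    return max(CLAMP_MIN, min(CLAMP_MAX, total))
```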

## Modified

### reset() now loads a question and opens a database
<!-- since: F001 | previously: initial -->

**Before:** reset() cleared message history and returned an observation containing only the system prompt as a chat message.
**After:** reset() selects a random Spider question, opens a read-only SQLite connection to the corresponding database, computes the gold answer, and returns a rich observation with the question text and available table names.

### step() now executes actions deterministically
<!-- since: F001 | previously: initial -->

**Before:** step() dispatched to Ollama for table selection (DESCRIBE/SAMPLE) and SQL generation (QUERY). No SQL was ever executed against a database.
**After:** step() reads action_type and argument directly from the agent's structured action and executes against a live SQLite database. No external LLM is involved.

### SQLAction uses argument field instead of action_description
<!-- since: F001 | previously: initial -->

**Before:** SQLAction carried `action_description` (free-text NL description) and `tokens` (torch.Tensor).
**After:** SQLAction carries `argument` (structured value: table name, SQL, or answer). The tokens field is removed.

### SQLObservation returns structured fields instead of chat messages
<!-- since: F001 | previously: initial -->

**Before:** SQLObservation contained `messages` (list of chat messages) and `tokens` (flattened tensor).
**After:** SQLObservation contains `question`, `schema_info`, `result`, `error`, `step_count`, `budget_remaining`, and `action_history`. The messages and tokens fields are removed.

### Answer correctness determination
<!-- since: F002 | previously: F001 | test: tests/test_verifier.py -->

**Before:** ANSWER correctness was based on lowercased/trimmed string equality only, so semantically correct numeric/list answers could be rejected.
**After:** ANSWER correctness dispatches through a type-aware verifier (`integer`, `float`, `string`, `list`) with `gold_rows` support for structured list comparison.

### Non-terminal step reward value
<!-- since: F003 | previously: F001 | test: tests/test_smoke.py -->

**Before:** Non-terminal steps (DESCRIBE, SAMPLE, QUERY) produced a reward of `None` in observations.
**After:** Non-terminal steps produce a numeric float reward reflecting operational and progress signals. Consumers should use `done` (not `reward is None`) to detect terminality.