System Behavior: sql-environment
Living document. Updated by /archive-spec when features are completed. Last archived: F003 on 2026-03-28
Added
Environment accepts structured actions
The environment accepts four structured action types via POST /step: DESCRIBE, SAMPLE, QUERY, and ANSWER. Each action carries an argument field containing the table name, SQL string, or answer value. The environment executes the action directly without calling an external LLM.
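A minimal sketch of the four payload shapes, assuming a plain JSON body for POST /step. The `action_type`/`argument` field names follow this spec; the exact envelope and example values are illustrative:

```python
# Hypothetical POST /step bodies, one per action type.
# action_type/argument are the spec's field names; values are examples.
describe = {"action_type": "DESCRIBE", "argument": "concert"}
sample = {"action_type": "SAMPLE", "argument": "concert"}
query = {"action_type": "QUERY", "argument": "SELECT COUNT(*) FROM concert"}
answer = {"action_type": "ANSWER", "argument": "42"}

actions = [describe, sample, query, answer]
```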
DESCRIBE returns column schema from live database
When an agent sends DESCRIBE with a table name, the environment returns column names, types, and row count queried from the actual SQLite database. If the table does not exist, the environment returns an error listing all available tables.
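The behavior above can be sketched with SQLite's own introspection facilities (`sqlite_master` and `PRAGMA table_info`); the helper name and return shape are assumptions, not the environment's real API:

```python
import sqlite3

def describe_table(conn: sqlite3.Connection, table: str) -> dict:
    """Sketch of DESCRIBE: columns, types, and row count from the live DB,
    or an error listing available tables when the table is unknown."""
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    if table not in tables:
        return {"error": f"unknown table {table!r}; available: {tables}"}
    # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
    cols = conn.execute(f"PRAGMA table_info({table})").fetchall()
    count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    return {"columns": [(c[1], c[2]) for c in cols], "row_count": count}
```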
SAMPLE returns rows from live database
When an agent sends SAMPLE with a table name, the environment executes SELECT * FROM table LIMIT 5 against the SQLite database and returns formatted rows. If the table does not exist, the environment returns an error listing available tables.
QUERY executes SQL against live database
When an agent sends QUERY with a SQL string, the environment validates that the query is a SELECT statement, executes it against the read-only SQLite database with a 5-second timeout, and returns formatted results truncated to 20 rows. Non-SELECT queries produce a clear rejection message. Syntax errors and timeouts produce descriptive error messages.
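A sketch of the QUERY path: SELECT-only validation, row truncation, and a wall-clock timeout implemented here via SQLite's progress handler (the real timeout mechanism, helper name, and return shape are assumptions):

```python
import sqlite3
import time

def run_query(conn: sqlite3.Connection, sql: str,
              timeout_s: float = 5.0, max_rows: int = 20) -> dict:
    """Sketch of QUERY: reject non-SELECT, run with a timeout, truncate rows."""
    if not sql.lstrip().lower().startswith("select"):
        return {"error": "only SELECT statements are allowed"}
    deadline = time.monotonic() + timeout_s
    # Returning truthy from the progress handler aborts the statement
    # with an OperationalError ("interrupted").
    conn.set_progress_handler(lambda: time.monotonic() > deadline, 1000)
    try:
        rows = conn.execute(sql).fetchmany(max_rows + 1)
    except sqlite3.Error as e:
        return {"error": str(e)}
    finally:
        conn.set_progress_handler(None, 0)
    return {"rows": rows[:max_rows], "truncated": len(rows) > max_rows}
```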
ANSWER compares agent response to gold answer
When an agent sends ANSWER with a value, the environment compares it to the pre-computed gold answer, sets the episode as done, and returns a reward of 1.0 (correct) or 0.0 (incorrect).
Reset produces a random question with hidden schema
Calling POST /reset selects a random question from the Spider dataset, opens a read-only SQLite database, and returns an observation containing the question text and table names only. Column details are hidden until the agent DESCRIBEs individual tables.
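A sketch of reset, assuming a list of question records with `question` and `db_path` keys (the record shape and helper name are assumptions); the read-only connection uses SQLite's URI `mode=ro`:

```python
import random
import sqlite3

def reset(questions: list) -> dict:
    """Sketch of reset: pick a random question, open the DB read-only,
    and expose only the question text and table names (columns stay hidden)."""
    q = random.choice(questions)
    conn = sqlite3.connect(f"file:{q['db_path']}?mode=ro", uri=True)
    tables = [r[0] for r in conn.execute(
        "SELECT name FROM sqlite_master WHERE type='table'")]
    return {"question": q["question"], "tables": tables, "conn": conn}
```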
Environment enforces a 15-step budget
Each DESCRIBE, SAMPLE, or QUERY action decrements the step budget. When the budget reaches zero, the episode ends with done=True and reward=0.0. ANSWER actions do not consume budget.
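The budget rule can be sketched as follows; the mutable `state` dict is a stand-in for the environment's real episode state:

```python
def apply_budget(state: dict, action_type: str) -> dict:
    """Sketch of the 15-step budget: exploration actions consume a step,
    ANSWER does not; exhaustion ends the episode with reward 0.0."""
    if action_type in ("DESCRIBE", "SAMPLE", "QUERY"):
        state["budget_remaining"] -= 1
    if state["budget_remaining"] <= 0:
        state["done"] = True
        state["reward"] = 0.0
    return state
```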
Observations carry rich structured fields
Every observation returned by reset or step includes: question, schema_info, result, error, step_count, budget_remaining, and action_history. These replace the previous messages-only format.
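The field list above can be sketched as a dataclass; the field names follow this spec, while the concrete types are assumptions:

```python
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class SQLObservation:
    """Sketch of the structured observation (field types are assumed)."""
    question: str
    schema_info: dict
    result: Optional[Any]
    error: Optional[str]
    step_count: int
    budget_remaining: int
    action_history: list = field(default_factory=list)
```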
Type-aware answer verification
The environment accepts agent answers that match the gold answer after type-aware comparison. Integer answers are coerced ("42" matches 42), float answers allow 1% relative tolerance (95000.1 matches 95000), and list answers are compared order-insensitively ("A, B" matches "B, A").
Fallback string comparison for unknown answer types
When question metadata has no answer_type (or an unknown type), answer verification falls back to case-insensitive, whitespace-normalized string comparison for backward compatibility.
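The two verification behaviors above (type-aware comparison plus the string fallback) can be sketched together; the function name and error handling are assumptions, while the coercion, 1% tolerance, order-insensitive lists, and normalized-string fallback follow the spec:

```python
import math
from typing import Optional

def verify_answer(agent, gold, answer_type: Optional[str]) -> bool:
    """Sketch of type-aware answer verification with string fallback."""
    try:
        if answer_type == "integer":
            return int(str(agent).strip()) == int(gold)
        if answer_type == "float":
            # 1% relative tolerance
            return math.isclose(float(agent), float(gold), rel_tol=0.01)
        if answer_type == "list":
            split = lambda v: sorted(p.strip().lower()
                                     for p in str(v).split(","))
            return split(agent) == split(gold)
    except (ValueError, TypeError):
        return False
    # Unknown/missing type: case-insensitive, whitespace-normalized strings.
    norm = lambda v: " ".join(str(v).lower().split())
    return norm(agent) == norm(gold)
```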
Dense step rewards for exploration actions
The environment returns a numeric reward on every non-terminal step (DESCRIBE, SAMPLE, QUERY). Previously these steps returned no reward signal. Reward reflects operational quality (successful execution, new schema discovery) and, for QUERY actions, progress toward the gold answer.
Repeat query penalty
The environment penalizes an agent that submits the same SQL query more than once within an episode. The penalty is small (-0.01) but discourages reward farming through repetition.
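A sketch of the penalty, assuming duplicate detection by whitespace/case-normalized SQL text (the normalization is an assumption; the -0.01 value is from the spec):

```python
def repeat_penalty(seen: set, sql: str) -> float:
    """Sketch of the repeat-query penalty: -0.01 for a resubmitted query."""
    key = " ".join(sql.lower().split())  # assumed normalization
    if key in seen:
        return -0.01
    seen.add(key)
    return 0.0
```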
New-info discovery reward
The environment rewards an agent for discovering new schema information. The cumulative new-info reward is capped at 0.10 per episode to prevent farming.
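A sketch of the cap, assuming a per-item reward value (hypothetical `per_item=0.02`); the 0.10 per-episode cap is from the spec:

```python
def discovery_reward(cum: float, n_new: int, per_item: float = 0.02,
                     cap: float = 0.10):
    """Sketch of the new-info reward: pay per newly discovered schema item,
    clipped so the episode total never exceeds the cap."""
    reward = max(0.0, min(n_new * per_item, cap - cum))
    return reward, cum + reward
```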
Progress-to-answer signal for QUERY actions
When an agent issues a QUERY, the environment compares result rows against the gold answer using cardinality, value overlap, and numeric proximity metrics. The agent receives a reward proportional to improvement over its previous best progress, coarsened to 5 bins to prevent hill-climbing.
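The binning step can be sketched as follows; `progress` stands for the combined cardinality/overlap/proximity score in [0, 1], and the per-bin reward scale is an assumption (the 5-bin coarsening is from the spec):

```python
def progress_reward(progress: float, best: float, bins: int = 5,
                    scale: float = 0.05):
    """Sketch of the progress signal: coarsen progress into `bins` levels
    and pay only for improvement over the previous best bin."""
    coarse = int(progress * bins) / bins        # 0.0, 0.2, ..., 1.0
    best_coarse = int(best * bins) / bins
    reward = max(0.0, coarse - best_coarse) * scale * bins
    return reward, max(best, progress)
```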
Cumulative step reward clamping
The environment clamps cumulative step rewards to the range [-0.2, +0.5]. This ensures terminal correctness (Layer 3: +1.0 or 0.0) always dominates total episode reward.
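The clamp itself is a one-liner (helper name assumed):

```python
def clamp_cumulative(total: float, lo: float = -0.2, hi: float = 0.5) -> float:
    """Sketch of cumulative step-reward clamping to [-0.2, +0.5], so the
    terminal +1.0/0.0 correctness reward always dominates."""
    return max(lo, min(hi, total))
```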
Modified
reset() now loads a question and opens a database
Before: reset() cleared message history and returned an observation containing only the system prompt as a chat message. After: reset() selects a random Spider question, opens a read-only SQLite connection to the corresponding database, computes the gold answer, and returns a rich observation with the question text and available table names.
step() now executes actions deterministically
Before: step() dispatched to Ollama for table selection (DESCRIBE/SAMPLE) and SQL generation (QUERY). No SQL was ever executed against a database. After: step() reads action_type and argument directly from the agent's structured action and executes against a live SQLite database. No external LLM is involved.
SQLAction uses argument field instead of action_description
Before: SQLAction carried action_description (free-text NL description) and tokens (torch.Tensor).
After: SQLAction carries argument (structured value: table name, SQL, or answer). The tokens field is removed.
SQLObservation returns structured fields instead of chat messages
Before: SQLObservation contained messages (list of chat messages) and tokens (flattened tensor).
After: SQLObservation contains question, schema_info, result, error, step_count, budget_remaining, and action_history. The messages and tokens fields are removed.
Answer correctness determination
Before: ANSWER correctness was based on lowercased/trimmed string equality only, so semantically correct numeric/list answers could be rejected.
After: ANSWER correctness dispatches through a type-aware verifier (integer, float, string, list) with gold_rows support for structured list comparison.
Non-terminal step reward value
Before: Non-terminal steps (DESCRIBE, SAMPLE, QUERY) produced a reward of None in observations.
After: Non-terminal steps produce a numeric float reward reflecting operational and progress signals. Consumers should use done (not reward is None) to detect terminality.
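The migration for consumers can be sketched as follows, assuming a dict-shaped observation for illustration:

```python
def is_terminal(obs: dict) -> bool:
    """Sketch of the new terminality check: rely on `done`, not on
    `reward is None` (non-terminal steps now carry numeric rewards)."""
    return bool(obs.get("done"))
```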