# System Behavior: Training

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F006 on 2026-03-28

---

## Training Pipeline

### Training notebook produces a trained model from one-click execution
<!-- since: F006 | test: tests/training/test_config.py::test_grpo_config_defaults -->

The system provides a `notebooks/train_grpo.ipynb` notebook that, when run end-to-end, downloads a Hugging Face model, trains it on SQLEnv episodes using GRPO, and saves the trained weights to a configurable output directory.

### Training produces a learning curve showing reward improvement
<!-- since: F006 -->

After training completes, the notebook displays a matplotlib plot of reward over training steps, showing whether the model learned to improve its SQL exploration strategy over the course of training.
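A minimal sketch of how such a learning curve could be drawn, assuming the notebook has already collected one scalar reward per training step into a plain list. The helper names `moving_average` and `plot_learning_curve` and the `window` parameter are hypothetical and not taken from the project; smoothing is an added convenience, since raw per-step rewards tend to be noisy.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early points average over what is available."""
    if window < 1:
        raise ValueError("window must be >= 1")
    smoothed: list[float] = []
    running = 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            # Drop the value that just left the trailing window.
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def plot_learning_curve(rewards: list[float], window: int = 10):
    """Plot raw and smoothed reward against training step (hypothetical helper)."""
    # Imported lazily so moving_average stays dependency-free.
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots()
    steps = range(1, len(rewards) + 1)
    ax.plot(steps, rewards, alpha=0.3, label="per-step reward")
    ax.plot(steps, moving_average(rewards, window), label=f"moving average (window={window})")
    ax.set_xlabel("training step")
    ax.set_ylabel("reward")
    ax.legend()
    return fig
```

An upward trend in the smoothed line is the signal of interest; a flat line suggests the policy did not improve over the run.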

### Training produces side-by-side episode transcripts
<!-- since: F006 -->

After training completes, the notebook displays episode transcripts comparing random-action baseline episodes against trained-model episodes on the same questions, showing the difference in exploration behavior.

### Rollout function plays SQLEnv episodes via model generation
<!-- since: F006 | test: tests/training/test_rollout.py::test_rollout_func -->

The system accepts a batch of question prompts and returns episode completions by playing full SQLEnv episodes: resetting the environment, generating actions with Hugging Face `model.generate()`, parsing them into SQLActions, and stepping the environment until the episode ends.
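The reset, generate, parse, step loop described above can be sketched as follows. This is an illustration under stated assumptions, not the project's rollout function: `ToyEnv` stands in for SQLEnv, `generate` stands in for a wrapper around `model.generate()`, and the `play_episode`, `rollout`, and `max_steps` names are hypothetical.

```python
from typing import Callable


class ToyEnv:
    """Stand-in environment (not the real SQLEnv API): ends on an ANSWER action."""

    def reset(self, question: str) -> str:
        return f"Question: {question}"

    def step(self, action: str) -> tuple[str, bool]:
        done = action.startswith("ANSWER")
        return ("episode over" if done else "ok"), done


def play_episode(
    env: ToyEnv,
    question: str,
    generate: Callable[[str], str],
    parse: Callable[[str], str],
    max_steps: int = 8,  # assumed safety cap so a looping policy cannot hang training
) -> str:
    """Play one episode and return the concatenated model outputs."""
    observation = env.reset(question)
    outputs: list[str] = []
    for _ in range(max_steps):
        text = generate(observation)   # model proposes the next action as text
        outputs.append(text)
        action = parse(text)           # text -> action understood by the env
        observation, done = env.step(action)
        if done:
            break
    return "\n".join(outputs)


def rollout(
    env_factory: Callable[[], ToyEnv],
    questions: list[str],
    generate: Callable[[str], str],
    parse: Callable[[str], str],
) -> list[str]:
    """One fresh environment and one completion per question prompt."""
    return [play_episode(env_factory(), q, generate, parse) for q in questions]
```

Passing `generate` and `parse` as callables keeps the loop testable with a scripted policy, without loading a model.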

### Reward functions return per-completion scores for GRPO training
<!-- since: F006 | test: tests/training/test_rewards.py::test_reward_correctness -->

The system accepts TRL-format completion batches and returns one list of float rewards, with one score per completion, from each of three independent callables: correctness (binary 0/1), progress (normalized cumulative progress), and operational (sum of per-step L1 signals).
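As a rough sketch of the three callables, the functions below follow the general shape of a TRL reward function: they take a batch of completions plus keyword arguments and return one float per completion. How episode results reach the reward functions is not stated in this document, so the `episodes` keyword and its `correct`, `progress`, and `step_signals` keys are hypothetical, as is the choice to normalize progress by clamping to [0, 1].

```python
from typing import Any

# Hypothetical per-episode record; the real project may carry this differently.
Episode = dict[str, Any]


def reward_correctness(completions: list[Any], episodes: list[Episode], **kwargs: Any) -> list[float]:
    """Binary reward: 1.0 if the episode ended with a correct answer, else 0.0."""
    return [1.0 if ep["correct"] else 0.0 for ep in episodes]


def reward_progress(completions: list[Any], episodes: list[Episode], **kwargs: Any) -> list[float]:
    """Cumulative progress, clamped to [0, 1] (clamping is an assumption)."""
    return [min(max(float(ep["progress"]), 0.0), 1.0) for ep in episodes]


def reward_operational(completions: list[Any], episodes: list[Episode], **kwargs: Any) -> list[float]:
    """Sum of the per-step signals recorded during the episode."""
    return [float(sum(ep["step_signals"])) for ep in episodes]
```

Keeping the three signals as separate callables lets the trainer log and weight each one independently.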

### Unparseable model output falls back to QUERY action
<!-- since: F006 | test: tests/training/test_rollout.py::test_parse_model_output_fallback -->

When the model produces text that cannot be parsed as `ACTION_TYPE: argument` format, the system defaults to a QUERY action with the raw text as the argument, allowing the episode to continue rather than crashing.
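The fallback can be illustrated with a small parser. This is a sketch, not the project's `parse_model_output`: the `SQLAction` dataclass below is a stand-in with assumed field names, the set of known action types is hypothetical (only QUERY is named in this document), and matching action types case-insensitively is an added assumption.

```python
from dataclasses import dataclass

# Hypothetical action vocabulary; only QUERY is named in the spec above.
KNOWN_ACTION_TYPES = {"QUERY", "ANSWER"}


@dataclass(frozen=True)
class SQLAction:
    """Stand-in for the project's SQLAction; field names are assumptions."""

    action_type: str
    argument: str


def parse_model_output(text: str) -> SQLAction:
    """Parse `ACTION_TYPE: argument`, falling back to QUERY with the raw text."""
    head, sep, tail = text.partition(":")
    action_type = head.strip().upper()
    argument = tail.strip()
    if sep and action_type in KNOWN_ACTION_TYPES and argument:
        return SQLAction(action_type, argument)
    # Fallback: treat the whole output as a query so the episode can continue.
    return SQLAction("QUERY", text)
```

With this sketch, `parse_model_output("SELECT * FROM t")` yields a QUERY action carrying the raw text, so a malformed generation costs the model a wasted step rather than aborting the rollout.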