
System Behavior: evaluation

Living document. Updated by /archive-spec when features are completed. Last archived: F005 on 2026-03-28


Added

Automated multi-episode evaluation

The system accepts an environment, a policy, and an episode count, then produces an EvaluationResult containing success_rate, avg_reward, avg_steps, and a per-episode breakdown. Evaluation runs all requested episodes and returns structured metrics in a single call.
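A minimal sketch of how this could look, not the project's actual code. The `EvaluationResult` field names `success_rate`, `avg_reward`, and `avg_steps` come from the description above. Everything else is an assumption made for illustration: the `EpisodeResult` record, the `episodes` field name, the toy `CountdownEnv` with its `reset()`/`step()` interface, and the rule that an episode succeeds when its total reward is positive.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; field names are illustrative.
    episode: int
    success: bool
    reward: float
    steps: int


@dataclass
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeResult] = field(default_factory=list)  # assumed name


class CountdownEnv:
    """Toy stand-in environment: an episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 3):
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: int):
        # Returns (observation, reward, done); reward is 1.0 only on the final step.
        self.t += 1
        done = self.t >= self.horizon
        return self.t, (1.0 if done else 0.0), done


def evaluate(env, policy: Callable[[int], int], n_episodes: int) -> EvaluationResult:
    records = []
    for i in range(n_episodes):
        obs, total, steps, done = env.reset(), 0.0, 0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
            steps += 1
        # Assumed success rule: any positive total reward counts as success.
        records.append(EpisodeResult(i, success=total > 0, reward=total, steps=steps))
    n = max(len(records), 1)  # avoid division by zero when n_episodes == 0
    return EvaluationResult(
        success_rate=sum(r.success for r in records) / n,
        avg_reward=sum(r.reward for r in records) / n,
        avg_steps=sum(r.steps for r in records) / n,
        episodes=records,
    )


result = evaluate(CountdownEnv(horizon=3), policy=lambda obs: 0, n_episodes=4)
print(result.success_rate, result.avg_reward, result.avg_steps)
```

Keeping the per-episode records next to the aggregates lets a caller recompute or filter metrics without rerunning the evaluation.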

Incremental result collection on failure

When an individual episode fails (environment error or policy error), the system records the failure in the per-episode breakdown and continues evaluating remaining episodes. Partial results are never lost.
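The continue-on-failure behavior can be sketched as a `try`/`except` around each episode. The function name `evaluate_with_failures`, the dictionary keys, and the `flaky_episode` helper are hypothetical. How failed episodes count toward the aggregate metrics is not stated above, so this sketch only records them.

```python
from typing import Callable, Dict, List


def evaluate_with_failures(
    run_episode: Callable[[int], float], n_episodes: int
) -> List[Dict]:
    """Run every episode; a raised exception becomes a failure record."""
    breakdown = []
    for i in range(n_episodes):
        try:
            reward = run_episode(i)
            breakdown.append({"episode": i, "reward": reward, "error": None})
        except Exception as exc:  # environment error or policy error
            # Record the failure and move on, so earlier results are kept.
            breakdown.append({"episode": i, "reward": 0.0, "error": repr(exc)})
    return breakdown


def flaky_episode(i: int) -> float:
    # Hypothetical episode runner that fails on the second episode only.
    if i == 1:
        raise RuntimeError("simulated environment error")
    return 1.0


breakdown = evaluate_with_failures(flaky_episode, n_episodes=3)
failed = [r["episode"] for r in breakdown if r["error"] is not None]
print(failed)
```

Because each record is appended as soon as its episode finishes, an error in one episode cannot discard results that were already collected.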

Random baseline policy

The system provides a built-in random policy that accepts an SQLObservation and returns a random SQLAction. Given the same seed, the random policy produces identical action sequences across runs.
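One way to get this determinism is to give the policy its own seeded generator instead of relying on global random state. In the sketch below, `SQLObservation` and `SQLAction` are stand-in dataclasses with hypothetical fields, since their real definitions are not given here. The `RandomPolicy` class name and the action kinds are also assumptions.

```python
import random
from dataclasses import dataclass


@dataclass
class SQLObservation:
    # Stand-in: the real class's fields are not described in this document.
    text: str


@dataclass
class SQLAction:
    # Stand-in with a hypothetical `kind` field.
    kind: str


class RandomPolicy:
    ACTION_KINDS = ("describe", "sample", "query", "answer")  # illustrative only

    def __init__(self, seed: int = 0):
        # A private Random instance keeps the sequence independent of
        # any other code that uses the global random module.
        self._rng = random.Random(seed)

    def __call__(self, observation: SQLObservation) -> SQLAction:
        return SQLAction(kind=self._rng.choice(self.ACTION_KINDS))


def rollout(seed: int, n: int = 10):
    """Return the action kinds chosen over n calls with a fresh policy."""
    policy = RandomPolicy(seed)
    return [policy(SQLObservation(text="")).kind for _ in range(n)]


print(rollout(42) == rollout(42))
```

Two policies built with the same seed produce the same sequence of actions, which is what makes the baseline reproducible across runs.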

Progress callback during evaluation

The evaluate function accepts an optional progress callback that receives (current_episode, total_episodes) after each episode completes, enabling progress reporting for long evaluation runs.
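A minimal sketch of the callback contract. The parameter name `progress_callback`, the `run_episode` argument, and the choice to count `current_episode` from 1 are assumptions. Only the `(current_episode, total_episodes)` argument pair comes from the description above.

```python
from typing import Callable, List, Optional

ProgressCallback = Callable[[int, int], None]


def evaluate(
    run_episode: Callable[[int], float],
    n_episodes: int,
    progress_callback: Optional[ProgressCallback] = None,
) -> List[float]:
    rewards = []
    for i in range(n_episodes):
        rewards.append(run_episode(i))
        # Notify only after the episode has completed; 1-based count assumed.
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    return rewards


seen = []
rewards = evaluate(
    lambda i: float(i),
    3,
    progress_callback=lambda current, total: seen.append((current, total)),
)
print(seen)
```

A caller could pass a function that updates a progress bar or writes a log line. Omitting the callback leaves the evaluation unchanged.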