System Behavior: evaluation
Living document. Updated by /archive-spec when features are completed. Last archived: F005 on 2026-03-28
Added
Automated multi-episode evaluation
The system accepts an environment, a policy, and an episode count, then produces an EvaluationResult containing success_rate, avg_reward, avg_steps, and a per-episode breakdown. Evaluation runs all requested episodes and returns structured metrics in a single call.
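As a rough illustration of the result shape described above, the sketch below aggregates per-episode records into the four named fields. Only `EvaluationResult`, `success_rate`, `avg_reward`, and `avg_steps` come from this document. The `EpisodeResult` record, its fields, the `episodes` attribute name, and the `summarize` helper are assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; the field names are assumptions.
    episode: int
    success: bool
    reward: float
    steps: int


@dataclass
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    # Per-episode breakdown; the attribute name is an assumption.
    episodes: List[EpisodeResult] = field(default_factory=list)


def summarize(episodes: List[EpisodeResult]) -> EvaluationResult:
    """Aggregate per-episode records into the top-level metrics."""
    n = len(episodes)
    if n == 0:
        # No episodes requested: return zeroed metrics instead of dividing by zero.
        return EvaluationResult(0.0, 0.0, 0.0, [])
    return EvaluationResult(
        success_rate=sum(e.success for e in episodes) / n,
        avg_reward=sum(e.reward for e in episodes) / n,
        avg_steps=sum(e.steps for e in episodes) / n,
        episodes=list(episodes),
    )
```

For two episodes, one successful with reward 1.0 in 4 steps and one unsuccessful with reward 0.0 in 6 steps, `summarize` would report a success rate of 0.5, an average reward of 0.5, and an average of 5.0 steps.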
Incremental result collection on failure
When an individual episode fails (environment error or policy error), the system records the failure in the per-episode breakdown and continues evaluating remaining episodes. Partial results are never lost.
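One way to get this behavior is to wrap each episode in its own `try`/`except`, so that a failure becomes a record in the breakdown and does not abort the run. The sketch below is a minimal version under stated assumptions: `run_episodes`, the `run_one` callable, and the dictionary keys are hypothetical and not taken from the actual implementation.

```python
from typing import Any, Callable, Dict, List


def run_episodes(
    run_one: Callable[[int], Dict[str, Any]],  # hypothetical: runs one episode
    n_episodes: int,
) -> List[Dict[str, Any]]:
    """Run every episode, recording failures instead of propagating them."""
    records: List[Dict[str, Any]] = []
    for i in range(n_episodes):
        try:
            record = {"episode": i, "error": None, **run_one(i)}
        except Exception as exc:
            # Environment or policy error: keep a record and move on.
            record = {
                "episode": i,
                "success": False,
                "error": f"{type(exc).__name__}: {exc}",
            }
        # Append immediately so earlier results survive later failures.
        records.append(record)
    return records
```

How failed episodes should count toward `avg_reward` and `avg_steps` is not stated here, so the sketch only builds the breakdown and leaves aggregation out.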
Random baseline policy
The system provides a built-in random policy that accepts an SQLObservation and returns a random SQLAction. Given the same seed, the random policy produces identical action sequences across runs.
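Seeded reproducibility of this kind is commonly achieved by giving the policy its own `random.Random` instance, so that it does not share the global generator. The sketch below assumes that approach. The `SQLObservation` and `SQLAction` classes here are simplified stand-ins, since their real fields are not given in this document, and the action types and arguments are invented for illustration.

```python
import random
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SQLObservation:
    # Stand-in type; the real fields are not specified in this document.
    question: str


@dataclass(frozen=True)
class SQLAction:
    # Stand-in type; the real fields are not specified in this document.
    action_type: str
    argument: str


class RandomPolicy:
    # Hypothetical action space, for illustration only.
    ACTION_TYPES = ("describe", "sample", "query", "answer")
    ARGUMENTS = ("users", "orders", "products")

    def __init__(self, seed: Optional[int] = None) -> None:
        # A private generator keeps the sequence independent of other
        # code that uses the module-level random functions.
        self._rng = random.Random(seed)

    def __call__(self, observation: SQLObservation) -> SQLAction:
        # The observation is ignored: a random baseline does not condition on it.
        return SQLAction(
            action_type=self._rng.choice(self.ACTION_TYPES),
            argument=self._rng.choice(self.ARGUMENTS),
        )
```

Two `RandomPolicy` instances built with the same seed and called the same number of times would return the same sequence of actions.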
Progress callback during evaluation
The evaluate function accepts an optional progress callback that receives (current_episode, total_episodes) after each episode completes, enabling progress reporting for long evaluation runs.
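The callback contract can be sketched as follows. The `(current_episode, total_episodes)` argument pair comes from the description above. The function name `evaluate_with_progress`, the `run_one` callable, and the choice of a 1-based episode count are assumptions.

```python
from typing import Callable, List, Optional

# Receives (current_episode, total_episodes) after each episode completes.
ProgressCallback = Callable[[int, int], None]


def evaluate_with_progress(
    run_one: Callable[[int], float],  # hypothetical: runs one episode, returns its reward
    n_episodes: int,
    progress: Optional[ProgressCallback] = None,
) -> List[float]:
    rewards: List[float] = []
    for i in range(n_episodes):
        rewards.append(run_one(i))
        if progress is not None:
            # Assumption: the count is 1-based, so the last call is (n, n).
            progress(i + 1, n_episodes)
    return rewards
```

A caller might pass `lambda cur, total: print(f"{cur}/{total}")` to print a simple counter, or omit the argument to run silently.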