# System Behavior: evaluation

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F005 on 2026-03-28

---

## Added

### Automated multi-episode evaluation

The system accepts an environment, a policy, and an episode count, then produces an `EvaluationResult` containing `success_rate`, `avg_reward`, `avg_steps`, and a per-episode breakdown. Evaluation runs all requested episodes and returns structured metrics in a single call.

### Incremental result collection on failure

When an individual episode fails (environment error or policy error), the system records the failure in the per-episode breakdown and continues evaluating remaining episodes. Partial results are never lost.

### Random baseline policy

The system provides a built-in random policy that accepts an `SQLObservation` and returns a random `SQLAction`. Given the same seed, the random policy produces identical action sequences across runs.

### Progress callback during evaluation

The evaluate function accepts an optional progress callback that receives `(current_episode, total_episodes)` after each episode completes, enabling progress reporting for long evaluation runs.
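
The behaviors described above can be illustrated with a minimal Python sketch. It is not the system's actual implementation. The names `EpisodeResult`, `RandomPolicy`, `select_action`, `FlakyEnv`, `max_steps`, and the `reset()`/`step()` environment interface are assumptions made for illustration, and plain strings stand in for `SQLObservation` and `SQLAction`. The sketch also assumes that failed episodes count toward the averages, that the progress callback fires for failed episodes too, and that "success" means an episode terminated with positive total reward. None of these are stated by the spec.

```python
import random
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class EpisodeResult:  # hypothetical per-episode record
    episode: int
    success: bool
    reward: float
    steps: int
    error: Optional[str] = None  # populated only when the episode failed


@dataclass
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeResult] = field(default_factory=list)


class RandomPolicy:
    """Seeded random baseline; the action list is a stand-in for SQLAction."""

    def __init__(self, actions, seed=None):
        self._actions = list(actions)
        self._rng = random.Random(seed)  # private RNG keeps runs reproducible

    def select_action(self, observation):
        return self._rng.choice(self._actions)


class FlakyEnv:
    """Toy environment: one-step episodes; the second reset raises an error."""

    def __init__(self):
        self._resets = 0

    def reset(self):
        self._resets += 1
        if self._resets == 2:
            raise RuntimeError("simulated environment error")
        return "obs"

    def step(self, action):
        return "obs", 1.0, True  # (observation, reward, done)


def evaluate(env, policy, n_episodes, progress_callback=None, max_steps=100):
    episodes = []
    for i in range(n_episodes):
        reward, steps = 0.0, 0
        try:
            obs = env.reset()
            done = False
            while not done and steps < max_steps:
                obs, r, done = env.step(policy.select_action(obs))
                reward += r
                steps += 1
            # Assumed success criterion: terminated with positive total reward.
            episodes.append(EpisodeResult(i, done and reward > 0, reward, steps))
        except Exception as exc:
            # Record the failure and keep going so partial results survive.
            episodes.append(EpisodeResult(i, False, reward, steps, error=str(exc)))
        if progress_callback is not None:
            progress_callback(i + 1, n_episodes)
    n = len(episodes) or 1  # avoid division by zero when n_episodes == 0
    return EvaluationResult(
        success_rate=sum(e.success for e in episodes) / n,
        avg_reward=sum(e.reward for e in episodes) / n,
        avg_steps=sum(e.steps for e in episodes) / n,
        episodes=episodes,
    )


if __name__ == "__main__":
    result = evaluate(
        FlakyEnv(),
        RandomPolicy(["SELECT", "DESCRIBE"], seed=0),
        n_episodes=3,
        progress_callback=lambda cur, total: print(f"episode {cur}/{total}"),
    )
    print(result.success_rate, [e.error for e in result.episodes])
```

In this sketch the second episode fails during `reset()`, yet the returned breakdown still holds three entries, with the failure recorded in the `error` field of the second one. Giving the policy its own `random.Random(seed)` instance, rather than seeding the module-level generator, is one way to keep action sequences reproducible regardless of other random calls in the process.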