# System Behavior: evaluation

> Living document. Updated by `/archive-spec` when features are completed.
> Last archived: F005 on 2026-03-28

---

## Added

### Automated multi-episode evaluation
<!-- since: F005 | test: tests/test_evaluation.py::test_evaluate_returns_correct_metrics -->

The system accepts an environment, a policy, and an episode count, then produces an `EvaluationResult` containing `success_rate`, `avg_reward`, `avg_steps`, and a per-episode breakdown. A single call runs all requested episodes and returns these structured metrics.
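
A minimal sketch of how such a call might be shaped is shown here. Only `EvaluationResult` and its three metric names come from this spec. The `EpisodeResult` record, the `episodes` field name, the `reset()`/`step()` environment interface, the toy `CountdownEnv`, and the "success means positive reward" rule are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List, Tuple


@dataclass
class EpisodeResult:
    # Hypothetical per-episode record; field names are assumptions.
    episode: int
    reward: float
    steps: int
    success: bool


@dataclass
class EvaluationResult:
    success_rate: float
    avg_reward: float
    avg_steps: float
    episodes: List[EpisodeResult] = field(default_factory=list)  # assumed name


class CountdownEnv:
    """Toy stand-in environment: every episode ends after `horizon` steps."""

    def __init__(self, horizon: int = 3) -> None:
        self.horizon = horizon
        self.t = 0

    def reset(self) -> int:
        self.t = 0
        return self.t

    def step(self, action: Any) -> Tuple[int, float, bool]:
        self.t += 1
        done = self.t >= self.horizon
        reward = 1.0 if done else 0.0  # reward only on the final step
        return self.t, reward, done


def evaluate(env: Any, policy: Callable[[Any], Any], n_episodes: int) -> EvaluationResult:
    """Run n_episodes and aggregate the metrics in one call."""
    episodes: List[EpisodeResult] = []
    for i in range(n_episodes):
        obs = env.reset()
        total, steps, done = 0.0, 0, False
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward
            steps += 1
        # Assumption: an episode counts as a success if it earned any reward.
        episodes.append(EpisodeResult(i, total, steps, success=total > 0))
    n = max(len(episodes), 1)  # avoid dividing by zero when n_episodes == 0
    return EvaluationResult(
        success_rate=sum(e.success for e in episodes) / n,
        avg_reward=sum(e.reward for e in episodes) / n,
        avg_steps=sum(e.steps for e in episodes) / n,
        episodes=episodes,
    )


result = evaluate(CountdownEnv(horizon=3), policy=lambda obs: "noop", n_episodes=4)
```

With the toy environment above, every episode takes three steps and earns a reward of 1.0, so the aggregates are easy to check by hand.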

### Incremental result collection on failure
<!-- since: F005 | test: tests/test_evaluation.py::test_evaluate_survives_episode_failure -->

When an individual episode fails because of an environment error or a policy error, the system records the failure in the per-episode breakdown and continues evaluating the remaining episodes. Results from the other episodes are preserved in the returned result.
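
One way to get this behavior is to wrap each episode in its own `try`/`except` and append a record either way. The sketch below assumes that approach. `EpisodeRecord`, its `error` field, `run_episodes`, and `flaky_episode` are hypothetical names, not the project's actual API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EpisodeRecord:
    # Hypothetical record; `error` is None for episodes that completed.
    episode: int
    reward: float = 0.0
    error: Optional[str] = None


def run_episodes(run_one: Callable[[int], float], n_episodes: int) -> List[EpisodeRecord]:
    """Run every episode, recording failures instead of aborting the loop."""
    records: List[EpisodeRecord] = []
    for i in range(n_episodes):
        try:
            records.append(EpisodeRecord(i, reward=run_one(i)))
        except Exception as exc:  # environment error or policy error
            records.append(EpisodeRecord(i, error=f"{type(exc).__name__}: {exc}"))
    return records


def flaky_episode(i: int) -> float:
    """Stand-in episode runner that fails on the second episode only."""
    if i == 1:
        raise RuntimeError("environment crashed")
    return 1.0


records = run_episodes(flaky_episode, 3)
```

Because each record is appended inside the loop, episodes that ran before and after the failure both appear in the output.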

### Random baseline policy
<!-- since: F005 | test: tests/test_evaluation.py::test_random_policy_deterministic -->

The system provides a built-in random policy that accepts an `SQLObservation` and returns a random `SQLAction`. Given the same seed, the random policy produces identical action sequences across runs.
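
Seeded determinism of this kind is commonly achieved by giving the policy its own `random.Random` instance rather than using the module-level generator. The sketch below assumes that design. The fields on `SQLObservation` and `SQLAction` are placeholders, since this spec does not describe them, and `RandomPolicy` and `rollout` are hypothetical names.

```python
import random
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class SQLObservation:
    # Placeholder field: the real observation contents are not given here.
    candidate_queries: Tuple[str, ...]


@dataclass(frozen=True)
class SQLAction:
    # Placeholder field: the real action contents are not given here.
    query: str


class RandomPolicy:
    """Picks a uniformly random candidate using a private, seeded RNG."""

    def __init__(self, seed: Optional[int] = None) -> None:
        # A private generator keeps results independent of global random state.
        self._rng = random.Random(seed)

    def __call__(self, obs: SQLObservation) -> SQLAction:
        return SQLAction(self._rng.choice(obs.candidate_queries))


obs = SQLObservation(("SELECT 1", "SELECT 2", "SELECT 3"))


def rollout(seed: int, n: int = 5) -> List[SQLAction]:
    """Build a fresh policy from `seed` and collect n actions from it."""
    policy = RandomPolicy(seed)
    return [policy(obs) for _ in range(n)]


first_run = rollout(seed=7)
second_run = rollout(seed=7)
```

Two rollouts built from the same seed yield the same action sequence, which is the property the referenced test checks.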

### Progress callback during evaluation
<!-- since: F005 | test: tests/test_evaluation.py::test_progress_callback_called -->

The `evaluate` function accepts an optional progress callback that receives `(current_episode, total_episodes)` after each episode completes, enabling progress reporting for long evaluation runs.
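
A minimal sketch of the callback contract follows. The parameter name `progress`, the `run_one` helper, and the 1-based `current_episode` count are assumptions. The spec states only that the callback receives `(current_episode, total_episodes)` after each episode.

```python
from typing import Callable, List, Optional, Tuple

ProgressCallback = Callable[[int, int], None]


def evaluate(
    run_one: Callable[[int], float],
    n_episodes: int,
    progress: Optional[ProgressCallback] = None,
) -> List[float]:
    """Run n_episodes, notifying `progress` after each one completes."""
    rewards: List[float] = []
    for i in range(n_episodes):
        rewards.append(run_one(i))
        if progress is not None:
            # Assumption: current_episode is reported as a 1-based count.
            progress(i + 1, n_episodes)
    return rewards


calls: List[Tuple[int, int]] = []
rewards = evaluate(
    lambda i: 1.0,
    3,
    progress=lambda current, total: calls.append((current, total)),
)
```

A caller could pass a function that prints `current/total` or updates a progress bar instead of appending to a list.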