Feature Demo: F002 — Answer Verification
Generated: 2026-03-27T22:37:50Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F002
What This Feature Does
When an agent submits an ANSWER, this feature makes the final pass/fail decision robust to common formatting and representation differences. From a user perspective, it reduces false negatives where the agent is semantically correct but uses a different format (for example numeric formatting differences, casing differences, or reordered list values).
The intended experience is clear and predictable scoring: tolerant float matching, order-insensitive list matching, and unambiguous terminal reward outcomes with fewer frustrating “technically wrong but practically right” rejections.
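The matching behaviors described above can be sketched in a few lines. The following is a hypothetical illustration only: the function name, the 1% tolerance, and the comma-separated list format are assumptions for this sketch, not the project's actual API (the implementation was not read for this demo).

```python
# Hypothetical sketch of type-aware answer verification as described in this
# demo. Names, tolerances, and input formats are assumptions, not the real API.
def verify(predicted: str, gold, answer_type: str, rel_tol: float = 0.01) -> bool:
    """Return True when `predicted` matches `gold` for the given answer type."""
    if not predicted.strip():
        return False  # blank answers fail deterministically
    if answer_type == "integer":
        try:
            return int(predicted) == int(gold)
        except ValueError:
            return False  # coercion failure -> no reward
    if answer_type == "float":
        try:
            p, g = float(predicted), float(gold)
        except ValueError:
            return False
        if g == 0.0:
            return abs(p) <= 1e-6  # relative tolerance is undefined at zero
        return abs(p - g) <= rel_tol * abs(g)  # tolerant float matching
    if answer_type == "string":
        return predicted.strip().lower() == str(gold).strip().lower()
    if answer_type == "list":
        pred_items = sorted(x.strip().lower() for x in predicted.split(","))
        gold_items = sorted(str(x).strip().lower() for x in gold)
        return pred_items == gold_items  # order-insensitive
    return predicted.strip() == str(gold).strip()  # fallback: exact match
```

Under these assumed rules, `42.0` matches gold `42` on the float path, `PARIS` matches `paris`, and `b, a` matches `["a", "b"]`, while an empty or non-coercible answer fails cleanly.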
What Is Already Proven
Verified in This Demo Run
- Happy-path typed verification scenarios pass for integer, float, string, and list dispatch paths.
- Full integration flow through environment `step()` passes for integer/float/string/list and fallback behavior.
- Edge and error behavior is exercised locally: empty predicted input fails, float tolerance boundary checks pass/fail correctly, and integer coercion failure returns zero reward.
- Existing smoke coverage for answer episode termination still passes.
Previously Verified Evidence
`specs/FEATURES.json` (F002.verification_evidence) records a verifier-approved run: `uv run pytest tests/ -v` with 65/65 passed at 2026-03-27T22:33:12Z. `specs/F002-IMPLEMENTATION_SPEC.md` Section 7 records completed step evidence, including a full suite pass and an integration pass.
What Still Needs User Verification
- Run one manual episode in your target runtime (your exact dataset/runtime environment) and submit a known-correct `ANSWER` with alternate formatting (for example `42.0` vs `42`) to confirm behavior in your end-to-end setup.
Quickstart / Verification Steps
Run these commands to see the feature in action:
```shell
uv run pytest tests/test_verifier_integration.py -v
uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"
```
Prerequisite: dependencies installed via `uv sync`.
Live Local Proof
Validate typed ANSWER handling through environment flow
This runs the integration scenarios that exercise answer verification via the real environment step flow.
```shell
uv run pytest tests/test_verifier_integration.py -v
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items
tests/test_verifier_integration.py::test_integer_answer_flow PASSED [ 16%]
tests/test_verifier_integration.py::test_float_answer_flow PASSED [ 33%]
tests/test_verifier_integration.py::test_string_answer_flow PASSED [ 50%]
tests/test_verifier_integration.py::test_list_answer_flow PASSED [ 66%]
tests/test_verifier_integration.py::test_fallback_when_answer_type_missing PASSED [ 83%]
tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]
============================== 6 passed in 7.92s ===============================
```
What to notice: the flow covers all core answer types plus fallback and failure behavior in one environment-facing test surface.
Confirm happy-path matching behavior for core answer types
This run checks representative dispatcher-level happy cases.
```shell
uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 30 deselected / 4 selected
tests/test_verifier.py::test_verify_integer_exact_match PASSED [ 25%]
tests/test_verifier.py::test_verify_float_within_tolerance PASSED [ 50%]
tests/test_verifier.py::test_verify_string_case_insensitive PASSED [ 75%]
tests/test_verifier.py::test_verify_list_order_insensitive PASSED [100%]
======================= 4 passed, 30 deselected in 7.87s =======================
```
What to notice: each answer type has at least one direct pass case that aligns to the feature’s success criteria.
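One detail worth calling out on the list path: an order-insensitive comparison should still respect duplicate values. A minimal, self-contained sketch of a duplicate-aware (multiset) comparison follows; it assumes the predicted list arrives as a comma-separated string, which may differ from the real verifier's input format (the implementation was not read).

```python
from collections import Counter

def lists_match(predicted: str, gold: list) -> bool:
    """Order-insensitive, duplicate-aware comparison of list answers.

    Hypothetical helper: the comma-separated input format is an assumption
    made for this sketch, not the project's real representation.
    """
    pred_items = Counter(p.strip().lower() for p in predicted.split(","))
    gold_items = Counter(str(g).strip().lower() for g in gold)
    return pred_items == gold_items
```

Using `Counter` rather than `set` means a prediction like `"a, a, b"` is rejected against gold `["a", "b"]`, which a naive set comparison would accept.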
Existing Evidence
- Prior full regression evidence (not re-run in this demo): `uv run pytest tests/ -v` => 65 passed (`specs/FEATURES.json`, F002 verification evidence).
Manual Verification Checklist
- Start from a clean shell in the project root and run `uv sync`.
- Execute `uv run pytest tests/test_verifier_integration.py -v` and confirm all 6 integration tests pass.
- Execute the happy-path dispatcher command from Quickstart and confirm the 4 selected tests pass.
- Optionally run `uv run pytest tests/ -v` to confirm no regressions outside F002.
Edge Cases Exercised
Empty predicted answer is rejected
```shell
uv run pytest tests/test_verifier.py -v -k "test_verify_empty_predicted_returns_false"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 33 deselected / 1 selected
tests/test_verifier.py::test_verify_empty_predicted_returns_false PASSED [100%]
======================= 1 passed, 33 deselected in 7.83s =======================
```
This matters because blank answers should fail deterministically rather than being ambiguously normalized.
Float tolerance boundary and non-numeric rejection
```shell
uv run pytest tests/test_verifier.py -v -k "_compare_float"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 26 deselected / 8 selected
tests/test_verifier.py::test_compare_float_exact_match PASSED [ 12%]
tests/test_verifier.py::test_compare_float_within_1pct_tolerance PASSED [ 25%]
tests/test_verifier.py::test_compare_float_outside_1pct_tolerance PASSED [ 37%]
tests/test_verifier.py::test_compare_float_boundary_exactly_1pct PASSED [ 50%]
tests/test_verifier.py::test_compare_float_just_over_1pct PASSED [ 62%]
tests/test_verifier.py::test_compare_float_gold_zero_uses_absolute_tolerance PASSED [ 75%]
tests/test_verifier.py::test_compare_float_gold_zero_fails_large_diff PASSED [ 87%]
tests/test_verifier.py::test_compare_float_non_numeric_returns_false PASSED [100%]
======================= 8 passed, 26 deselected in 7.10s =======================
```
This matters because it validates both tolerant matching and strict rejection when values exceed tolerance or are invalid.
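A hedged sketch of the comparison semantics these test names describe: a 1% relative tolerance with an inclusive boundary, an absolute-tolerance fallback when the gold value is zero, and rejection (not an exception) for non-numeric input. The helper body and the exact tolerance constants below are assumptions, not the real `_compare_float`.

```python
# Hypothetical stand-in for the float comparison the tests above exercise.
# Tolerance values are assumed for illustration; only the behaviors
# (inclusive 1% boundary, absolute fallback at gold == 0, non-numeric
# rejection) come from the test names in tests/test_verifier.py.
def compare_float(predicted, gold: float,
                  rel_tol: float = 0.01, abs_tol: float = 1e-6) -> bool:
    try:
        value = float(predicted)
    except (TypeError, ValueError):
        return False  # non-numeric input is rejected, never raised
    if gold == 0.0:
        # A relative tolerance is meaningless at zero; fall back to absolute.
        return abs(value) <= abs_tol
    return abs(value - gold) <= rel_tol * abs(gold)  # inclusive 1% boundary
```

Note the `<=`: a prediction exactly 1% away from gold still passes, matching the `boundary_exactly_1pct` case, while anything just over fails.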
Type coercion failure returns zero reward in integration flow
```shell
uv run pytest tests/test_verifier_integration.py -v -k "type_coercion_failure"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items / 5 deselected / 1 selected
tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]
======================= 1 passed, 5 deselected in 6.95s ========================
```
This matters because invalid numeric answers fail cleanly without crashing the answer flow.
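The "fail cleanly" behavior can be sketched as follows. This is illustrative only: the function name and reward values are assumptions, since the environment's real step/reward API was not read for this demo.

```python
# Hypothetical mapping from answer verification to a terminal reward.
# The name `score_answer` and the 1.0/0.0 values are assumed for this sketch.
def score_answer(predicted: str, gold: int) -> float:
    """Return 1.0 for a verified answer, 0.0 otherwise -- never raise."""
    try:
        correct = int(predicted) == gold
    except (TypeError, ValueError):
        return 0.0  # e.g. ANSWER "forty-two" cannot coerce to int
    return 1.0 if correct else 0.0
```

The key property is that a malformed `ANSWER` produces a zero-reward terminal outcome instead of an exception propagating out of the step flow.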
Test Evidence (Optional)
Supplementary proof that the feature works correctly across all scenarios. The Live Local Proof section above shows how to use the feature; this section shows it was tested.
| Test Suite | Tests | Status |
|---|---|---|
| `tests/test_verifier_integration.py` | 6 | All passed |
| `tests/test_verifier.py` selected happy-path dispatcher tests | 4 selected | All passed |
| `tests/test_verifier.py` selected float edge/error tests | 8 selected | All passed |
| `tests/test_smoke.py` selected ANSWER compatibility test | 1 selected | All passed |
Representative command:
```shell
uv run pytest tests/test_smoke.py -v -k "answer"
```

```
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected
tests/test_smoke.py::TestEnvironment::test_answer_ends_episode_without_budget_decrement PASSED [100%]
======================= 1 passed, 24 deselected in 7.49s =======================
```
Feature Links
- Implementation spec: `specs/F002-IMPLEMENTATION_SPEC.md`
- Verification spec: `specs/F002-VERIFICATION_SPEC.md`
Demo generated by feature-demo agent. Re-run with `/feature-demo F002` to refresh.