specs/F002-DEMO.md

Feature Demo: F002 — Answer Verification

Generated: 2026-03-27T22:37:50Z
Context source: spec + discovery only (implementation not read)
Feature entry: FEATURES.json #F002


What This Feature Does

When an agent submits an ANSWER, this feature makes the final pass/fail decision robust to common formatting and representation differences. From a user perspective, it reduces false negatives where the agent is semantically correct but uses a different format (for example numeric formatting differences, casing differences, or reordered list values).

The intended experience is clear and predictable scoring: tolerant float matching, order-insensitive list matching, and unambiguous terminal reward outcomes with fewer frustrating “technically wrong but practically right” rejections.
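Based on the behaviors described above, the verifier can be pictured as a small dispatcher over the declared answer type. This is a minimal sketch under stated assumptions only; the function name `verify_answer`, the tolerance constants, and the list-input format (comma-separated string) are illustrative, not the project's actual API:

```python
from typing import Any


def verify_answer(predicted: str, gold: Any, answer_type: str) -> bool:
    """Dispatch on the declared answer type; fall back to string match."""
    predicted = predicted.strip()
    if not predicted:  # blank answers fail deterministically
        return False
    if answer_type == "integer":
        try:
            return int(predicted) == int(gold)
        except (TypeError, ValueError):
            return False  # coercion failure is a clean rejection
    if answer_type == "float":
        try:
            p, g = float(predicted), float(gold)
        except (TypeError, ValueError):
            return False
        # assumed rule: 1% relative tolerance, absolute tolerance when gold is 0
        tol = abs(g) * 0.01 if g != 0 else 1e-6
        return abs(p - g) <= tol
    if answer_type == "string":
        return predicted.lower() == str(gold).strip().lower()
    if answer_type == "list":
        # order-insensitive: normalize then sort both sides
        pred_items = sorted(x.strip().lower() for x in predicted.split(","))
        gold_items = sorted(str(x).strip().lower() for x in gold)
        return pred_items == gold_items
    # unknown type: fall back to case-insensitive string comparison
    return predicted.lower() == str(gold).lower()
```

The point of the sketch is the shape of the contract (typed dispatch, tolerant comparison, deterministic rejection of blanks and uncoercible input), not the exact implementation.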


What Is Already Proven

Verified in This Demo Run

  • Happy-path typed verification scenarios pass for integer, float, string, and list dispatch paths.
  • Full integration flow through environment step() passes for integer/float/string/list and fallback behavior.
  • Edge and error behavior is exercised locally: empty predicted input fails, float tolerance boundary checks pass/fail correctly, and integer coercion failure returns zero reward.
  • Existing smoke coverage for answer episode termination still passes.

Previously Verified Evidence

  • specs/FEATURES.json (F002.verification_evidence) records verifier-approved run: uv run pytest tests/ -v with 65/65 passed at 2026-03-27T22:33:12Z.
  • specs/F002-IMPLEMENTATION_SPEC.md Section 7 records completed step evidence including full suite pass and integration pass.

What Still Needs User Verification

  • Run one manual episode in your target runtime (with your exact dataset and environment) and submit a known-correct ANSWER in an alternate format (for example 42.0 instead of 42) to confirm the behavior in your end-to-end setup.

Quickstart / Verification Steps

Run these commands to see the feature in action:

uv run pytest tests/test_verifier_integration.py -v
uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"

Prerequisite: dependencies installed via uv sync.


Live Local Proof

Validate typed ANSWER handling through environment flow

This runs the integration scenarios that exercise answer verification via the real environment step flow.

uv run pytest tests/test_verifier_integration.py -v
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items

tests/test_verifier_integration.py::test_integer_answer_flow PASSED      [ 16%]
tests/test_verifier_integration.py::test_float_answer_flow PASSED        [ 33%]
tests/test_verifier_integration.py::test_string_answer_flow PASSED       [ 50%]
tests/test_verifier_integration.py::test_list_answer_flow PASSED         [ 66%]
tests/test_verifier_integration.py::test_fallback_when_answer_type_missing PASSED [ 83%]
tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]

============================== 6 passed in 7.92s ===============================

What to notice: the flow covers all core answer types plus fallback and failure behavior in one environment-facing test surface.

Confirm happy-path matching behavior for core answer types

This run checks representative dispatcher-level happy cases.

uv run pytest tests/test_verifier.py -v -k "test_verify_integer_exact_match or test_verify_float_within_tolerance or test_verify_string_case_insensitive or test_verify_list_order_insensitive"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 30 deselected / 4 selected

tests/test_verifier.py::test_verify_integer_exact_match PASSED           [ 25%]
tests/test_verifier.py::test_verify_float_within_tolerance PASSED        [ 50%]
tests/test_verifier.py::test_verify_string_case_insensitive PASSED       [ 75%]
tests/test_verifier.py::test_verify_list_order_insensitive PASSED        [100%]

======================= 4 passed, 30 deselected in 7.87s =======================

What to notice: each answer type has at least one direct pass case that aligns to the feature’s success criteria.
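The list case above can be illustrated as a multiset comparison: order is ignored but duplicates still count. Whether duplicates matter in the actual spec is an assumption here; `lists_match` is an illustrative name, not the real helper:

```python
from collections import Counter


def lists_match(predicted: list[str], gold: list[str]) -> bool:
    """Order-insensitive, case-insensitive list comparison (multiset)."""
    def normalize(items: list[str]) -> Counter:
        return Counter(item.strip().lower() for item in items)
    return normalize(predicted) == normalize(gold)
```

Comparing `Counter` objects rather than sorted lists makes the duplicate-sensitivity explicit: `["a"]` and `["a", "a"]` do not match.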


Existing Evidence

  • Prior full regression evidence (not re-run in this demo): uv run pytest tests/ -v => 65 passed (specs/FEATURES.json, F002 verification evidence).

Manual Verification Checklist

  1. Start from a clean shell in project root and run uv sync.
  2. Execute uv run pytest tests/test_verifier_integration.py -v and confirm all 6 integration tests pass.
  3. Execute the happy-path dispatcher command from Quickstart and confirm 4 selected tests pass.
  4. Optionally run uv run pytest tests/ -v to confirm no regressions outside F002.

Edge Cases Exercised

Empty predicted answer is rejected

uv run pytest tests/test_verifier.py -v -k "test_verify_empty_predicted_returns_false"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 33 deselected / 1 selected

tests/test_verifier.py::test_verify_empty_predicted_returns_false PASSED [100%]

======================= 1 passed, 33 deselected in 7.83s =======================

This matters because blank answers should fail deterministically rather than being ambiguously normalized.

Float tolerance boundary and non-numeric rejection

uv run pytest tests/test_verifier.py -v -k "_compare_float"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 34 items / 26 deselected / 8 selected

tests/test_verifier.py::test_compare_float_exact_match PASSED            [ 12%]
tests/test_verifier.py::test_compare_float_within_1pct_tolerance PASSED  [ 25%]
tests/test_verifier.py::test_compare_float_outside_1pct_tolerance PASSED [ 37%]
tests/test_verifier.py::test_compare_float_boundary_exactly_1pct PASSED  [ 50%]
tests/test_verifier.py::test_compare_float_just_over_1pct PASSED         [ 62%]
tests/test_verifier.py::test_compare_float_gold_zero_uses_absolute_tolerance PASSED [ 75%]
tests/test_verifier.py::test_compare_float_gold_zero_fails_large_diff PASSED [ 87%]
tests/test_verifier.py::test_compare_float_non_numeric_returns_false PASSED [100%]

======================= 8 passed, 26 deselected in 7.10s =======================

This matters because it validates both tolerant matching and strict rejection when values exceed tolerance or are invalid.
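The tolerance rule these tests exercise can be sketched as follows. The 1% relative tolerance is taken from the test names above; the absolute epsilon used when the gold value is zero is an assumption, and non-numeric rejection is assumed to happen upstream during coercion:

```python
def compare_float(predicted: float, gold: float,
                  rel_tol: float = 0.01, abs_tol: float = 1e-6) -> bool:
    """Tolerant float match: 1% relative, absolute near gold == 0."""
    if gold == 0.0:
        # relative tolerance degenerates at zero, so fall back to absolute
        return abs(predicted) <= abs_tol
    return abs(predicted - gold) <= rel_tol * abs(gold)
```

Note the boundary semantics implied by the test names: exactly 1% off passes (`<=`), just over 1% fails.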

Type coercion failure returns zero reward in integration flow

uv run pytest tests/test_verifier_integration.py -v -k "type_coercion_failure"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 6 items / 5 deselected / 1 selected

tests/test_verifier_integration.py::test_type_coercion_failure_returns_zero_reward PASSED [100%]

======================= 1 passed, 5 deselected in 6.95s ========================

This matters because invalid numeric answers fail cleanly without crashing the answer flow.
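A sketch of how a coercion failure maps to a terminal zero reward in the step flow, under the assumption that the reward is binary; `answer_reward` and its signature are illustrative, not the environment's real interface:

```python
def answer_reward(predicted: str, gold: int) -> float:
    """Return 1.0 for a verified integer answer, 0.0 otherwise."""
    try:
        value = int(predicted.strip())
    except ValueError:
        return 0.0  # invalid numeric answer fails cleanly, no exception escapes
    return 1.0 if value == gold else 0.0
```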


Test Evidence (Optional)

Supplementary proof that the feature works correctly across all scenarios. The Live Local Proof section above shows how to exercise the feature; this section shows that it was tested.

| Test Suite | Tests | Status |
| --- | --- | --- |
| tests/test_verifier_integration.py | 6 | All passed |
| tests/test_verifier.py (happy-path dispatcher selection) | 4 selected | All passed |
| tests/test_verifier.py (float edge/error selection) | 8 selected | All passed |
| tests/test_smoke.py (ANSWER compatibility selection) | 1 selected | All passed |

Representative command:

uv run pytest tests/test_smoke.py -v -k "answer"
============================= test session starts ==============================
platform darwin -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /Users/hjerp/Projects/sql-env-F002-answer-verification/.venv/bin/python3
cachedir: .pytest_cache
rootdir: /Users/hjerp/Projects/sql-env-F002-answer-verification
configfile: pyproject.toml
plugins: cov-7.1.0, anyio-4.13.0
collecting ... collected 25 items / 24 deselected / 1 selected

tests/test_smoke.py::TestEnvironment::test_answer_ends_episode_without_budget_decrement PASSED [100%]

======================= 1 passed, 24 deselected in 7.49s =======================

Feature Links

  • Implementation spec: specs/F002-IMPLEMENTATION_SPEC.md
  • Verification spec: specs/F002-VERIFICATION_SPEC.md

Demo generated by feature-demo agent. Re-run with /feature-demo F002 to refresh.