math_trainer / VALIDATION_LOG.md
Switch Space trainer defaults to math_conjecture_sota profile and remove DeepSeek references
9a4f619 verified

Space Trainer Validation Log

Date (UTC): 2026-02-28 10:24:36 UTC

Scope Reviewed

Reviewed the full space_trainer/ implementation surface used by the Hugging Face Space runtime:

  • space_trainer/app.py
  • space_trainer/README.md
  • space_trainer/PRODUCTION.md
  • space_trainer/.env.example
  • space_trainer/requirements.txt
  • space_trainer/configs/math_conjecture_sota.yaml
  • space_trainer/scripts/preflight_check.py
  • space_trainer/scripts/train_sota.py
  • space_trainer/scripts/eval_sota.py
  • space_trainer/tests/test_core_utils.py
  • Existing workspace runtime/run artifacts under space_trainer/workspace/

Issues Found

  1. The UI result badge mapping treated the preflight_passed result as neutral because underscores were converted to spaces before the class lookup, so the key no longer matched.
  2. Unit tests failed when run from the repository root due to import path assumptions (ModuleNotFoundError: app).

Fixes Applied

  1. space_trainer/app.py
    • Normalized run result strings in _run_result_badge_class() to handle underscore/space/hyphen variants.
    • Updated recent-runs badge rendering to classify by raw result key and only prettify the display label.
    • Kept Gradio theme/css/head in launch() (Gradio 6.6 recommended path), and set queue configuration once at module load with demo.queue(default_concurrency_limit=1).
  2. space_trainer/tests/test_core_utils.py
    • Added deterministic sys.path insertion for space_trainer/ root so tests pass from both:
      • repo root (python -m unittest discover -s space_trainer/tests -v)
      • space_trainer/ directory (python -m unittest discover -s tests -v)
    • Added regression test for preflight badge-class normalization.
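
A deterministic sys.path insertion of the kind described for the test module might look like the sketch below. The helper name ensure_on_sys_path is hypothetical, and the assumption that the tests directory sits directly under space_trainer/ is inferred from the paths in this log rather than from the test file itself.

```python
import sys
from pathlib import Path


def ensure_on_sys_path(root: Path) -> str:
    """Insert root at the front of sys.path once; return the path string used."""
    root_str = str(root.resolve())
    if root_str not in sys.path:
        sys.path.insert(0, root_str)
    return root_str


# Assumed usage at the top of tests/test_core_utils.py, before "import app":
# ensure_on_sys_path(Path(__file__).resolve().parents[1])
```

Resolving the path from the test file's own location, rather than from the current working directory, is what makes discovery behave the same from the repo root and from space_trainer/.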

Validation Commands and Results

  1. Preflight checks:
    • Command: .venv/bin/python space_trainer/scripts/preflight_check.py --json
    • Result: PASS ("ok": true)
  2. Unit tests from repo root:
    • Command: .venv/bin/python -m unittest discover -s space_trainer/tests -v
    • Result: PASS (Ran 15 tests, OK)
  3. Unit tests from space_trainer/:
    • Command: ../.venv/bin/python -m unittest discover -s tests -v
    • Result: PASS (Ran 15 tests, OK)
  4. Python syntax compile check:
    • Command: ../.venv/bin/python -m py_compile app.py scripts/preflight_check.py scripts/train_sota.py scripts/eval_sota.py tests/test_core_utils.py
    • Result: PASS
  5. Gradio app object/config smoke check:
    • Command: ../.venv/bin/python - <<'PY' ... app.demo.get_config_file() ... PY
    • Result: PASS (mode=blocks, components=44, dependencies=3, queue_set=True)
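
For automated gating, the preflight result can be read from the JSON output rather than from the exit text. The sketch below assumes the script prints a single JSON object to stdout; only the "ok" field is confirmed by this log, and the function name preflight_passed is hypothetical.

```python
import json


def preflight_passed(stdout: str) -> bool:
    """Return True only if stdout is a JSON object whose "ok" field is exactly true."""
    try:
        report = json.loads(stdout)
    except json.JSONDecodeError:
        return False
    # Require the literal boolean so values such as 1 or "true" do not pass.
    return isinstance(report, dict) and report.get("ok") is True
```

Treating unparseable output as a failure keeps a crashed preflight run from being mistaken for a pass.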

Environment Notes

  • CUDA warning appears in this environment (cudaGetDeviceCount OS unsupported). This is expected on non-GPU hosts and handled by app CPU fallback logic.
  • Fast tokenizer fallback warning (protobuf missing) is already handled by project fallback code and validated by tests.
  • Direct local app.py server launch in this sandbox cannot bind any Gradio ports (Cannot find empty port...). This is an execution-environment limitation, not a code-level validation failure.
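
The CPU fallback mentioned in the first note is not shown in this log. A common pattern, given here only as an assumption about how such a fallback could work, is to select the device after checking both that PyTorch imports and that CUDA is usable; the function name pick_device is hypothetical.

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch reports a usable GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the check also works without PyTorch
    except ImportError:
        return "cpu"
    # On hosts without a GPU this may log a CUDA warning and return False.
    return "cuda" if torch.cuda.is_available() else "cpu"
```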

Current Status

  • UI telemetry classification bug fixed.
  • Test reliability improved.
  • Preflight + tests + compile checks are passing.
  • Space runtime code path is consistent and ready for deployment validation inside Hugging Face Spaces.

Rewrite Session

Date (UTC): 2026-02-28 11:56:17 UTC

Objective

  • Rewrite app.py from scratch.
  • Switch UI to a full monochrome theme.
  • Preserve full end-to-end pipeline functionality in a newly structured implementation.

Implementation Summary

  • Replaced space_trainer/app.py entirely with a new architecture and new UI/CSS/HTML structure.
  • Kept all major operational capabilities:
    • dataset download and cache handling
    • runtime config generation
    • staged training subprocess orchestration
    • optional post-training evaluation fallback path
    • quality gate + push status surfacing
    • continuous auto-restart with cooldown and circuit breaker
    • cancellation controls
    • run history persistence and recent-runs panel
  • Kept compatibility for existing tests and tooling contracts (e.g., helper function names used by tests and preflight checks).
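
The continuous auto-restart behavior with cooldown and circuit breaker can be sketched as a small loop. This is an illustration under stated assumptions, not the code in app.py: the function name, its parameters, the default values, and the returned status strings are all hypothetical.

```python
import time
from typing import Callable


def run_with_auto_restart(
    run_once: Callable[[], bool],
    should_cancel: Callable[[], bool],
    cooldown_s: float = 30.0,
    max_consecutive_failures: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Repeat run_once until cancelled or the circuit breaker trips.

    Returns "cancelled" or "circuit_open" (hypothetical status strings).
    """
    consecutive_failures = 0
    while True:
        if should_cancel():
            return "cancelled"
        succeeded = run_once()
        # A success resets the breaker; a failure moves it one step closer to tripping.
        consecutive_failures = 0 if succeeded else consecutive_failures + 1
        if consecutive_failures >= max_consecutive_failures:
            return "circuit_open"
        sleep(cooldown_s)  # cooldown between cycles
```

Counting only consecutive failures lets an occasional failed cycle recover on its own, while a persistent fault stops the loop instead of restarting indefinitely.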

Monochrome Redesign

  • New monochrome command-center visual language with grayscale-only palette.
  • New telemetry card layout, stage timeline, recent-runs view, and loss sparkline styling.
  • New hero header and runtime timestamp script in UI_HEAD.

Verification Executed

  1. Syntax check:
    • ../.venv/bin/python -m py_compile app.py
    • Result: PASS
  2. Preflight:
    • ../.venv/bin/python scripts/preflight_check.py --json
    • Result: PASS ("ok": true)
  3. Unit tests:
    • ../.venv/bin/python -m unittest discover -s tests -v
    • Result: PASS (Ran 15 tests, OK)
  4. Gradio config smoke check:
    • ../.venv/bin/python - <<'PY' ... app.demo.get_config_file() ... PY
    • Result: PASS (mode=blocks, components=44, dependencies=3, stage_count=4)

Footer + Continuous Enforcement Session

Date (UTC): 2026-02-28 12:45:36 UTC

Requested Changes

  • Remove the default Gradio footer controls (Use via API, logo, settings).
  • Place API/settings access in a better UI location.
  • Ensure training runs in continuous mode.

Implementation

  1. Footer controls removed from Gradio launch:
    • Added footer_links=[] in demo.launch(...).
  2. API/settings moved into hero section:
    • Added .mono-link-row with:
      • /gradio_api/docs
      • https://huggingface.co/spaces/NorthernTribe-Research/math_trainer/settings
    • Added matching CSS styles for the new header links.
  3. Continuous mode enforced:
    • Runtime enforcement in run_pipeline(...):
      • continuous_mode = not bool(preflight_only)
    • UI control locked to enforced-on:
      • Continuous Auto-Restart (Enforced) with interactive=False.
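
The runtime rule above can be isolated into a small helper to make its truthiness behavior explicit. The expression is taken from this log; the wrapper name resolve_continuous_mode is hypothetical and is not claimed to exist in run_pipeline(...).

```python
def resolve_continuous_mode(preflight_only: object) -> bool:
    """Continuous mode is forced on for every run except a preflight-only run."""
    # bool() coerces UI values such as None, 0, or "" to False,
    # so any falsy preflight flag yields continuous mode.
    return not bool(preflight_only)
```

Because the value is derived only from preflight_only, whatever the locked UI checkbox reports has no effect on the outcome.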

Verification

  • ../.venv/bin/python -m py_compile app.py -> PASS
  • ../.venv/bin/python scripts/preflight_check.py --json -> PASS
  • ../.venv/bin/python -m unittest discover -s tests -v -> PASS (Ran 15 tests, OK)

Deployment

  • Space: NorthernTribe-Research/math_trainer
  • Commit: c8a24f966d710173764da0355e56632af9e66c40
  • Runtime after deploy: RUNNING
  • https://northerntribe-research-math-trainer.hf.space/config -> 200 JSON
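
The post-deploy check of the /config endpoint could be scripted along the following lines. This is a sketch using only the Python standard library; the function name and the injectable opener parameter are assumptions added so the check can be exercised without network access.

```python
import json
import urllib.request
from typing import Callable


def config_endpoint_ok(url: str, opener: Callable = urllib.request.urlopen) -> bool:
    """Return True if url answers HTTP 200 with a body that parses as JSON."""
    try:
        with opener(url, timeout=10) as response:
            if response.status != 200:
                return False
            json.loads(response.read().decode("utf-8"))
            return True
    except (OSError, ValueError):
        # OSError covers network and HTTP errors raised by urllib;
        # ValueError covers invalid JSON and undecodable bytes.
        return False
```

Called as config_endpoint_ok("https://northerntribe-research-math-trainer.hf.space/config"), it reports whether the Space is serving its Gradio config.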