NorthernTribe-Research committed
Commit f0734c2 · verified · 1 Parent(s): 9476b56

Production UI/telemetry upgrade + monochrome theme + safety hardening

.env.example ADDED
@@ -0,0 +1,14 @@
+ # Hugging Face authentication (set at runtime; do not commit real tokens)
+ HF_TOKEN=
+ # HUGGINGFACE_HUB_TOKEN=
+
+ # Continuous mode safety controls
+ # Seconds to wait between auto-restart cycles.
+ CONTINUOUS_RESTART_DELAY_SECONDS=15
+ # Stop continuous mode after this many consecutive non-success cycles.
+ CONTINUOUS_MAX_CONSECUTIVE_FAILURES=3
+
+ # Max retained characters in live app log buffer.
+ APP_LOG_MAX_CHARS=200000
+ # Max number of entries retained in workspace/runtime/run_history.json.
+ RUN_HISTORY_LIMIT=80
PRODUCTION.md ADDED
@@ -0,0 +1,51 @@
+ # Production Runbook
+
+ ## 1. Pre-Deploy Checks
+
+ Run all checks from `space_trainer/`:
+
+ ```bash
+ python scripts/preflight_check.py
+ python -m unittest discover -s tests -v
+ ```
+
+ Optional deeper check (loads tokenizer/model dependencies and runs stage-1 dry-run):
+
+ ```bash
+ python scripts/preflight_check.py --run-training-dry-run
+ ```
+
+ ## 2. Runtime Configuration
+
+ Set runtime secrets in Hugging Face Space settings:
+
+ - `HF_TOKEN` (or `HUGGINGFACE_HUB_TOKEN`)
+
+ Optional safety overrides:
+
+ - `CONTINUOUS_RESTART_DELAY_SECONDS` (default `15`)
+ - `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (default `3`)
+ - `APP_LOG_MAX_CHARS` (default `200000`)
+ - `RUN_HISTORY_LIMIT` (default `80`)
+
+ ## 3. Release Checklist
+
+ 1. Ensure pre-deploy checks are green.
+ 2. Ensure `requirements.txt` includes all runtime dependencies.
+ 3. Deploy Space files (exclude `workspace/` artifacts).
+ 4. Wait for Space runtime stage to reach `RUNNING`.
+ 5. Trigger a UI preflight run (`Validation Mode (No Training)`).
+ 6. Trigger one non-autonomous single-stage run before enabling continuous autonomous mode.
+ 7. Confirm `workspace/runtime/run_history.json` is being updated and recent run cards render in telemetry.
+
+ ## 4. Rollback Strategy
+
+ 1. Re-deploy the last known good commit to the Space.
+ 2. Disable `Continuous Auto-Restart`.
+ 3. Run preflight mode only until health is restored.
+ 4. Re-enable autonomous/continuous mode after one successful full run.
+
+ ## 5. Operational Notes
+
+ - Full run records are persisted under `workspace/runtime/run_records/`.
+ - The compact run index at `workspace/runtime/run_history.json` is capped by `RUN_HISTORY_LIMIT`.
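The runbook's continuous-mode fail-safes (restart cooldown plus a consecutive-failure circuit breaker) can be sketched as a loop like the one below. This is a minimal illustration, not the app's actual control flow; `run_once` is a hypothetical callable standing in for one pipeline cycle, and `max_cycles` exists only to bound the sketch.

```python
import time

def continuous_loop(run_once, restart_delay=15, max_consecutive_failures=3, max_cycles=100):
    """Run pipeline cycles until the circuit breaker trips or max_cycles is hit.

    run_once is a hypothetical callable returning a result string such as
    "completed" or "failed". Returns the number of cycles executed.
    """
    consecutive_failures = 0
    cycles = 0
    while cycles < max_cycles:
        result = run_once()
        cycles += 1
        if result == "completed":
            consecutive_failures = 0  # any success resets the breaker
        else:
            consecutive_failures += 1
            if consecutive_failures >= max_consecutive_failures:
                break  # circuit breaker: stop auto-restarting
        time.sleep(restart_delay)  # cooldown between cycles
    return cycles
```

The cooldown prevents tight restart storms on a shared Space; the breaker stops a permanently broken configuration from burning compute indefinitely.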
README.md CHANGED
@@ -39,7 +39,7 @@ Credentials and publish permissions are handled by deployment runtime settings.
  - `Run Evaluation After Training`: toggles post-train eval in runtime config.
  - `Enforce Quality Gate`: enables/disables promotion gate checks.
  - `Gate Min pass@1`, `Gate Min pass@k`, `Gate Min Rows`: runtime gate thresholds.
- - `Live Telemetry`: real-time stage progression, runtime posture, and training-loss graph (sparkline) with gate/push state.
+ - `Live Telemetry`: real-time stage progression, runtime posture, training-loss graph (sparkline), artifact index, and recent-run outcomes.
  - `Run Log (Live Log + Summary JSON)`: unified panel for line-by-line runtime stream, heartbeats, and structured run summary.
  - `Validation Mode (No Training)`: validates pipeline with `--dry-run`.
  - `Force Dataset Redownload`: bypasses cached parquet files.
@@ -52,8 +52,44 @@ Credentials and publish permissions are handled by deployment runtime settings.
  - final adapter: `workspace/runs/math-conjecture-sota/final_adapter`
  - training summary: `workspace/runs/math-conjecture-sota/training_summary.json`
  - post-eval report: `workspace/runs/math-conjecture-sota/post_eval_report.json`
+ - run history index: `workspace/runtime/run_history.json`
+ - per-run records: `workspace/runtime/run_records/<run_label>.json`

  ## Notes

  - Full training runs on GPU when available and automatically falls back to CPU mode when CUDA is unavailable.
  - App handles Gradio copy-button compatibility across versions automatically.
+
+ ## Production Hardening
+
+ Before deployment, run:
+
+ ```bash
+ python scripts/preflight_check.py
+ python -m unittest discover -s tests -v
+ ```
+
+ Optional deeper validation (runs stage-1 dry-run training check):
+
+ ```bash
+ python scripts/preflight_check.py --run-training-dry-run
+ ```
+
+ Continuous mode now includes two production fail-safes:
+
+ - restart cooldown between cycles (default `15s`)
+ - circuit breaker that stops after consecutive non-success cycles (default `3`)
+
+ Environment overrides:
+
+ - `CONTINUOUS_RESTART_DELAY_SECONDS` (integer, `>=0`)
+ - `CONTINUOUS_MAX_CONSECUTIVE_FAILURES` (integer, `>=1`)
+ - `APP_LOG_MAX_CHARS` (integer, `>=20000`; default `200000`)
+ - `RUN_HISTORY_LIMIT` (integer, `>=5`; default `80`)
+
+ Recommended runtime secrets posture:
+
+ - set `HF_TOKEN` / `HUGGINGFACE_HUB_TOKEN` from Space secrets
+ - avoid storing long-lived API tokens in repository files
+
+ Detailed deployment/rollback steps are documented in `PRODUCTION.md`.
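The run-history cap described above (`RUN_HISTORY_LIMIT` bounding `workspace/runtime/run_history.json`) amounts to trim-on-append. A minimal sketch under stated assumptions — `append_run_entry` is a hypothetical helper, not the function name used in `app.py`:

```python
import json
from pathlib import Path

RUN_HISTORY_LIMIT = 80  # mirrors the env-var default documented above

def append_run_entry(history_path: Path, entry: dict, limit: int = RUN_HISTORY_LIMIT) -> list:
    # Hypothetical helper: load the compact index, append the newest run,
    # keep only the most recent `limit` entries, and write the file back.
    try:
        history = json.loads(history_path.read_text())
    except (FileNotFoundError, json.JSONDecodeError):
        history = []  # missing or corrupt index: start fresh
    history.append(entry)
    history = history[-limit:]
    history_path.parent.mkdir(parents=True, exist_ok=True)
    history_path.write_text(json.dumps(history, indent=2))
    return history
```

Trimming on every append keeps the index bounded no matter how long continuous mode runs, while full per-run records live separately under `workspace/runtime/run_records/`.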
app.py CHANGED
@@ -16,6 +16,7 @@ import subprocess
  import sys
  import threading
  import time
  from pathlib import Path
  from typing import Any, Dict, Generator, List, Optional, Tuple
@@ -29,6 +30,9 @@ ROOT = Path(__file__).resolve().parent
  WORKSPACE_DIR = ROOT / "workspace"
  DATA_DIR = WORKSPACE_DIR / "data" / "releases" / "v1"
  RUNTIME_DIR = WORKSPACE_DIR / "runtime"
  CONFIG_TEMPLATE = ROOT / "configs" / "deepseek_math_sota.yaml"
  TRAIN_SCRIPT = ROOT / "scripts" / "train_sota.py"
  EVAL_SCRIPT = ROOT / "scripts" / "eval_sota.py"
@@ -48,6 +52,26 @@ CANCEL_REQUESTED = False
  ACTIVE_PROCESS: Optional[subprocess.Popen] = None
  ACTIVE_RUN_LABEL = ""

  PROJECT_DESCRIPTION = """
  ## Pipeline Scope
  This app runs the full training pipeline for the `maths-conjuncture-solutions` project.
@@ -104,131 +128,131 @@ UI_CSS = """
  @import url("https://fonts.googleapis.com/css2?family=Rajdhani:wght@500;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap");

  :root {
-   --nt-bg: #05090d;
-   --nt-bg-2: #0a1118;
-   --nt-bg-radial: rgba(88, 122, 159, 0.16);
-   --nt-panel: #0e151d;
-   --nt-panel-2: #111a24;
-   --nt-border: #243546;
-   --nt-border-strong: #3b566f;
-   --nt-text: #d9e6f2;
-   --nt-muted: #8fa6bc;
-   --nt-accent: #84aacb;
-   --nt-accent-2: #bed4e8;
-   --nt-shadow: rgba(0, 0, 0, 0.52);
-   --nt-hero-border: #4e6780;
-   --nt-hero-bg-1: #101720;
-   --nt-hero-bg-2: #182634;
-   --nt-hero-bg-3: #23364a;
-   --nt-hero-text: #eef5fc;
-   --nt-hero-topline: #a5bdd2;
-   --nt-hero-body: #d3e1ef;
-   --nt-hero-grid: rgba(130, 169, 205, 0.14);
-   --nt-chip-border: rgba(164, 194, 221, 0.42);
-   --nt-chip-bg: rgba(8, 15, 23, 0.66);
-   --nt-chip-text: #ddeaf6;
-   --nt-input-bg: #0a1219;
-   --nt-focus-ring: rgba(131, 176, 218, 0.26);
-   --nt-btn-border: #425f79;
-   --nt-btn-bg-1: #1a2734;
-   --nt-btn-bg-2: #111c28;
-   --nt-btn-text: #d9e8f5;
-   --nt-primary-btn-border: #6f93b8;
-   --nt-primary-btn-bg-1: #446381;
-   --nt-primary-btn-bg-2: #2c435a;
-   --nt-primary-btn-text: #f6fbff;
-   --nt-stop-btn-border: #85714b;
-   --nt-stop-btn-bg-1: #715f3f;
-   --nt-stop-btn-bg-2: #554731;
-   --nt-stop-btn-text: #f8f3e7;
-   --nt-ops-panel-1: #121c27;
-   --nt-ops-panel-2: #0d151f;
-   --nt-card-border: #2f4458;
-   --nt-card-bg-1: #172332;
-   --nt-card-bg-2: #111c28;
-   --nt-metric-text: #e6f0f9;
-   --nt-meter-border: #4a637b;
-   --nt-meter-bg: #1e2f3f;
-   --nt-meter-fill-1: #8db1d3;
-   --nt-meter-fill-2: #5f85ad;
-   --nt-spark-border: #425e78;
-   --nt-spark-bg: #0c1824;
-   --nt-spark-stroke: #9dc3e6;
-   --nt-settings-border: rgba(161, 191, 218, 0.48);
-   --nt-settings-bg: rgba(7, 13, 19, 0.64);
-   --nt-settings-bg-hover: rgba(8, 14, 22, 0.85);
-   --nt-settings-text: #e5f0fa;
-   --nt-settings-menu-bg: #111d2a;
-   --nt-settings-menu-border: #49647e;
-   --nt-settings-menu-shadow: rgba(0, 0, 0, 0.58);
-   --nt-settings-menu-btn: #e5f1fc;
-   --nt-settings-menu-btn-hover: #26394e;
-   --nt-settings-menu-btn-active-bg: #7ba0c4;
-   --nt-settings-menu-btn-active-text: #0f2336;
  }

  :root[data-nt-theme="light"] {
-   --nt-bg: #d5dde6;
-   --nt-bg-2: #c5d0dc;
-   --nt-bg-radial: rgba(255, 255, 255, 0.65);
-   --nt-panel: #edf2f8;
-   --nt-panel-2: #e1e8f0;
-   --nt-border: #9cadbe;
-   --nt-border-strong: #6e879f;
-   --nt-text: #122231;
-   --nt-muted: #3f566d;
-   --nt-accent: #35506c;
-   --nt-accent-2: #253d54;
-   --nt-shadow: rgba(36, 53, 73, 0.18);
-   --nt-hero-border: #3f556d;
-   --nt-hero-bg-1: #202f3f;
-   --nt-hero-bg-2: #2a3f54;
-   --nt-hero-bg-3: #35506a;
-   --nt-hero-text: #f3f8fd;
-   --nt-hero-topline: #d3dfeb;
-   --nt-hero-body: #e0eaf4;
-   --nt-hero-grid: rgba(189, 214, 236, 0.18);
-   --nt-chip-border: rgba(213, 227, 238, 0.52);
-   --nt-chip-bg: rgba(18, 30, 44, 0.5);
-   --nt-chip-text: #ecf4fb;
-   --nt-input-bg: #f7fafc;
-   --nt-focus-ring: rgba(54, 81, 108, 0.22);
-   --nt-btn-border: #667f98;
-   --nt-btn-bg-1: #eef3f8;
-   --nt-btn-bg-2: #dce5ee;
-   --nt-btn-text: #1a2d41;
-   --nt-primary-btn-border: #3e5e7d;
-   --nt-primary-btn-bg-1: #4b6f92;
-   --nt-primary-btn-bg-2: #35506d;
-   --nt-primary-btn-text: #f6fbff;
-   --nt-stop-btn-border: #866f47;
-   --nt-stop-btn-bg-1: #8d7850;
-   --nt-stop-btn-bg-2: #6c5a3b;
-   --nt-stop-btn-text: #fff8ea;
-   --nt-ops-panel-1: #eff4f9;
-   --nt-ops-panel-2: #dde6ee;
-   --nt-card-border: #98adbf;
-   --nt-card-bg-1: #f7fbff;
-   --nt-card-bg-2: #e5edf5;
-   --nt-metric-text: #172a3d;
-   --nt-meter-border: #7e98b0;
-   --nt-meter-bg: #c7d5e2;
-   --nt-meter-fill-1: #486988;
-   --nt-meter-fill-2: #2c455f;
-   --nt-spark-border: #89a1b6;
-   --nt-spark-bg: #eef5fb;
-   --nt-spark-stroke: #355271;
-   --nt-settings-border: rgba(226, 236, 247, 0.72);
-   --nt-settings-bg: rgba(29, 45, 64, 0.52);
-   --nt-settings-bg-hover: rgba(22, 36, 53, 0.78);
-   --nt-settings-text: #f2f7fc;
-   --nt-settings-menu-bg: #e8eff7;
-   --nt-settings-menu-border: #869db2;
-   --nt-settings-menu-shadow: rgba(20, 35, 50, 0.24);
-   --nt-settings-menu-btn: #1b3147;
-   --nt-settings-menu-btn-hover: #cedae7;
-   --nt-settings-menu-btn-active-bg: #355271;
-   --nt-settings-menu-btn-active-text: #f4f9ff;
  }

  @keyframes ntFadeUp {
@@ -246,8 +270,8 @@ UI_CSS = """
  color: var(--nt-text) !important;
  font-family: "Rajdhani", "Segoe UI", sans-serif !important;
  background:
-   linear-gradient(rgba(126, 165, 200, 0.06) 1px, transparent 1px),
-   linear-gradient(90deg, rgba(126, 165, 200, 0.06) 1px, transparent 1px),
    radial-gradient(circle at 12% 10%, var(--nt-bg-radial) 0%, rgba(255, 255, 255, 0) 44%),
    linear-gradient(145deg, var(--nt-bg) 0%, var(--nt-bg-2) 100%) !important;
  background-size: 24px 24px, 24px 24px, 100% 100%, 100% 100% !important;
@@ -266,7 +290,7 @@ UI_CSS = """
  background: linear-gradient(180deg, var(--nt-panel) 0%, var(--nt-panel-2) 100%) !important;
  border: 1px solid var(--nt-border) !important;
  border-radius: 8px !important;
- box-shadow: inset 0 0 0 1px rgba(143, 174, 201, 0.05), 0 12px 26px var(--nt-shadow) !important;
  }

  .app-hero {
@@ -277,7 +301,7 @@ UI_CSS = """
  background: linear-gradient(132deg, var(--nt-hero-bg-1) 0%, var(--nt-hero-bg-2) 62%, var(--nt-hero-bg-3) 100%);
  color: var(--nt-hero-text);
  padding: 22px 24px 20px;
- box-shadow: inset 0 0 0 1px rgba(170, 199, 224, 0.06), 0 18px 38px rgba(0, 0, 0, 0.38);
  animation: ntFadeUp 500ms ease-out both;
  }

@@ -425,7 +449,7 @@ UI_CSS = """
  .nt-settings-menu .nt-mode-btn.active {
  background: var(--nt-settings-menu-btn-active-bg) !important;
  color: var(--nt-settings-menu-btn-active-text) !important;
- box-shadow: inset 0 0 0 1px rgba(234, 244, 253, 0.25) !important;
  }

  .section-copy {
@@ -524,7 +548,7 @@ UI_CSS = """

  .gradio-container button:hover {
  transform: translateY(-1px);
- box-shadow: 0 0 0 1px rgba(167, 198, 226, 0.24), 0 10px 20px rgba(0, 0, 0, 0.35);
  filter: brightness(1.03);
  }
@@ -535,7 +559,7 @@ UI_CSS = """
  background: linear-gradient(180deg, var(--nt-ops-panel-1) 0%, var(--nt-ops-panel-2) 100%);
  padding: 14px;
  animation: ntFadeUp 420ms ease-out both;
- box-shadow: inset 0 0 0 1px rgba(141, 172, 199, 0.05), 0 12px 28px rgba(0, 0, 0, 0.34);
  }

  .ops-visual-head {
@@ -575,7 +599,11 @@ UI_CSS = """
  padding: 9px 10px;
  min-height: 86px;
  animation: ntFadeUp 460ms ease-out both;
- box-shadow: inset 0 0 0 1px rgba(149, 180, 206, 0.04);
  }

  .ops-k {
@@ -636,6 +664,128 @@ UI_CSS = """
  display: block;
  }

  .gradio-container [data-testid="footer"],
  .gradio-container .built-with {
  display: none !important;
@@ -691,6 +841,10 @@ UI_CSS = """
  .ops-grid {
  grid-template-columns: 1fr;
  }
  }
  """
@@ -963,14 +1117,14 @@ DEFAULT_AUTO_PUSH_TO_HUB = bool(TEMPLATE_HUB.get("push_to_hub", True))


  def now_ts() -> str:
-     return dt.datetime.utcnow().strftime("%Y-%m-%d %H:%M:%S UTC")


  def append_log(lines: List[str], message: str) -> str:
      lines.append(f"[{now_ts()}] {message}")
      text = "\n".join(lines)
-     if len(text) > 200_000:
-         text = text[-200_000:]
      return text

@@ -991,7 +1145,7 @@ def compose_ops_console(log_text: str, summary_json: str) -> str:
      return payload


- def _merge_log_chunk(existing: str, chunk: str, max_chars: int = 200_000) -> str:
      if not chunk:
          return existing
      merged = existing
@@ -1031,6 +1185,246 @@ def _fmt_float(value: Any, digits: int = 3) -> str:
      return "--"


  def _extract_loss_values(log_text: str, limit: int = 48) -> List[float]:
      losses: List[float] = []
      for line in log_text.splitlines():
@@ -1137,18 +1531,67 @@ def _infer_stage_snapshot(summary: Dict[str, Any], log_text: str) -> Dict[str, A
      }


  def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str) -> str:
      safe_summary = _as_dict(summary)
      runtime = _as_dict(safe_summary.get("runtime"))
      quality_gate = _as_dict(safe_summary.get("quality_gate"))
      evaluation = _as_dict(safe_summary.get("evaluation"))
      push_report = _as_dict(safe_summary.get("push"))

      run_label = html.escape(str(safe_summary.get("run_label") or "not-started"))
      status_value = html.escape(status_text or "Idle")
      runtime_mode = "GPU READY" if runtime.get("cuda_available") else "CPU FALLBACK"
      runtime_mode = html.escape(runtime_mode)
      device_count = _safe_int(runtime.get("cuda_device_count"), 0)

      gate_enabled = bool(quality_gate.get("enabled"))
      gate_passed = quality_gate.get("passed")
@@ -1177,6 +1620,7 @@ def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str)
      pass_k = _fmt_pct(evaluation.get("pass_at_k"))
      pass_1 = _fmt_pct(evaluation.get("pass_at_1"))
      exact_k = _fmt_pct(evaluation.get("exact_at_k"))

      push_state = "Pending"
      if push_report:
@@ -1189,6 +1633,30 @@ def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str)
      else:
          push_state = "Blocked"

      return f"""
      <div class="ops-visual">
        <div class="ops-visual-head">
@@ -1206,6 +1674,11 @@ def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str)
          <div class="ops-v">{runtime_mode}</div>
          <div class="ops-v-small">cuda devices: {device_count}</div>
        </div>
        <div class="ops-card">
          <div class="ops-k">Stage Progress</div>
          <div class="ops-v">{stage_meta['completed']} / {stage_meta['stage_count']}</div>
@@ -1220,9 +1693,31 @@ def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str)
        <div class="ops-card">
          <div class="ops-k">Eval pass@k</div>
          <div class="ops-v">{pass_k}</div>
-         <div class="ops-v-small">pass@1 {pass_1} | exact@k {exact_k}</div>
        </div>
        <div class="ops-card">
          <div class="ops-k">Loss Stream</div>
          {sparkline_html}
        </div>
@@ -1271,9 +1766,26 @@ def validate_repo_id(repo_id: str, field_name: str) -> str:
      return value


  def ensure_workspace() -> None:
      DATA_DIR.mkdir(parents=True, exist_ok=True)
      RUNTIME_DIR.mkdir(parents=True, exist_ok=True)


  def run_runtime_snapshot() -> Dict[str, Any]:
@@ -1575,7 +2087,7 @@ def run_pipeline_core(
  ) -> Generator[Tuple[str, str, str], None, None]:
      log_lines: List[str] = []
      summary: Dict[str, Any] = {}
-     run_label = dt.datetime.utcnow().strftime("run-%Y%m%d-%H%M%S")

      if not begin_run(run_label):
          append_log(log_lines, "A run is already in progress. Wait for it to finish or click Stop.")
@@ -1618,12 +2130,7 @@ def run_pipeline_core(
      force_redownload = False
      preflight_only = False

-     if stage_start < 1:
-         raise ValueError("Start stage must be >= 1.")
-     if stage_start > TEMPLATE_STAGE_COUNT:
-         raise ValueError(f"Start stage must be <= {TEMPLATE_STAGE_COUNT}.")
-     if stage_count < 1:
-         raise ValueError("How many stages must be >= 1.")
      if eval_k < 1:
          raise ValueError("Eval K must be >= 1.")
      if eval_samples < 1:
@@ -1663,6 +2170,7 @@ def run_pipeline_core(
      "force_redownload": bool(force_redownload),
      "preflight_only": bool(preflight_only),
      "runtime": runtime,
      }
  )

@@ -1746,6 +2254,9 @@ def run_pipeline_core(
      env.pop("HF_TOKEN", None)
      env.pop("HUGGINGFACE_HUB_TOKEN", None)
      env["PYTHONUNBUFFERED"] = "1"

      train_cmd = [
          sys.executable,
@@ -1782,6 +2293,7 @@ def run_pipeline_core(
      summary["result"] = "cancelled"
      summary["finished_at_utc"] = now_ts()
      append_log(log_lines, "Run cancelled by user.")
      yield "\n".join(log_lines), "Cancelled", summary_text(summary)
      return

@@ -1789,6 +2301,7 @@ def run_pipeline_core(
      summary["result"] = "failed"
      summary["failure_stage"] = "training"
      summary["finished_at_utc"] = now_ts()
      yield "\n".join(log_lines), "Failed", summary_text(summary)
      return
@@ -1796,6 +2309,7 @@ def run_pipeline_core(
      summary["result"] = "preflight_passed"
      summary["finished_at_utc"] = now_ts()
      append_log(log_lines, "Validation mode completed successfully.")
      yield "\n".join(log_lines), "Preflight complete", summary_text(summary)
      return

@@ -1884,6 +2398,7 @@ def run_pipeline_core(
      summary["result"] = "cancelled"
      summary["finished_at_utc"] = now_ts()
      append_log(log_lines, "Run cancelled by user.")
      yield "\n".join(log_lines), "Cancelled", summary_text(summary)
      return

@@ -1891,6 +2406,7 @@ def run_pipeline_core(
      summary["result"] = "failed"
      summary["failure_stage"] = "evaluation"
      summary["finished_at_utc"] = now_ts()
      yield "\n".join(log_lines), "Failed", summary_text(summary)
      return

@@ -1911,16 +2427,25 @@ def run_pipeline_core(
      summary["result"] = "completed"
      summary["finished_at_utc"] = now_ts()
      append_log(log_lines, "Pipeline completed.")
      yield "\n".join(log_lines), "Completed", summary_text(summary)
  except Exception as exc:
      cancelled = is_cancel_requested() or str(exc) == "Run cancelled by user."
      summary["result"] = "cancelled" if cancelled else "failed"
-     summary["error"] = {"type": type(exc).__name__, "message": str(exc)}
      summary["finished_at_utc"] = now_ts()
      append_log(
          log_lines,
          f"Pipeline {'cancelled' if cancelled else 'failed'}: {type(exc).__name__}: {exc}",
      )
      yield "\n".join(log_lines), "Cancelled" if cancelled else "Failed", summary_text(summary)
  finally:
      finish_run()
@@ -1946,6 +2471,7 @@ def run_pipeline(
      preflight_only: bool,
  ) -> Generator[Tuple[str, str, str], None, None]:
      cycle_index = 1
      continuous_mode = bool(continuous_mode)
      if preflight_only and continuous_mode:
          continuous_mode = False
@@ -2013,6 +2539,19 @@ def run_pipeline(
      yield compose_ops_console(session_logs, final_summary_json), stop_status, final_visual
      break

      session_logs = _merge_log_chunk(
          session_logs,
          f"[{now_ts()}] Continuous mode: cycle {cycle_index} finished with result="
@@ -2020,6 +2559,28 @@ def run_pipeline(
      )
      restart_status = f"Cycle {cycle_index}: restarting"
      yield compose_ops_console(session_logs, final_summary_json), restart_status, final_visual
      cycle_index += 1
 
  import sys
  import threading
  import time
+ import traceback
  from pathlib import Path
  from typing import Any, Dict, Generator, List, Optional, Tuple

  WORKSPACE_DIR = ROOT / "workspace"
  DATA_DIR = WORKSPACE_DIR / "data" / "releases" / "v1"
  RUNTIME_DIR = WORKSPACE_DIR / "runtime"
+ HF_HOME_DIR = WORKSPACE_DIR / ".hf_home"
+ HF_DATASETS_CACHE_DIR = HF_HOME_DIR / "datasets"
+ HF_HUB_CACHE_DIR = HF_HOME_DIR / "hub"
  CONFIG_TEMPLATE = ROOT / "configs" / "deepseek_math_sota.yaml"
  TRAIN_SCRIPT = ROOT / "scripts" / "train_sota.py"
  EVAL_SCRIPT = ROOT / "scripts" / "eval_sota.py"

  ACTIVE_PROCESS: Optional[subprocess.Popen] = None
  ACTIVE_RUN_LABEL = ""

+
+ def _env_int(name: str, default: int, min_value: int = 0) -> int:
+     raw = (os.environ.get(name) or "").strip()
+     if not raw:
+         return default
+     try:
+         value = int(raw)
+     except ValueError:
+         return default
+     return max(min_value, value)
+
+
+ CONTINUOUS_RESTART_DELAY_SECONDS = _env_int("CONTINUOUS_RESTART_DELAY_SECONDS", default=15, min_value=0)
+ CONTINUOUS_MAX_CONSECUTIVE_FAILURES = _env_int("CONTINUOUS_MAX_CONSECUTIVE_FAILURES", default=3, min_value=1)
+ APP_LOG_MAX_CHARS = _env_int("APP_LOG_MAX_CHARS", default=200_000, min_value=20_000)
+ RUN_HISTORY_LIMIT = _env_int("RUN_HISTORY_LIMIT", default=80, min_value=5)
+ RECENT_RUNS_VISUAL_LIMIT = 6
+ RUN_HISTORY_PATH = RUNTIME_DIR / "run_history.json"
+ RUN_RECORDS_DIR = RUNTIME_DIR / "run_records"
+
  PROJECT_DESCRIPTION = """
  ## Pipeline Scope
  This app runs the full training pipeline for the `maths-conjuncture-solutions` project.
  @import url("https://fonts.googleapis.com/css2?family=Rajdhani:wght@500;600;700&family=IBM+Plex+Mono:wght@400;500&display=swap");

  :root {
+   --nt-bg: #0a0a0a;
+   --nt-bg-2: #131313;
+   --nt-bg-radial: rgba(255, 255, 255, 0.08);
+   --nt-panel: #151515;
+   --nt-panel-2: #1c1c1c;
+   --nt-border: #313131;
+   --nt-border-strong: #515151;
+   --nt-text: #ececec;
+   --nt-muted: #adadad;
+   --nt-accent: #f5f5f5;
+   --nt-accent-2: #ffffff;
+   --nt-shadow: rgba(0, 0, 0, 0.58);
+   --nt-hero-border: #5a5a5a;
+   --nt-hero-bg-1: #181818;
+   --nt-hero-bg-2: #252525;
+   --nt-hero-bg-3: #333333;
+   --nt-hero-text: #f7f7f7;
+   --nt-hero-topline: #cecece;
+   --nt-hero-body: #e1e1e1;
+   --nt-hero-grid: rgba(255, 255, 255, 0.09);
+   --nt-chip-border: rgba(212, 212, 212, 0.35);
+   --nt-chip-bg: rgba(24, 24, 24, 0.72);
+   --nt-chip-text: #f0f0f0;
+   --nt-input-bg: #121212;
+   --nt-focus-ring: rgba(220, 220, 220, 0.2);
+   --nt-btn-border: #5a5a5a;
+   --nt-btn-bg-1: #2b2b2b;
+   --nt-btn-bg-2: #1e1e1e;
+   --nt-btn-text: #efefef;
+   --nt-primary-btn-border: #8a8a8a;
+   --nt-primary-btn-bg-1: #505050;
+   --nt-primary-btn-bg-2: #3a3a3a;
+   --nt-primary-btn-text: #ffffff;
+   --nt-stop-btn-border: #6e6e6e;
+   --nt-stop-btn-bg-1: #464646;
+   --nt-stop-btn-bg-2: #303030;
+   --nt-stop-btn-text: #f8f8f8;
+   --nt-ops-panel-1: #171717;
+   --nt-ops-panel-2: #101010;
+   --nt-card-border: #3c3c3c;
+   --nt-card-bg-1: #232323;
+   --nt-card-bg-2: #1a1a1a;
+   --nt-metric-text: #f2f2f2;
+   --nt-meter-border: #5a5a5a;
+   --nt-meter-bg: #2d2d2d;
+   --nt-meter-fill-1: #d4d4d4;
+   --nt-meter-fill-2: #8c8c8c;
+   --nt-spark-border: #555555;
+   --nt-spark-bg: #161616;
+   --nt-spark-stroke: #dddddd;
+   --nt-settings-border: rgba(200, 200, 200, 0.42);
+   --nt-settings-bg: rgba(20, 20, 20, 0.72);
+   --nt-settings-bg-hover: rgba(32, 32, 32, 0.9);
+   --nt-settings-text: #f2f2f2;
+   --nt-settings-menu-bg: #1d1d1d;
+   --nt-settings-menu-border: #545454;
+   --nt-settings-menu-shadow: rgba(0, 0, 0, 0.62);
+   --nt-settings-menu-btn: #f0f0f0;
+   --nt-settings-menu-btn-hover: #303030;
+   --nt-settings-menu-btn-active-bg: #6a6a6a;
+   --nt-settings-menu-btn-active-text: #101010;
  }

  :root[data-nt-theme="light"] {
+   --nt-bg: #f2f2f2;
+   --nt-bg-2: #e5e5e5;
+   --nt-bg-radial: rgba(255, 255, 255, 0.8);
+   --nt-panel: #ffffff;
+   --nt-panel-2: #f4f4f4;
+   --nt-border: #c2c2c2;
+   --nt-border-strong: #9f9f9f;
+   --nt-text: #1a1a1a;
+   --nt-muted: #5c5c5c;
+   --nt-accent: #111111;
+   --nt-accent-2: #000000;
+   --nt-shadow: rgba(0, 0, 0, 0.14);
+   --nt-hero-border: #4d4d4d;
+   --nt-hero-bg-1: #3a3a3a;
+   --nt-hero-bg-2: #2c2c2c;
+   --nt-hero-bg-3: #1f1f1f;
+   --nt-hero-text: #fafafa;
+   --nt-hero-topline: #d8d8d8;
+   --nt-hero-body: #ececec;
+   --nt-hero-grid: rgba(255, 255, 255, 0.16);
+   --nt-chip-border: rgba(236, 236, 236, 0.52);
+   --nt-chip-bg: rgba(20, 20, 20, 0.68);
+   --nt-chip-text: #f5f5f5;
+   --nt-input-bg: #ffffff;
+   --nt-focus-ring: rgba(110, 110, 110, 0.24);
+   --nt-btn-border: #8d8d8d;
+   --nt-btn-bg-1: #f8f8f8;
+   --nt-btn-bg-2: #e8e8e8;
+   --nt-btn-text: #1f1f1f;
+   --nt-primary-btn-border: #505050;
+   --nt-primary-btn-bg-1: #6d6d6d;
+   --nt-primary-btn-bg-2: #505050;
+   --nt-primary-btn-text: #ffffff;
+   --nt-stop-btn-border: #595959;
+   --nt-stop-btn-bg-1: #7a7a7a;
+   --nt-stop-btn-bg-2: #5b5b5b;
+   --nt-stop-btn-text: #ffffff;
+   --nt-ops-panel-1: #fbfbfb;
+   --nt-ops-panel-2: #efefef;
+   --nt-card-border: #b8b8b8;
+   --nt-card-bg-1: #ffffff;
+   --nt-card-bg-2: #f2f2f2;
+   --nt-metric-text: #1d1d1d;
+   --nt-meter-border: #9d9d9d;
+   --nt-meter-bg: #dadada;
+   --nt-meter-fill-1: #5e5e5e;
+   --nt-meter-fill-2: #3f3f3f;
+   --nt-spark-border: #a9a9a9;
+   --nt-spark-bg: #f8f8f8;
+   --nt-spark-stroke: #404040;
+   --nt-settings-border: rgba(120, 120, 120, 0.4);
+   --nt-settings-bg: rgba(45, 45, 45, 0.62);
+   --nt-settings-bg-hover: rgba(30, 30, 30, 0.8);
+   --nt-settings-text: #f7f7f7;
+   --nt-settings-menu-bg: #f1f1f1;
+   --nt-settings-menu-border: #9d9d9d;
+   --nt-settings-menu-shadow: rgba(0, 0, 0, 0.18);
+   --nt-settings-menu-btn: #1f1f1f;
+   --nt-settings-menu-btn-hover: #dcdcdc;
+   --nt-settings-menu-btn-active-bg: #4e4e4e;
+   --nt-settings-menu-btn-active-text: #ffffff;
  }

  @keyframes ntFadeUp {
 
  color: var(--nt-text) !important;
  font-family: "Rajdhani", "Segoe UI", sans-serif !important;
  background:
+   linear-gradient(rgba(255, 255, 255, 0.05) 1px, transparent 1px),
+   linear-gradient(90deg, rgba(255, 255, 255, 0.05) 1px, transparent 1px),
    radial-gradient(circle at 12% 10%, var(--nt-bg-radial) 0%, rgba(255, 255, 255, 0) 44%),
    linear-gradient(145deg, var(--nt-bg) 0%, var(--nt-bg-2) 100%) !important;
  background-size: 24px 24px, 24px 24px, 100% 100%, 100% 100% !important;

  background: linear-gradient(180deg, var(--nt-panel) 0%, var(--nt-panel-2) 100%) !important;
  border: 1px solid var(--nt-border) !important;
  border-radius: 8px !important;
+ box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.04), 0 12px 26px var(--nt-shadow) !important;
  }

  .app-hero {
  background: linear-gradient(132deg, var(--nt-hero-bg-1) 0%, var(--nt-hero-bg-2) 62%, var(--nt-hero-bg-3) 100%);
  color: var(--nt-hero-text);
  padding: 22px 24px 20px;
+ box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.05), 0 18px 38px rgba(0, 0, 0, 0.38);
  animation: ntFadeUp 500ms ease-out both;
  }

  .nt-settings-menu .nt-mode-btn.active {
  background: var(--nt-settings-menu-btn-active-bg) !important;
  color: var(--nt-settings-menu-btn-active-text) !important;
+ box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.2) !important;
  }

  .section-copy {

  .gradio-container button:hover {
  transform: translateY(-1px);
+ box-shadow: 0 0 0 1px rgba(255, 255, 255, 0.2), 0 10px 20px rgba(0, 0, 0, 0.35);
  filter: brightness(1.03);
  }

  background: linear-gradient(180deg, var(--nt-ops-panel-1) 0%, var(--nt-ops-panel-2) 100%);
  padding: 14px;
  animation: ntFadeUp 420ms ease-out both;
+ box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.04), 0 12px 28px rgba(0, 0, 0, 0.34);
  }

  .ops-visual-head {

  padding: 9px 10px;
  min-height: 86px;
  animation: ntFadeUp 460ms ease-out both;
+ box-shadow: inset 0 0 0 1px rgba(255, 255, 255, 0.04);
+ }
+
+ .ops-card-wide {
+   grid-column: span 2;
  }

  .ops-k {
664
  display: block;
665
  }
666
 
667
+ .ops-chip-list {
668
+ margin-top: 6px;
669
+ display: flex;
670
+ flex-wrap: wrap;
671
+ gap: 6px;
672
+ }
673
+
674
+ .ops-chip {
675
+ border-radius: 999px;
676
+ border: 1px solid var(--nt-chip-border);
677
+ background: var(--nt-chip-bg);
678
+ color: var(--nt-chip-text);
679
+ padding: 2px 8px;
680
+ font-size: 0.66rem;
681
+ text-transform: uppercase;
682
+ letter-spacing: 0.1em;
683
+ font-family: "IBM Plex Mono", "Courier New", monospace;
684
+ }
685
+
686
+ .ops-chip.off {
687
+ opacity: 0.72;
688
+ }
689
+
690
+ .ops-stage-list {
691
+ margin: 8px 0 0;
692
+ padding: 0;
693
+ list-style: none;
694
+ display: grid;
695
+ gap: 6px;
696
+ }
697
+
698
+ .ops-stage-item {
699
+ display: flex;
700
+ align-items: center;
701
+ gap: 8px;
702
+ }
703
+
704
+ .ops-stage-dot {
705
+ width: 12px;
706
+ min-width: 12px;
707
+ height: 12px;
708
+ border-radius: 999px;
709
+ border: 1px solid var(--nt-border-strong);
710
+ background: var(--nt-meter-bg);
711
+ color: transparent;
712
+ }
713
+
714
+ .ops-stage-item.done .ops-stage-dot {
715
+ background: linear-gradient(90deg, var(--nt-meter-fill-1) 0%, var(--nt-meter-fill-2) 100%);
716
+ }
717
+
718
+ .ops-stage-item.active .ops-stage-dot {
719
+ background: var(--nt-accent);
720
+ box-shadow: 0 0 0 3px rgba(255, 255, 255, 0.22);
721
+ }
722
+
723
+ .ops-stage-label {
724
+ color: var(--nt-muted);
725
+ font-size: 0.78rem;
726
+ font-family: "IBM Plex Mono", "Courier New", monospace;
727
+ }
728
+
729
+ .ops-stage-item.done .ops-stage-label,
730
+ .ops-stage-item.active .ops-stage-label {
731
+ color: var(--nt-text);
732
+ }
733
+
734
+ .ops-run-list {
735
+ margin: 8px 0 0;
736
+ padding: 0;
737
+ list-style: none;
738
+ display: grid;
739
+ gap: 8px;
740
+ }
741
+
742
+ .ops-run-item {
743
+ border: 1px solid var(--nt-card-border);
744
+ border-radius: 8px;
745
+ background: rgba(255, 255, 255, 0.03);
746
+ padding: 8px;
747
+ }
748
+
749
+ .ops-run-head {
750
+ display: flex;
751
+ justify-content: space-between;
752
+ align-items: center;
753
+ gap: 10px;
754
+ }
755
+
756
+ .ops-run-label {
757
+ color: var(--nt-text);
758
+ font-size: 0.78rem;
759
+ font-family: "IBM Plex Mono", "Courier New", monospace;
760
+ }
761
+
762
+ .ops-run-badge {
763
+ border-radius: 999px;
764
+ border: 1px solid var(--nt-chip-border);
765
+ background: var(--nt-chip-bg);
766
+ color: var(--nt-chip-text);
767
+ font-size: 0.62rem;
768
+ text-transform: uppercase;
769
+ letter-spacing: 0.08em;
770
+ padding: 2px 7px;
771
+ font-family: "IBM Plex Mono", "Courier New", monospace;
772
+ }
773
+
774
+ .ops-run-badge.ok {
775
+ border-color: #8b8b8b;
776
+ color: #f0f0f0;
777
+ }
778
+
779
+ .ops-run-badge.fail {
780
+ border-color: #6f6f6f;
781
+ color: #d8d8d8;
782
+ }
783
+
784
+ .ops-run-badge.cancel {
785
+ border-color: #7d7d7d;
786
+ color: #e2e2e2;
787
+ }
788
+
789
  .gradio-container [data-testid="footer"],
790
  .gradio-container .built-with {
791
  display: none !important;
 
841
  .ops-grid {
842
  grid-template-columns: 1fr;
843
  }
844
+
845
+ .ops-card-wide {
846
+ grid-column: auto;
847
+ }
848
  }
849
  """
850
 
 
 def now_ts() -> str:
+    return dt.datetime.now(dt.timezone.utc).strftime("%Y-%m-%d %H:%M:%S UTC")


 def append_log(lines: List[str], message: str) -> str:
     lines.append(f"[{now_ts()}] {message}")
     text = "\n".join(lines)
+    if len(text) > APP_LOG_MAX_CHARS:
+        text = text[-APP_LOG_MAX_CHARS:]
     return text


     return payload


+def _merge_log_chunk(existing: str, chunk: str, max_chars: int = APP_LOG_MAX_CHARS) -> str:
     if not chunk:
         return existing
     merged = existing

     return "--"


+def _parse_utc_stamp(value: Any) -> Optional[dt.datetime]:
+    text = str(value or "").strip()
+    if not text:
+        return None
+    try:
+        parsed = dt.datetime.strptime(text, "%Y-%m-%d %H:%M:%S UTC")
+    except ValueError:
+        return None
+    return parsed.replace(tzinfo=dt.timezone.utc)
+
+
+def _format_duration(seconds: float) -> str:
+    total = max(0, int(seconds))
+    hours, rem = divmod(total, 3600)
+    minutes, sec = divmod(rem, 60)
+    if hours > 0:
+        return f"{hours}h {minutes:02d}m {sec:02d}s"
+    return f"{minutes:02d}m {sec:02d}s"
+
+
+def _runtime_duration_text(summary: Dict[str, Any]) -> str:
+    started_at = _parse_utc_stamp(summary.get("started_at_utc"))
+    if started_at is None:
+        return "--"
+    finished_at = _parse_utc_stamp(summary.get("finished_at_utc"))
+    end_time = finished_at or dt.datetime.now(dt.timezone.utc)
+    state = "done" if finished_at is not None else "running"
+    elapsed = (end_time - started_at).total_seconds()
+    return f"{_format_duration(elapsed)} ({state})"
+
+
+def _maybe_float(value: Any) -> Optional[float]:
+    try:
+        return float(value)
+    except (TypeError, ValueError):
+        return None
+
+
+def _duration_seconds(summary: Dict[str, Any]) -> Optional[int]:
+    started_at = _parse_utc_stamp(summary.get("started_at_utc"))
+    if started_at is None:
+        return None
+    finished_at = _parse_utc_stamp(summary.get("finished_at_utc"))
+    if finished_at is None:
+        return None
+    return max(0, int((finished_at - started_at).total_seconds()))
+
+
+def _normalize_history_entry(entry: Dict[str, Any]) -> Optional[Dict[str, Any]]:
+    run_label = str(entry.get("run_label") or "").strip()
+    result = str(entry.get("result") or "").strip().lower()
+    if not run_label or not result:
+        return None
+    evaluation = _as_dict(entry.get("evaluation"))
+    quality_gate = _as_dict(entry.get("quality_gate"))
+    push_report = _as_dict(entry.get("push"))
+    return {
+        "run_label": run_label,
+        "result": result,
+        "started_at_utc": str(entry.get("started_at_utc") or "").strip(),
+        "finished_at_utc": str(entry.get("finished_at_utc") or "").strip(),
+        "duration_seconds": _safe_int(entry.get("duration_seconds"), -1),
+        "compute_mode": str(entry.get("compute_mode") or "").strip(),
+        "evaluation": {
+            "pass_at_1": _maybe_float(evaluation.get("pass_at_1")),
+            "pass_at_k": _maybe_float(evaluation.get("pass_at_k")),
+            "evaluated_rows": _safe_int(evaluation.get("evaluated_rows"), 0),
+        },
+        "quality_gate": {
+            "enabled": bool(quality_gate.get("enabled")),
+            "passed": quality_gate.get("passed"),
+        },
+        "push": {
+            "requested": bool(push_report.get("requested")),
+            "performed": bool(push_report.get("performed")),
+        },
+        "error_message": str(entry.get("error_message") or "").strip(),
+    }
+
+
+def _read_history_entries() -> List[Dict[str, Any]]:
+    if not RUN_HISTORY_PATH.exists():
+        return []
+    try:
+        payload = json.loads(RUN_HISTORY_PATH.read_text(encoding="utf-8"))
+    except Exception:
+        return []
+    if not isinstance(payload, list):
+        return []
+    normalized: List[Dict[str, Any]] = []
+    for item in payload:
+        if not isinstance(item, dict):
+            continue
+        parsed = _normalize_history_entry(item)
+        if parsed is not None:
+            normalized.append(parsed)
+    return normalized
+
+
+def load_run_history(limit: int = RECENT_RUNS_VISUAL_LIMIT) -> List[Dict[str, Any]]:
+    normalized_limit = max(1, _safe_int(limit, RECENT_RUNS_VISUAL_LIMIT))
+    return _read_history_entries()[:normalized_limit]
+
+
+def _build_run_history_entry(summary: Dict[str, Any]) -> Dict[str, Any]:
+    evaluation = _as_dict(summary.get("evaluation"))
+    quality_gate = _as_dict(summary.get("quality_gate"))
+    push_report = _as_dict(summary.get("push"))
+    error_payload = _as_dict(summary.get("error"))
+    return {
+        "run_label": str(summary.get("run_label") or "").strip(),
+        "result": str(summary.get("result") or "").strip().lower(),
+        "started_at_utc": str(summary.get("started_at_utc") or "").strip(),
+        "finished_at_utc": str(summary.get("finished_at_utc") or "").strip(),
+        "duration_seconds": _duration_seconds(summary),
+        "compute_mode": str(summary.get("compute_mode") or "").strip(),
+        "evaluation": {
+            "pass_at_1": _maybe_float(evaluation.get("pass_at_1")),
+            "pass_at_k": _maybe_float(evaluation.get("pass_at_k")),
+            "evaluated_rows": _safe_int(evaluation.get("evaluated_rows"), 0),
+        },
+        "quality_gate": {
+            "enabled": bool(quality_gate.get("enabled")),
+            "passed": quality_gate.get("passed"),
+        },
+        "push": {
+            "requested": bool(push_report.get("requested")),
+            "performed": bool(push_report.get("performed")),
+        },
+        "error_message": str(error_payload.get("message") or "").strip()[:280],
+    }
+
+
+def persist_run_artifacts(summary: Dict[str, Any]) -> Optional[str]:
+    run_label = str(summary.get("run_label") or "").strip()
+    result = str(summary.get("result") or "").strip().lower()
+    if not run_label or not result or result == "busy":
+        return None
+
+    try:
+        ensure_workspace()
+        record_path = RUN_RECORDS_DIR / f"{run_label}.json"
+        record_payload = dict(summary)
+        record_payload["recorded_at_utc"] = now_ts()
+        record_path.write_text(json.dumps(record_payload, ensure_ascii=True, indent=2), encoding="utf-8")
+
+        entry = _build_run_history_entry(summary)
+        history = _read_history_entries()
+        history = [item for item in history if str(item.get("run_label") or "").strip() != run_label]
+        history.insert(0, entry)
+        history = history[:RUN_HISTORY_LIMIT]
+        tmp_path = RUN_HISTORY_PATH.with_suffix(".json.tmp")
+        tmp_path.write_text(json.dumps(history, ensure_ascii=True, indent=2), encoding="utf-8")
+        tmp_path.replace(RUN_HISTORY_PATH)
+        return None
+    except Exception as exc:
+        return f"Warning: run artifact persistence failed: {type(exc).__name__}: {exc}"
+
+
+def _refresh_recent_runs(summary: Dict[str, Any], log_lines: List[str]) -> None:
+    warning = persist_run_artifacts(summary)
+    if warning:
+        append_log(log_lines, warning)
+    summary["recent_runs"] = load_run_history(limit=RECENT_RUNS_VISUAL_LIMIT)
+
+
+def _run_result_badge_class(result_text: str) -> str:
+    normalized = (result_text or "").strip().lower()
+    if normalized in {"completed", "preflight_passed"}:
+        return "ok"
+    if normalized in {"failed", "error"}:
+        return "fail"
+    if normalized in {"cancelled", "canceled"}:
+        return "cancel"
+    return "neutral"
+
+
+def _build_recent_runs_panel(summary: Dict[str, Any]) -> str:
+    raw_runs = summary.get("recent_runs")
+    if not isinstance(raw_runs, list) or not raw_runs:
+        raw_runs = load_run_history(limit=RECENT_RUNS_VISUAL_LIMIT)
+    entries = [item for item in raw_runs if isinstance(item, dict)]
+    if not entries:
+        return "<div class='ops-v-small ops-empty'>No completed cycles recorded yet.</div>"
+
+    lines: List[str] = []
+    for entry in entries[:RECENT_RUNS_VISUAL_LIMIT]:
+        run_label = html.escape(str(entry.get("run_label") or "--"))
+        result_text = str(entry.get("result") or "unknown").replace("_", " ").strip().lower()
+        badge_cls = _run_result_badge_class(result_text)
+        badge_label = html.escape(result_text)
+        evaluation = _as_dict(entry.get("evaluation"))
+        pass_1 = _fmt_pct(evaluation.get("pass_at_1"))
+        pass_k = _fmt_pct(evaluation.get("pass_at_k"))
+        rows = _safe_int(evaluation.get("evaluated_rows"), 0)
+        duration_seconds = _safe_int(entry.get("duration_seconds"), -1)
+        duration_text = "--" if duration_seconds < 0 else _format_duration(duration_seconds)
+        finished_at = html.escape(str(entry.get("finished_at_utc") or "--"))
+        lines.append(
+            "<li class='ops-run-item'>"
+            "<div class='ops-run-head'>"
+            f"<span class='ops-run-label'>{run_label}</span>"
+            f"<span class='ops-run-badge {badge_cls}'>{badge_label}</span>"
+            "</div>"
+            f"<div class='ops-v-small'>duration {html.escape(duration_text)} | pass@1 {pass_1} | "
+            f"pass@k {pass_k} | rows {rows}</div>"
+            f"<div class='ops-v-small'>finished {finished_at}</div>"
+            "</li>"
+        )
+    return f"<ul class='ops-run-list'>{''.join(lines)}</ul>"
+
+
+def _build_artifact_index(summary: Dict[str, Any]) -> str:
+    runtime_cfg = str(summary.get("runtime_config") or "").strip()
+    training_summary_path = str(summary.get("training_summary_path") or "").strip()
+    evaluation = _as_dict(summary.get("evaluation"))
+    eval_report_path = str(evaluation.get("report_path") or "").strip()
+    adapter_path = str(TRAIN_OUTPUT_DIR / "final_adapter")
+
+    def _display_path(path_text: str) -> str:
+        if not path_text:
+            return "--"
+        return html.escape(path_text)
+
+    return (
+        "<div class='ops-v-small'>"
+        f"config: {_display_path(runtime_cfg)}"
+        "</div>"
+        "<div class='ops-v-small'>"
+        f"training summary: {_display_path(training_summary_path)}"
+        "</div>"
+        "<div class='ops-v-small'>"
+        f"eval report: {_display_path(eval_report_path)}"
+        "</div>"
+        "<div class='ops-v-small'>"
+        f"adapter dir: {_display_path(adapter_path)}"
+        "</div>"
+    )
+
+
 def _extract_loss_values(log_text: str, limit: int = 48) -> List[float]:
     losses: List[float] = []
     for line in log_text.splitlines():
     }


+def _build_stage_timeline(summary: Dict[str, Any], stage_meta: Dict[str, Any]) -> str:
+    start_stage = max(1, _safe_int(stage_meta.get("start_stage"), 1))
+    stage_count = max(1, _safe_int(stage_meta.get("stage_count"), TEMPLATE_STAGE_COUNT))
+    completed = max(0, _safe_int(stage_meta.get("completed"), 0))
+    active_stage = _safe_int(stage_meta.get("active_stage"), 0)
+
+    stage_cfg = TEMPLATE_CFG.get("stages", [])
+    if not isinstance(stage_cfg, list):
+        stage_cfg = []
+    lines: List[str] = []
+
+    for stage_index in range(start_stage, start_stage + stage_count):
+        stage_name = f"stage_{stage_index:02d}"
+        if 0 <= stage_index - 1 < len(stage_cfg):
+            stage_data = stage_cfg[stage_index - 1]
+            if isinstance(stage_data, dict):
+                candidate = str(stage_data.get("name") or "").strip()
+                if candidate:
+                    stage_name = candidate
+
+        relative = stage_index - start_stage + 1
+        if relative <= completed:
+            cls = "done"
+            marker = "done"
+        elif active_stage == stage_index:
+            cls = "active"
+            marker = "active"
+        else:
+            cls = "pending"
+            marker = "pending"
+
+        lines.append(
+            f"<li class='ops-stage-item {cls}'><span class='ops-stage-dot'>{marker}</span>"
+            f"<span class='ops-stage-label'>{stage_index}. {html.escape(stage_name)}</span></li>"
+        )
+
+    if not lines:
+        return "<div class='ops-v-small ops-empty'>No stage layout available.</div>"
+    return f"<ul class='ops-stage-list'>{''.join(lines)}</ul>"
+
+
 def render_ops_visual(summary: Dict[str, Any], status_text: str, log_text: str) -> str:
     safe_summary = _as_dict(summary)
     runtime = _as_dict(safe_summary.get("runtime"))
     quality_gate = _as_dict(safe_summary.get("quality_gate"))
     evaluation = _as_dict(safe_summary.get("evaluation"))
     push_report = _as_dict(safe_summary.get("push"))
+    error_payload = _as_dict(safe_summary.get("error"))

     run_label = html.escape(str(safe_summary.get("run_label") or "not-started"))
     status_value = html.escape(status_text or "Idle")
     runtime_mode = "GPU READY" if runtime.get("cuda_available") else "CPU FALLBACK"
+    compute_mode = str(safe_summary.get("compute_mode") or "").strip().lower()
+    if compute_mode == "gpu":
+        runtime_mode = "GPU TRAINING"
+    elif compute_mode == "cpu_fallback":
+        runtime_mode = "CPU FALLBACK"
     runtime_mode = html.escape(runtime_mode)
     device_count = _safe_int(runtime.get("cuda_device_count"), 0)
+    duration_text = html.escape(_runtime_duration_text(safe_summary))
+    started_at = html.escape(str(safe_summary.get("started_at_utc") or "--"))

     gate_enabled = bool(quality_gate.get("enabled"))
     gate_passed = quality_gate.get("passed")

     pass_k = _fmt_pct(evaluation.get("pass_at_k"))
     pass_1 = _fmt_pct(evaluation.get("pass_at_1"))
     exact_k = _fmt_pct(evaluation.get("exact_at_k"))
+    evaluated_rows = _safe_int(evaluation.get("evaluated_rows"), 0)

     push_state = "Pending"
     if push_report:

     else:
         push_state = "Blocked"

+    profile_flags = [
+        ("autonomous", bool(safe_summary.get("autonomous_mode"))),
+        ("continuous", "Cycle" in (status_text or "")),
+        ("preflight", bool(safe_summary.get("preflight_only"))),
+        ("post_eval", bool(safe_summary.get("run_eval"))),
+        ("quality_gate", bool(safe_summary.get("enforce_quality_gate"))),
+        ("push", bool(safe_summary.get("push_to_hub"))),
+    ]
+    profile_chips = "".join(
+        f"<span class='ops-chip {'on' if enabled else 'off'}'>{name}:{'on' if enabled else 'off'}</span>"
+        for name, enabled in profile_flags
+    )
+    stage_timeline = _build_stage_timeline(safe_summary, stage_meta)
+    recent_runs = _build_recent_runs_panel(safe_summary)
+    artifact_index = _build_artifact_index(safe_summary)
+
+    error_message = str(error_payload.get("message") or "").strip()
+    if error_message:
+        if len(error_message) > 130:
+            error_message = error_message[:127] + "..."
+        error_text = html.escape(error_message)
+    else:
+        error_text = "none"
+
     return f"""
     <div class="ops-visual">
       <div class="ops-visual-head">

         <div class="ops-v">{runtime_mode}</div>
         <div class="ops-v-small">cuda devices: {device_count}</div>
       </div>
+      <div class="ops-card">
+        <div class="ops-k">Runtime Window</div>
+        <div class="ops-v">{duration_text}</div>
+        <div class="ops-v-small">start: {started_at}</div>
+      </div>
       <div class="ops-card">
         <div class="ops-k">Stage Progress</div>
         <div class="ops-v">{stage_meta['completed']} / {stage_meta['stage_count']}</div>

       <div class="ops-card">
         <div class="ops-k">Eval pass@k</div>
         <div class="ops-v">{pass_k}</div>
+        <div class="ops-v-small">pass@1 {pass_1} | exact@k {exact_k} | rows {evaluated_rows}</div>
       </div>
       <div class="ops-card">
+        <div class="ops-k">Run Profile</div>
+        <div class="ops-chip-list">{profile_chips}</div>
+        <div class="ops-v-small">result: {html.escape(str(safe_summary.get('result') or 'pending'))}</div>
+      </div>
+      <div class="ops-card">
+        <div class="ops-k">Last Error</div>
+        <div class="ops-v-small">{error_text}</div>
+        <div class="ops-v-small">status: {status_value}</div>
+      </div>
+      <div class="ops-card ops-card-wide">
+        <div class="ops-k">Stage Timeline</div>
+        {stage_timeline}
+      </div>
+      <div class="ops-card ops-card-wide">
+        <div class="ops-k">Recent Runs</div>
+        {recent_runs}
+      </div>
+      <div class="ops-card ops-card-wide">
+        <div class="ops-k">Artifact Index</div>
+        {artifact_index}
+      </div>
+      <div class="ops-card ops-card-wide">
         <div class="ops-k">Loss Stream</div>
         {sparkline_html}
       </div>
     return value


+def validate_stage_window(stage_start: int, stage_count: int) -> Tuple[int, int]:
+    if stage_start < 1:
+        raise ValueError("Start stage must be >= 1.")
+    if stage_start > TEMPLATE_STAGE_COUNT:
+        raise ValueError(f"Start stage must be <= {TEMPLATE_STAGE_COUNT}.")
+    if stage_count < 1:
+        raise ValueError("How many stages must be >= 1.")
+    max_count = TEMPLATE_STAGE_COUNT - stage_start + 1
+    if stage_count > max_count:
+        raise ValueError(f"Stage count exceeds available stages from start stage (max {max_count}).")
+    return stage_start, stage_count
+
+
 def ensure_workspace() -> None:
     DATA_DIR.mkdir(parents=True, exist_ok=True)
     RUNTIME_DIR.mkdir(parents=True, exist_ok=True)
+    HF_HOME_DIR.mkdir(parents=True, exist_ok=True)
+    HF_DATASETS_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    HF_HUB_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    RUN_RECORDS_DIR.mkdir(parents=True, exist_ok=True)


 def run_runtime_snapshot() -> Dict[str, Any]:

 ) -> Generator[Tuple[str, str, str], None, None]:
     log_lines: List[str] = []
     summary: Dict[str, Any] = {}
+    run_label = dt.datetime.now(dt.timezone.utc).strftime("run-%Y%m%d-%H%M%S")

     if not begin_run(run_label):
         append_log(log_lines, "A run is already in progress. Wait for it to finish or click Stop.")

         force_redownload = False
         preflight_only = False

+    stage_start, stage_count = validate_stage_window(stage_start, stage_count)

     if eval_k < 1:
         raise ValueError("Eval K must be >= 1.")
     if eval_samples < 1:

             "force_redownload": bool(force_redownload),
             "preflight_only": bool(preflight_only),
             "runtime": runtime,
+            "recent_runs": load_run_history(limit=RECENT_RUNS_VISUAL_LIMIT),
         }
     )

     env.pop("HF_TOKEN", None)
     env.pop("HUGGINGFACE_HUB_TOKEN", None)
     env["PYTHONUNBUFFERED"] = "1"
+    env.setdefault("HF_HOME", str(HF_HOME_DIR))
+    env.setdefault("HF_DATASETS_CACHE", str(HF_DATASETS_CACHE_DIR))
+    env.setdefault("HUGGINGFACE_HUB_CACHE", str(HF_HUB_CACHE_DIR))

     train_cmd = [
         sys.executable,

         summary["result"] = "cancelled"
         summary["finished_at_utc"] = now_ts()
         append_log(log_lines, "Run cancelled by user.")
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Cancelled", summary_text(summary)
         return

         summary["result"] = "failed"
         summary["failure_stage"] = "training"
         summary["finished_at_utc"] = now_ts()
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Failed", summary_text(summary)
         return

         summary["result"] = "preflight_passed"
         summary["finished_at_utc"] = now_ts()
         append_log(log_lines, "Validation mode completed successfully.")
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Preflight complete", summary_text(summary)
         return

         summary["result"] = "cancelled"
         summary["finished_at_utc"] = now_ts()
         append_log(log_lines, "Run cancelled by user.")
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Cancelled", summary_text(summary)
         return

         summary["result"] = "failed"
         summary["failure_stage"] = "evaluation"
         summary["finished_at_utc"] = now_ts()
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Failed", summary_text(summary)
         return

         summary["result"] = "completed"
         summary["finished_at_utc"] = now_ts()
         append_log(log_lines, "Pipeline completed.")
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Completed", summary_text(summary)
     except Exception as exc:
         cancelled = is_cancel_requested() or str(exc) == "Run cancelled by user."
+        trace = traceback.format_exc()
         summary["result"] = "cancelled" if cancelled else "failed"
+        summary["error"] = {
+            "type": type(exc).__name__,
+            "message": str(exc),
+            "traceback": trace[-12000:],
+        }
         summary["finished_at_utc"] = now_ts()
         append_log(
             log_lines,
             f"Pipeline {'cancelled' if cancelled else 'failed'}: {type(exc).__name__}: {exc}",
         )
+        if trace.strip():
+            append_log(log_lines, trace.rstrip())
+        _refresh_recent_runs(summary, log_lines)
         yield "\n".join(log_lines), "Cancelled" if cancelled else "Failed", summary_text(summary)
     finally:
         finish_run()

     preflight_only: bool,
 ) -> Generator[Tuple[str, str, str], None, None]:
     cycle_index = 1
+    consecutive_failures = 0
     continuous_mode = bool(continuous_mode)
     if preflight_only and continuous_mode:
         continuous_mode = False

             yield compose_ops_console(session_logs, final_summary_json), stop_status, final_visual
             break

+        if final_result in {"completed", "preflight_passed"}:
+            consecutive_failures = 0
+        else:
+            consecutive_failures += 1
+            if consecutive_failures >= CONTINUOUS_MAX_CONSECUTIVE_FAILURES:
+                session_logs = _merge_log_chunk(
+                    session_logs,
+                    f"[{now_ts()}] Continuous mode halted after {consecutive_failures} consecutive non-success cycles.",
+                )
+                stop_status = f"Cycle {cycle_index}: halted"
+                yield compose_ops_console(session_logs, final_summary_json), stop_status, final_visual
+                break
+
         session_logs = _merge_log_chunk(
             session_logs,
             f"[{now_ts()}] Continuous mode: cycle {cycle_index} finished with result="

         )
         restart_status = f"Cycle {cycle_index}: restarting"
         yield compose_ops_console(session_logs, final_summary_json), restart_status, final_visual
+        if CONTINUOUS_RESTART_DELAY_SECONDS > 0:
+            cooldown_s = CONTINUOUS_RESTART_DELAY_SECONDS
+            deadline = time.monotonic() + cooldown_s
+            session_logs = _merge_log_chunk(
+                session_logs,
+                f"[{now_ts()}] Continuous mode cooldown: waiting {cooldown_s}s before next cycle.",
+            )
+            while True:
+                remaining = int(max(0, round(deadline - time.monotonic())))
+                cooldown_status = f"Cycle {cycle_index}: cooldown {remaining}s"
+                yield compose_ops_console(session_logs, final_summary_json), cooldown_status, final_visual
+                if remaining <= 0:
+                    break
+                if is_cancel_requested():
+                    session_logs = _merge_log_chunk(
+                        session_logs,
+                        f"[{now_ts()}] Continuous mode cooldown interrupted by user cancellation.",
+                    )
+                    stop_status = f"Cycle {cycle_index}: stopped"
+                    yield compose_ops_console(session_logs, final_summary_json), stop_status, final_visual
+                    return
+                time.sleep(1)
         cycle_index += 1
requirements.txt CHANGED
@@ -8,3 +8,4 @@ bitsandbytes>=0.45.0,<1
 huggingface_hub>=0.26.0,<1
 pyyaml>=6.0.2,<7
 sentencepiece>=0.2.0,<1
+protobuf>=4.25.0,<6
scripts/eval_sota.py CHANGED
@@ -326,7 +326,13 @@ def load_model_and_tokenizer(
     adapter_path: Optional[Path],
     trust_remote_code: bool,
 ) -> Tuple[Any, AutoTokenizer]:
-    tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=trust_remote_code, use_fast=True)
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=trust_remote_code, use_fast=True)
+    except ImportError as exc:
+        if "protobuf" not in str(exc).lower():
+            raise
+        print("protobuf missing for fast tokenizer. Retrying with use_fast=False.")
+        tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=trust_remote_code, use_fast=False)
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token
     if tokenizer.pad_token is None:
scripts/preflight_check.py ADDED
@@ -0,0 +1,204 @@
+#!/usr/bin/env python3
+"""Production preflight checks for the Math Conjecture Trainer Space."""
+
+from __future__ import annotations
+
+import argparse
+import importlib
+import json
+import os
+import subprocess
+import sys
+from dataclasses import dataclass
+from pathlib import Path
+from typing import Any, Callable, Dict, List
+
+import yaml
+
+
+ROOT = Path(__file__).resolve().parents[1]
+CONFIG_PATH = ROOT / "configs" / "deepseek_math_sota.yaml"
+HF_HOME_DIR = ROOT / "workspace" / ".hf_home"
+HF_DATASETS_CACHE_DIR = HF_HOME_DIR / "datasets"
+HF_HUB_CACHE_DIR = HF_HOME_DIR / "hub"
+
+
+@dataclass
+class CheckResult:
+    name: str
+    ok: bool
+    detail: str
+
+
+def check_required_files() -> str:
+    required = [
+        ROOT / "app.py",
+        ROOT / "scripts" / "train_sota.py",
+        ROOT / "scripts" / "eval_sota.py",
+        CONFIG_PATH,
+        ROOT / "requirements.txt",
+    ]
+    missing = [str(path) for path in required if not path.exists()]
+    if missing:
+        raise FileNotFoundError("Missing required files: " + ", ".join(missing))
+    return f"{len(required)} required files present."
+
+
+def check_config_shape() -> str:
+    cfg = yaml.safe_load(CONFIG_PATH.read_text(encoding="utf-8"))
+    if not isinstance(cfg, dict):
+        raise ValueError("Config root must be a mapping.")
+    required_sections = ("model", "data", "stages")
+    for section in required_sections:
+        if section not in cfg:
+            raise ValueError(f"Missing config section: {section}")
+    stages = cfg.get("stages")
+    if not isinstance(stages, list) or not stages:
+        raise ValueError("Config must contain at least one stage.")
+    return f"Config valid with {len(stages)} stage(s)."
+
+
+def check_python_imports() -> str:
+    modules = [
+        "gradio",
+        "torch",
+        "yaml",
+        "huggingface_hub",
+        "datasets",
+        "transformers",
+        "peft",
+    ]
+    versions: Dict[str, str] = {}
+    for module_name in modules:
+        mod = importlib.import_module(module_name)
+        versions[module_name] = str(getattr(mod, "__version__", "unknown"))
+    return "Imports OK: " + ", ".join(f"{k}={v}" for k, v in versions.items())
+
+
+def check_module_integrity() -> str:
+    root_str = str(ROOT)
+    if root_str not in sys.path:
+        sys.path.insert(0, root_str)
+
+    app = importlib.import_module("app")
+    train_sota = importlib.import_module("scripts.train_sota")
+    eval_sota = importlib.import_module("scripts.eval_sota")
+
+    runtime = app.run_runtime_snapshot()
+    if not isinstance(runtime, dict):
+        raise ValueError("Runtime snapshot is not a dictionary.")
+    if "python" not in runtime or "torch" not in runtime:
+        raise ValueError("Runtime snapshot missing expected keys.")
+
+    train_cfg = train_sota.load_config(CONFIG_PATH)
+    eval_cfg = eval_sota.load_config(CONFIG_PATH)
+    if not isinstance(train_cfg, dict) or not isinstance(eval_cfg, dict):
+        raise ValueError("Config loaders did not return dictionaries.")
+    return "App/train/eval module imports and config loaders are healthy."
+
+
+def run_optional_training_dry_run(timeout_seconds: int) -> str:
+    HF_HOME_DIR.mkdir(parents=True, exist_ok=True)
+    HF_DATASETS_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    HF_HUB_CACHE_DIR.mkdir(parents=True, exist_ok=True)
+    env = os.environ.copy()
+    env.setdefault("HF_HOME", str(HF_HOME_DIR))
+    env.setdefault("HF_DATASETS_CACHE", str(HF_DATASETS_CACHE_DIR))
+    env.setdefault("HUGGINGFACE_HUB_CACHE", str(HF_HUB_CACHE_DIR))
+
+    cmd = [
+        sys.executable,
+        str(ROOT / "scripts" / "train_sota.py"),
+        "--config",
+        str(CONFIG_PATH),
+        "--start-stage",
+        "1",
+        "--max-stages",
+        "1",
+        "--dry-run",
+    ]
+    completed = subprocess.run(
+        cmd,
+        cwd=str(ROOT),
+        check=False,
+        env=env,
+        stdout=subprocess.PIPE,
+        stderr=subprocess.STDOUT,
+        text=True,
+        timeout=timeout_seconds,
+    )
+    if completed.returncode != 0:
+        tail = "\n".join((completed.stdout or "").splitlines()[-30:])
+        raise RuntimeError(f"Dry-run failed with exit code {completed.returncode}.\n{tail}")
+    return "Optional training dry-run passed."
+
+
+def run_checks(checks: List[tuple[str, Callable[[], str]]]) -> List[CheckResult]:
+    out: List[CheckResult] = []
+    for name, fn in checks:
+        try:
+            detail = fn()
+            out.append(CheckResult(name=name, ok=True, detail=detail))
+        except Exception as exc:
+            out.append(CheckResult(name=name, ok=False, detail=f"{type(exc).__name__}: {exc}"))
+    return out
+
+
+def parse_args() -> argparse.Namespace:
148
+ parser = argparse.ArgumentParser(description="Run production preflight checks for the Space trainer.")
149
+ parser.add_argument(
150
+ "--run-training-dry-run",
151
+ action="store_true",
152
+ help="Also execute scripts/train_sota.py in --dry-run mode (stage 1 only).",
153
+ )
154
+ parser.add_argument(
155
+ "--dry-run-timeout-seconds",
156
+ type=int,
157
+ default=1800,
158
+ help="Timeout for optional training dry-run step.",
159
+ )
160
+ parser.add_argument(
161
+ "--json",
162
+ action="store_true",
163
+ help="Print machine-readable JSON output.",
164
+ )
165
+ return parser.parse_args()
166
+
167
+
168
+ def main() -> None:
169
+ args = parse_args()
170
+ checks: List[tuple[str, Callable[[], str]]] = [
171
+ ("required_files", check_required_files),
172
+ ("config_shape", check_config_shape),
173
+ ("python_imports", check_python_imports),
174
+ ("module_integrity", check_module_integrity),
175
+ ]
176
+ if args.run_training_dry_run:
177
+ checks.append(
178
+ (
179
+ "training_dry_run",
180
+ lambda: run_optional_training_dry_run(timeout_seconds=max(30, args.dry_run_timeout_seconds)),
181
+ )
182
+ )
183
+
184
+ results = run_checks(checks)
185
+ ok = all(item.ok for item in results)
186
+ payload: Dict[str, Any] = {
187
+ "ok": ok,
188
+ "checks": [{"name": item.name, "ok": item.ok, "detail": item.detail} for item in results],
189
+ }
190
+
191
+ if args.json:
192
+ print(json.dumps(payload, ensure_ascii=True, indent=2))
193
+ else:
194
+ for item in results:
195
+ status = "PASS" if item.ok else "FAIL"
196
+ print(f"[{status}] {item.name}: {item.detail}")
197
+ print("Overall:", "PASS" if ok else "FAIL")
198
+
199
+ if not ok:
200
+ raise SystemExit(1)
201
+
202
+
203
+ if __name__ == "__main__":
204
+ main()
scripts/train_sota.py CHANGED
@@ -430,11 +430,22 @@ def build_tokenizer(model_cfg: Dict[str, Any]) -> AutoTokenizer:
     base_model = as_text(model_cfg.get("base_model"))
     if not base_model:
         raise ValueError("model.base_model is required.")
-    tokenizer = AutoTokenizer.from_pretrained(
-        base_model,
-        trust_remote_code=bool(model_cfg.get("trust_remote_code", False)),
-        use_fast=True,
-    )
+    trust_remote_code = bool(model_cfg.get("trust_remote_code", False))
+    try:
+        tokenizer = AutoTokenizer.from_pretrained(
+            base_model,
+            trust_remote_code=trust_remote_code,
+            use_fast=True,
+        )
+    except ImportError as exc:
+        if "protobuf" not in str(exc).lower():
+            raise
+        print("protobuf missing for fast tokenizer. Retrying with use_fast=False.")
+        tokenizer = AutoTokenizer.from_pretrained(
+            base_model,
+            trust_remote_code=trust_remote_code,
+            use_fast=False,
+        )
     if tokenizer.pad_token is None:
         tokenizer.pad_token = tokenizer.eos_token or tokenizer.unk_token
     if tokenizer.pad_token is None:
@@ -512,6 +523,12 @@ def build_model_and_tokenizer(model_cfg: Dict[str, Any], training_defaults: Dict
     return model, tokenizer
 
 
+class DryRunTokenizerFallback:
+    """Minimal tokenizer-like object for dry-run prompt-format checks."""
+
+    chat_template = None
+
+
 class WeightedLossCollator:
     def __init__(self, tokenizer: AutoTokenizer, model: Any) -> None:
         self.base = DataCollatorForSeq2Seq(
@@ -909,7 +926,14 @@ def main() -> None:
         raise ValueError("Hub push requested but repo_id is missing.")
 
     if args.dry_run:
-        tokenizer = build_tokenizer(cfg["model"])
+        try:
+            tokenizer = build_tokenizer(cfg["model"])
+        except Exception as exc:
+            print(
+                "Dry-run warning: tokenizer load failed; using formatting-only fallback tokenizer. "
+                f"Reason: {type(exc).__name__}: {exc}"
+            )
+            tokenizer = DryRunTokenizerFallback()
         model = None
     else:
         model, tokenizer = build_model_and_tokenizer(cfg["model"], cfg.get("training_defaults", {}))
tests/test_core_utils.py ADDED
@@ -0,0 +1,246 @@
+#!/usr/bin/env python3
+"""Production safety tests for key pipeline utilities."""
+
+from __future__ import annotations
+
+import json
+import tempfile
+import unittest
+from unittest import mock
+from pathlib import Path
+
+import app
+from scripts import eval_sota
+from scripts import train_sota
+
+
+class AppUtilityTests(unittest.TestCase):
+    def test_validate_repo_id_accepts_valid(self) -> None:
+        self.assertEqual(
+            app.validate_repo_id("NorthernTribe-Research/math_trainer", "Model repo"),
+            "NorthernTribe-Research/math_trainer",
+        )
+
+    def test_validate_repo_id_rejects_invalid(self) -> None:
+        with self.assertRaises(ValueError):
+            app.validate_repo_id("invalid repo id", "Model repo")
+
+    def test_merge_log_chunk_truncates(self) -> None:
+        merged = app._merge_log_chunk("a" * 9, "b" * 9, max_chars=10)
+        self.assertEqual(len(merged), 10)
+        self.assertTrue(merged.endswith("b" * 9))
+
+    def test_build_stage_timeline_returns_list_markup(self) -> None:
+        stage_meta = {"start_stage": 1, "stage_count": 2, "completed": 1, "active_stage": 2}
+        html = app._build_stage_timeline({}, stage_meta)
+        self.assertIn("ops-stage-list", html)
+        self.assertIn("ops-stage-item", html)
+
+    def test_validate_stage_window_rejects_overflow(self) -> None:
+        with self.assertRaises(ValueError):
+            app.validate_stage_window(app.TEMPLATE_STAGE_COUNT, 2)
+
+    def test_build_recent_runs_panel_markup(self) -> None:
+        summary = {
+            "recent_runs": [
+                {
+                    "run_label": "run-20260101-000000",
+                    "result": "completed",
+                    "duration_seconds": 42,
+                    "finished_at_utc": "2026-01-01 00:00:42 UTC",
+                    "evaluation": {"pass_at_1": 0.11, "pass_at_k": 0.27, "evaluated_rows": 128},
+                }
+            ]
+        }
+        html = app._build_recent_runs_panel(summary)
+        self.assertIn("ops-run-list", html)
+        self.assertIn("run-20260101-000000", html)
+        self.assertIn("completed", html)
+
+    def test_persist_run_artifacts_updates_history(self) -> None:
+        with tempfile.TemporaryDirectory() as tmpdir:
+            history_path = Path(tmpdir) / "run_history.json"
+            records_dir = Path(tmpdir) / "run_records"
+            summary = {
+                "run_label": "run-20260102-030405",
+                "result": "completed",
+                "started_at_utc": "2026-01-02 03:04:05 UTC",
+                "finished_at_utc": "2026-01-02 03:04:35 UTC",
+                "evaluation": {"pass_at_1": 0.1, "pass_at_k": 0.2, "evaluated_rows": 64},
+            }
+
+            with mock.patch.object(app, "RUN_HISTORY_PATH", history_path):
+                with mock.patch.object(app, "RUN_RECORDS_DIR", records_dir):
+                    warning = app.persist_run_artifacts(summary)
+
+            self.assertIsNone(warning)
+            self.assertTrue(history_path.exists())
+            payload = json.loads(history_path.read_text(encoding="utf-8"))
+            self.assertEqual(payload[0]["run_label"], "run-20260102-030405")
+            self.assertEqual(payload[0]["result"], "completed")
+            self.assertTrue((records_dir / "run-20260102-030405.json").exists())
+
+
+class EvalUtilityTests(unittest.TestCase):
+    def test_parse_numeric_fraction(self) -> None:
+        value = eval_sota.parse_numeric_value("3/4")
+        self.assertIsNotNone(value)
+        assert value is not None
+        self.assertAlmostEqual(value, 0.75, places=8)
+
+    def test_match_candidate_boxed(self) -> None:
+        result = eval_sota.match_candidate(r"\boxed{42}", ["42"])
+        self.assertTrue(result["match"])
+        self.assertTrue(result["boxed"] or result["exact"])
+
+
+class TrainUtilityTests(unittest.TestCase):
+    def test_as_bool_conversions(self) -> None:
+        self.assertTrue(train_sota.as_bool("yes"))
+        self.assertFalse(train_sota.as_bool("no"))
+        self.assertTrue(train_sota.as_bool(True))
+        self.assertFalse(train_sota.as_bool(None, default=False))
+
+    def test_build_tokenizer_falls_back_when_protobuf_missing(self) -> None:
+        class DummyTokenizer:
+            def __init__(self) -> None:
+                self.pad_token = None
+                self.eos_token = "<eos>"
+                self.unk_token = "<unk>"
+
+            def add_special_tokens(self, tokens):
+                self.pad_token = tokens.get("pad_token")
+
+        calls = []
+
+        def fake_from_pretrained(*args, **kwargs):
+            calls.append(kwargs.get("use_fast"))
+            if kwargs.get("use_fast"):
+                raise ImportError("requires the protobuf library")
+            return DummyTokenizer()
+
+        with mock.patch.object(train_sota.AutoTokenizer, "from_pretrained", side_effect=fake_from_pretrained):
+            tok = train_sota.build_tokenizer({"base_model": "dummy/model", "trust_remote_code": False})
+
+        self.assertEqual(calls, [True, False])
+        self.assertEqual(tok.pad_token, "<eos>")
+
+
+class EvalTokenizerFallbackTests(unittest.TestCase):
+    def test_eval_tokenizer_falls_back_when_protobuf_missing(self) -> None:
+        class DummyTokenizer:
+            def __init__(self) -> None:
+                self.pad_token = None
+                self.eos_token = "<eos>"
+                self.unk_token = "<unk>"
+
+            def add_special_tokens(self, tokens):
+                self.pad_token = tokens.get("pad_token")
+
+        class DummyModel:
+            def eval(self):
+                return None
+
+        calls = []
+
+        def fake_tok_from_pretrained(*args, **kwargs):
+            calls.append(kwargs.get("use_fast"))
+            if kwargs.get("use_fast"):
+                raise ImportError("requires the protobuf library")
+            return DummyTokenizer()
+
+        with mock.patch.object(eval_sota.AutoTokenizer, "from_pretrained", side_effect=fake_tok_from_pretrained):
+            with mock.patch.object(eval_sota.AutoModelForCausalLM, "from_pretrained", return_value=DummyModel()):
+                model, tok = eval_sota.load_model_and_tokenizer(
+                    base_model="dummy/model",
+                    adapter_path=None,
+                    trust_remote_code=False,
+                )
+
+        self.assertIsNotNone(model)
+        self.assertEqual(calls, [True, False])
+        self.assertEqual(tok.pad_token, "<eos>")
+
+
+class ContinuousModeSafetyTests(unittest.TestCase):
+    def test_continuous_mode_halts_after_consecutive_failures(self) -> None:
+        original_max = app.CONTINUOUS_MAX_CONSECUTIVE_FAILURES
+        original_delay = app.CONTINUOUS_RESTART_DELAY_SECONDS
+        app.CONTINUOUS_MAX_CONSECUTIVE_FAILURES = 2
+        app.CONTINUOUS_RESTART_DELAY_SECONDS = 0
+        self.addCleanup(setattr, app, "CONTINUOUS_MAX_CONSECUTIVE_FAILURES", original_max)
+        self.addCleanup(setattr, app, "CONTINUOUS_RESTART_DELAY_SECONDS", original_delay)
+
+        def fake_pipeline_core(**kwargs):
+            summary = json.dumps({"result": "failed"})
+            yield "line-1", "Failed", summary
+
+        with mock.patch.object(app, "run_pipeline_core", side_effect=fake_pipeline_core):
+            outputs = list(
+                app.run_pipeline(
+                    dataset_repo_id="owner/dataset",
+                    model_repo_id="owner/model",
+                    base_model_id="model/base",
+                    autonomous_mode=False,
+                    continuous_mode=True,
+                    start_stage=1,
+                    max_stages=1,
+                    run_eval=False,
+                    eval_k=1,
+                    eval_samples=50,
+                    enforce_quality_gate=False,
+                    gate_min_pass_at_1=0.0,
+                    gate_min_pass_at_k=0.0,
+                    gate_min_rows=10,
+                    push_to_hub=False,
+                    force_redownload=False,
+                    preflight_only=False,
+                )
+            )
+
+        self.assertGreaterEqual(len(outputs), 3)
+        last_status = outputs[-1][1]
+        self.assertIn("halted", last_status.lower())
+
+    def test_continuous_mode_cooldown_stops_on_cancel(self) -> None:
+        original_max = app.CONTINUOUS_MAX_CONSECUTIVE_FAILURES
+        original_delay = app.CONTINUOUS_RESTART_DELAY_SECONDS
+        app.CONTINUOUS_MAX_CONSECUTIVE_FAILURES = 3
+        app.CONTINUOUS_RESTART_DELAY_SECONDS = 1
+        self.addCleanup(setattr, app, "CONTINUOUS_MAX_CONSECUTIVE_FAILURES", original_max)
+        self.addCleanup(setattr, app, "CONTINUOUS_RESTART_DELAY_SECONDS", original_delay)
+
+        def fake_pipeline_core(**kwargs):
+            summary = json.dumps({"result": "completed"})
+            yield "line-1", "Completed", summary
+
+        with mock.patch.object(app, "run_pipeline_core", side_effect=fake_pipeline_core):
+            with mock.patch.object(app, "is_cancel_requested", return_value=True):
+                outputs = list(
+                    app.run_pipeline(
+                        dataset_repo_id="owner/dataset",
+                        model_repo_id="owner/model",
+                        base_model_id="model/base",
+                        autonomous_mode=False,
+                        continuous_mode=True,
+                        start_stage=1,
+                        max_stages=1,
+                        run_eval=False,
+                        eval_k=1,
+                        eval_samples=50,
+                        enforce_quality_gate=False,
+                        gate_min_pass_at_1=0.0,
+                        gate_min_pass_at_k=0.0,
+                        gate_min_rows=10,
+                        push_to_hub=False,
+                        force_redownload=False,
+                        preflight_only=False,
+                    )
+                )
+
+        self.assertGreaterEqual(len(outputs), 3)
+        self.assertIn("stopped", outputs[-1][1].lower())
+
+
+if __name__ == "__main__":
+    unittest.main(verbosity=2)