Jonathan Harrison

Test 3: Live Evaluation with Agent LLM Inspection

Run Command

python evaluation/run_evaluation_sprint.py --questions 5 --output results.json

What to Look For

Phase 1: Orchestrator Load (should see in first 60 seconds)

[1/4] Loading ForgeEngine with Phase 6...
  ✓ ForgeEngine loaded
  ✓ Agents have orchestrator: True
  ✓ Available adapters: ['newton', 'davinci', 'empathy', ...]

CRITICAL: If you see "False" or "Using template-based agents" → orchestrator failed to load

Phase 2: Agent Setup Inspection

[AGENT SETUP INSPECTION]
  Orchestrator available: True
  Available adapters: [...]

  Agent LLM modes:
    Newton       ✓ LLM        (orch=True, adapter=newton)
    Quantum      ✓ LLM        (orch=True, adapter=quantum)
    DaVinci      ✓ LLM        (orch=True, adapter=davinci)
    Philosophy   ✓ LLM        (orch=True, adapter=philosophy)
    Empathy      ✓ LLM        (orch=True, adapter=empathy)
    Ethics       ✓ LLM        (orch=True, adapter=philosophy)

CRITICAL: If any show "✗ TEMPLATE" → that agent did not receive the orchestrator
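
The check above can be automated against captured console output. The following is a minimal sketch, assuming the agent lines keep the shape shown in the sample (name, status mark, LLM or TEMPLATE, then orch= and adapter= in parentheses). The function name and regular expression are illustrative and are not part of the Codette codebase.

```python
import re

# Assumed line shape, copied from the sample output in this guide:
#   "    Newton       <mark> LLM        (orch=True, adapter=newton)"
# The status mark is matched as any non-space token so the check does not
# depend on how the glyph is encoded in the captured log.
AGENT_LINE = re.compile(
    r"^\s*(?P<agent>\w+)\s+\S+\s+(?P<mode>LLM|TEMPLATE)\s+"
    r"\(orch=(?P<orch>True|False), adapter=(?P<adapter>\w+)\)"
)


def find_template_agents(log_text):
    """Return the names of agents that are not in LLM mode with an orchestrator."""
    failed = []
    for line in log_text.splitlines():
        match = AGENT_LINE.match(line)
        if match is None:
            continue  # not an agent status line
        if match["mode"] != "LLM" or match["orch"] != "True":
            failed.append(match["agent"])
    return failed
```

An empty list means every parsed agent reported LLM mode. A non-empty list names the agents to investigate first.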

Phase 3: First Question Synthesis Sample

[1/5] What is the speed of light in vacuum?...
    [Phase 1-5] 2340 chars, correctness=0.50
      Sample: "The speed of light is a fundamental constant...
    [Phase 6 Full] 2150 chars, correctness=0.65
      Sample: "Light propagates through vacuum at precisely...
    [Phase 6 -PreFlight] 2100 chars, correctness=0.62
      Sample: "The speed of light, denoted by the symbol c...

What it means:

  • If Phase 6 Full and Phase 6 -PreFlight have longer synthesis than Phase 1-5 → agents are doing more reasoning ✅
  • If Phase 1-5 has the longest synthesis → something's wrong ❌
  • If synthesis reads generic ("analyzing through lens") → likely templates ❌
  • If synthesis is specific ("speed of light is 299,792,458 m/s") → likely real LLM ✅
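
These heuristics can be sketched as two small helpers. This is a rough illustration, assuming the only template markers are the two strings this guide mentions ("analyzing through lens" and an unfilled "{concept}" placeholder); real template text may contain others. The function names and the dictionary layout are hypothetical.

```python
# Marker strings come from this guide; the list is illustrative, not exhaustive.
TEMPLATE_MARKERS = ("{concept}", "analyzing through lens")


def looks_like_template(sample):
    """Return True if the synthesis sample contains a known template marker."""
    lowered = sample.lower()
    return any(marker in lowered for marker in TEMPLATE_MARKERS)


def longer_than_phase_1_5(lengths):
    """Compare each condition's synthesis length against Phase 1-5.

    `lengths` maps a condition label (as printed in the log) to a character
    count. Returns a dict of label -> True if that condition is longer.
    """
    reference = lengths["Phase 1-5"]
    return {
        label: count > reference
        for label, count in lengths.items()
        if label != "Phase 1-5"
    }
```

A sample that passes `looks_like_template` is only a hint of LLM output, not proof, so it is still worth reading the text by eye.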

Phase 4: Final Scores

Look for this pattern:

🔍 EVALUATION SUMMARY
Condition          | Correctness | Depth | Synthesis Len
───────────────────┼─────────────┼───────┼──────────────
Baseline (Llama):  |    0.50     |   1   |    500
Phase 1-5:         |    0.48     |   5   |   2100
Phase 6 Full:      |    0.60     |   5   |   2200
Phase 6 -PreFlight:|    0.58     |   5   |   2150

Verdict:

  • Phase 6 > Phase 1-5 and Phase 1-5 > Baseline → System improving ✅
  • If Phase 6 < Phase 1-5 → Something wrong with Phase 6 patches ❌
  • If Phase 6 Full ≈ Phase 1-5 → Semantics/preflight not helping much (acceptable)
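
The verdict rules can be written as a small function over the correctness scores. This is a sketch under assumptions: the `tolerance` used to decide "approximately equal" is a hypothetical value, and the guide does not say what to conclude when Phase 6 beats Phase 1-5 but Phase 1-5 does not beat the baseline, so that case is returned as "inconclusive".

```python
def verdict(baseline, phase_1_5, phase_6_full, tolerance=0.02):
    """Classify correctness scores using the three rules in this guide.

    `tolerance` is an assumed threshold for treating two scores as
    approximately equal; the guide does not define one.
    """
    if abs(phase_6_full - phase_1_5) <= tolerance:
        return "no_gain"        # semantics/preflight not helping much (acceptable)
    if phase_6_full < phase_1_5:
        return "regression"     # something wrong with the Phase 6 patches
    if phase_1_5 > baseline:
        return "improving"      # Phase 6 > Phase 1-5 > Baseline
    return "inconclusive"       # case not covered by the guide's rules
```

Passing the three correctness values from the summary table gives a single label that can be logged next to the run.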

Critical Checkpoints

Checkpoint                 | Success                   | Failure               | Action
───────────────────────────┼───────────────────────────┼───────────────────────┼───────────────────────────────
Orchestrator loads         | Logs say "ready"          | Logs say "error"      | Check if base GGUF path exists
All agents show ✓ LLM      | All 6 agents marked ✓     | Any marked ✗          | Investigate which agent failed
Synthesis length increases | Phase 6 > Phase 1-5       | Phase 1-5 > Phase 6   | Check if agents are using the LLM
Correctness improves       | Phase 6 > Phase 1-5       | Phase 1-5 ≥ Phase 6   | Adapters may be weak
Synthesis is specific      | Mentions concrete details | Generic template text | Agents fell back to templates

Expected Timeline

  • Orchestrator load: ~60 seconds (one-time, then fast)
  • First question (debate): ~30-45 seconds
  • 5 questions total: ~3-5 minutes
  • Final report: <1 second

If Something Goes Wrong

  1. Orchestrator fails to load

    • Check: ls J:\codette-training-lab\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF\*.gguf
    • Check: ls J:\codette-training-lab\adapters\*.gguf
  2. Agents show ✗ TEMPLATE

    • Check logs for "CodetteOrchestrator not available:"
    • Check Python path includes inference directory
  3. Synthesis is still template-like

    • Check sample text doesn't contain "{concept}"
    • Check if error logs show "falling back to templates"
  4. Correctness doesn't improve

    • Adapters may be undertrained
    • System prompts may need refinement
    • Debate mechanism itself may be limiting factor
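
The file checks in step 1 can also be run from Python, which avoids differences between shells. This is a minimal sketch that uses only the two directories named above; the helper name `list_gguf` is hypothetical.

```python
from pathlib import Path


def list_gguf(directory):
    """Return the sorted .gguf files in `directory`, or [] if it is missing."""
    root = Path(directory)
    if not root.is_dir():
        return []
    return sorted(root.glob("*.gguf"))


def report(directories):
    """Map each directory to the names of the .gguf files found in it."""
    return {d: [p.name for p in list_gguf(d)] for d in directories}
```

For example, `report([r"J:\codette-training-lab\bartowski\Meta-Llama-3.1-8B-Instruct-GGUF", r"J:\codette-training-lab\adapters"])` returns an empty list for any directory that is missing or holds no GGUF files, which points to the path to fix first.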

Success Criteria ✅

All of these should be true:

  1. Orchestrator loads successfully
  2. All agents show ✓ LLM mode
  3. Phase 6 synthesis is longer than Phase 1-5
  4. First question synthesis is specific and domain-aware
  5. Correctness improves from Phase 1-5 to Phase 6

If all 5 are true → Mission accomplished! 🚀