Jonathan Harrison

Codette Benchmark Results

Generated: 2026-03-30 15:04:24

Problems: 17 | Conditions: 4 | Total evaluations: 68

1. Overall Results by Condition

| Condition | N | Composite (mean +/- std) | Depth | Diversity | Coherence | Ethics | Novelty | Grounding | Turing |
|---|---|---|---|---|---|---|---|---|---|
| SINGLE | 17 | 0.338 +/- 0.038 | 0.402 | 0.237 | 0.380 | 0.062 | 0.327 | 0.456 | 0.412 |
| MULTI | 17 | 0.632 +/- 0.040 | 0.755 | 0.969 | 0.503 | 0.336 | 0.786 | 0.604 | 0.180 |
| MEMORY | 17 | 0.636 +/- 0.036 | 0.770 | 0.956 | 0.500 | 0.340 | 0.736 | 0.599 | 0.291 |
| CODETTE | 17 | 0.652 +/- 0.042 | 0.855 | 0.994 | 0.477 | 0.391 | 0.693 | 0.622 | 0.245 |

2. Statistical Comparisons

| Comparison | Delta | Delta % | Cohen's d | t-stat | p-value | Significant |
|---|---|---|---|---|---|---|
| Multi-perspective vs single | +0.2939 | +87.0% | 7.518 | 21.918 | < 0.0001 | Yes |
| Memory augmentation vs vanilla multi | +0.0039 | +0.6% | 0.103 | 0.301 | 0.7633 | No |
| Full Codette vs memory-augmented | +0.0168 | +2.6% | 0.432 | 1.258 | 0.2082 | No |
| Full Codette vs single (total improvement) | +0.3146 | +93.1% | 7.878 | 22.968 | < 0.0001 | Yes |

Cohen's d interpretation: 0.2=small, 0.5=medium, 0.8=large
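
The thresholds above can be applied mechanically to the reported effect sizes. The following is a minimal sketch, not the benchmark's own code: the function name `interpret_cohens_d` is hypothetical, the thresholds are treated as lower bounds of each band, and the "negligible" label for values below 0.2 is an assumption that the report does not state.

```python
def interpret_cohens_d(d: float) -> str:
    """Map |d| to the conventional labels quoted in this report."""
    magnitude = abs(d)
    if magnitude >= 0.8:
        return "large"
    if magnitude >= 0.5:
        return "medium"
    if magnitude >= 0.2:
        return "small"
    return "negligible"  # assumption: label for |d| below the 0.2 threshold


# Effect sizes copied from the statistical comparison table.
reported = {
    "Multi-perspective vs single": 7.518,
    "Memory augmentation vs vanilla multi": 0.103,
    "Full Codette vs memory-augmented": 0.432,
    "Full Codette vs single (total improvement)": 7.878,
}

labels = {name: interpret_cohens_d(d) for name, d in reported.items()}
for name, label in labels.items():
    print(f"{name}: {label}")
```

Under these assumptions, the two comparisons against SINGLE fall in the "large" band, while the memory and full-Codette increments fall below "medium", in line with their non-significant p-values.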

3. Results by Problem Category

Reasoning

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.363 | 0.050 | 3 |
| MULTI | 0.614 | 0.053 | 3 |
| MEMORY | 0.628 | 0.030 | 3 |
| CODETTE | 0.637 | 0.052 | 3 |

Ethics

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.354 | 0.059 | 3 |
| MULTI | 0.632 | 0.052 | 3 |
| MEMORY | 0.616 | 0.043 | 3 |
| CODETTE | 0.638 | 0.032 | 3 |

Creative

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.345 | 0.053 | 2 |
| MULTI | 0.635 | 0.040 | 2 |
| MEMORY | 0.660 | 0.061 | 2 |
| CODETTE | 0.668 | 0.030 | 2 |

Meta

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.337 | 0.006 | 3 |
| MULTI | 0.634 | 0.054 | 3 |
| MEMORY | 0.650 | 0.036 | 3 |
| CODETTE | 0.659 | 0.037 | 3 |

Adversarial

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.329 | 0.028 | 3 |
| MULTI | 0.624 | 0.041 | 3 |
| MEMORY | 0.622 | 0.042 | 3 |
| CODETTE | 0.630 | 0.067 | 3 |

Turing

| Condition | Mean | Std | N |
|---|---|---|---|
| SINGLE | 0.302 | 0.006 | 3 |
| MULTI | 0.652 | 0.024 | 3 |
| MEMORY | 0.647 | 0.026 | 3 |
| CODETTE | 0.687 | 0.017 | 3 |
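
As a consistency check, the per-category means should pool back to the overall composite means in Section 1 when weighted by the number of problems in each category. The sketch below assumes the overall mean is a plain average over all 17 problems; the helper name `pooled_mean` is hypothetical, and all figures are copied from the tables in this report.

```python
# (mean, n) per category in the order: reasoning, ethics, creative,
# meta, adversarial, Turing. Values copied from the category tables.
category_results = {
    "SINGLE": [(0.363, 3), (0.354, 3), (0.345, 2), (0.337, 3), (0.329, 3), (0.302, 3)],
    "MULTI": [(0.614, 3), (0.632, 3), (0.635, 2), (0.634, 3), (0.624, 3), (0.652, 3)],
    "MEMORY": [(0.628, 3), (0.616, 3), (0.660, 2), (0.650, 3), (0.622, 3), (0.647, 3)],
    "CODETTE": [(0.637, 3), (0.638, 3), (0.668, 2), (0.659, 3), (0.630, 3), (0.687, 3)],
}

# Overall composite means reported in Section 1.
overall_reported = {"SINGLE": 0.338, "MULTI": 0.632, "MEMORY": 0.636, "CODETTE": 0.652}


def pooled_mean(rows):
    """Average of category means, weighted by category size."""
    total_n = sum(n for _, n in rows)
    return sum(mean * n for mean, n in rows) / total_n


pooled = {cond: pooled_mean(rows) for cond, rows in category_results.items()}
for cond, value in pooled.items():
    print(f"{cond}: pooled={value:.4f} reported={overall_reported[cond]:.3f}")
```

Each pooled value agrees with the reported overall mean to within rounding, which suggests the category tables and the overall table are drawn from the same 17 evaluations per condition.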

4. Key Findings

  • Multi-perspective vs single: +87.0% composite score (Cohen's d = 7.52, p < 0.0001)
  • Full Codette vs single (total improvement): +93.1% composite score (Cohen's d = 7.88, p < 0.0001)

5. Methodology

Conditions

  1. SINGLE: Single analytical perspective, no memory, no synthesis
  2. MULTI: All 6 reasoning agents (Newton, Quantum, Ethics, Philosophy, DaVinci, Empathy) + critic + synthesis
  3. MEMORY: MULTI + cocoon memory augmentation (FTS5-retrieved prior reasoning)
  4. CODETTE: MEMORY + meta-cognitive strategy synthesis (cross-domain pattern extraction + forged reasoning strategies)

Scoring Dimensions (0-1 scale)

  1. Reasoning Depth (20%): chain length, concept density, ground truth coverage
  2. Perspective Diversity (15%): distinct cognitive dimensions engaged
  3. Coherence (15%): logical flow, transitions, structural consistency
  4. Ethical Coverage (10%): moral frameworks, stakeholders, value awareness
  5. Novelty (15%): non-obvious insights, cross-domain connections, reframing
  6. Factual Grounding (15%): evidence specificity, ground truth alignment, trap avoidance
  7. Turing Naturalness (10%): conversational quality, absence of formulaic AI patterns
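
The percentages above sum to 100%, which suggests the composite is a weighted sum of the seven dimension scores. The sketch below works under that assumption (the report does not spell out the formula) and applies the weights to the mean dimension scores from Section 1. The names `WEIGHTS` and `composite` are hypothetical, not taken from the Codette codebase.

```python
# Dimension weights as listed in the methodology.
WEIGHTS = {
    "depth": 0.20,
    "diversity": 0.15,
    "coherence": 0.15,
    "ethics": 0.10,
    "novelty": 0.15,
    "grounding": 0.15,
    "turing": 0.10,
}


def composite(scores: dict) -> float:
    """Weighted sum of per-dimension scores (assumed composite formula)."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Mean dimension scores copied from the overall results table.
single = {"depth": 0.402, "diversity": 0.237, "coherence": 0.380, "ethics": 0.062,
          "novelty": 0.327, "grounding": 0.456, "turing": 0.412}
codette = {"depth": 0.855, "diversity": 0.994, "coherence": 0.477, "ethics": 0.391,
           "novelty": 0.693, "grounding": 0.622, "turing": 0.245}

single_composite = composite(single)
codette_composite = composite(codette)
print(f"SINGLE:  {single_composite:.3f}")
print(f"CODETTE: {codette_composite:.3f}")
```

Because a weighted sum is linear, applying the weights to the mean dimension scores gives the same result as averaging per-problem composites. The values land within rounding of the reported composites (0.338 and 0.652), which is consistent with the weighted-sum assumption.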

Problem Set

  • 17 problems across 6 categories
  • Categories: reasoning (3), ethics (3), creative (2), meta-cognitive (3), adversarial (3), Turing (3)
  • Difficulty: easy (1), medium (6), hard (10)

Statistical Tests

  • Welch's t-test (unequal variance) for pairwise condition comparisons
  • Cohen's d for effect size estimation
  • Significance threshold: p < 0.05
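
The effect size and test statistic can be approximated from the summary statistics in Section 1. This is a sketch under stated assumptions, not the benchmark's implementation: it assumes Cohen's d uses a pooled standard deviation with equal group sizes, that the reported std values are the ones the tests were run on, and it uses the rounded means and standard deviations, so results differ slightly from the reported figures. The function names are hypothetical.

```python
import math


def cohens_d(mean_a, std_a, mean_b, std_b):
    """Cohen's d with a pooled standard deviation (equal group sizes assumed)."""
    pooled_std = math.sqrt((std_a ** 2 + std_b ** 2) / 2)
    return (mean_b - mean_a) / pooled_std


def welch_t(mean_a, std_a, n_a, mean_b, std_b, n_b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    var_a = std_a ** 2 / n_a
    var_b = std_b ** 2 / n_b
    t_stat = (mean_b - mean_a) / math.sqrt(var_a + var_b)
    df = (var_a + var_b) ** 2 / (var_a ** 2 / (n_a - 1) + var_b ** 2 / (n_b - 1))
    return t_stat, df


# SINGLE vs MULTI, using the rounded values from the overall results table.
d = cohens_d(0.338, 0.038, 0.632, 0.040)
t_stat, df = welch_t(0.338, 0.038, 17, 0.632, 0.040, 17)
print(f"d={d:.2f} t={t_stat:.2f} df={df:.1f}")
```

With these inputs, d and t come out close to the reported 7.518 and 21.918. The remaining gap is expected from rounding in the table.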