
Convergent Intelligence PRO

reaperdoesntknow

AI & ML interests

About Us

Mission: Convergent Intelligence advances original research in discrepancy calculus, adaptive systems, and applied AI, translating those insights into client controls, playbooks, and leadership-ready briefs.

Recent Activity

reacted to their post with 👀 about 3 hours ago
# Three Teachers, One Student: Dual-Cognition Reasoning at 1.7B

We distilled Qwen3-30B-A3B into 1.7B students that critique their own reasoning. H100, BF16, Apache 2.0. Here's our pipeline.

**Stage 1 — Three Teachers, Three Profiles.** Same 30B base, three variants: Instruct (structured output), Thinking (extended deliberation), Coder (STEM decomposition). Each distillation uses proof-weighted KD — 2.25× amplified loss on reasoning tokens, decaying to 1.1×. The student learns *where to think harder*, not just what to output.

**Stage 2 — Topology-Aware KD (TKD).** Standard KD treats the teacher's distribution as smooth. Language isn't smooth — it has topic shifts, reasoning pivots, register changes. We use Discrepancy Calculus to detect these structural boundaries, then amplify loss at jumps (3σ threshold) and cut training windows at low-discrepancy positions. The student preserves the teacher's structural knowledge, not just surface statistics.

**Stage 3 — Ghost Imprinting.** Sequential distillation from different teachers leaves residual fields in weight space that neither teacher put there individually. The Cantor component of BV decomposition, applied to parameters. Models distilled Thinking→Coder exhibit deliberation patterns from the Thinking teacher that survived Coder overwriting. Emergent capability from structural residuals.

**Stage 4 — DualMind.** One model, two voices, shared weights:

```
<explore> — free derivation, speculation
<examine> — adversarial self-critique
<response> — clean synthesis
```

The multi-model collision array collapsed into a single architecture. Role tokens, no extra parameters.

For the full method: https://huggingface.co/reaperdoesntknow/DualMind_Methodolgy doi:10.57967/hf/8184.
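The proof-weighted KD in Stage 1 amounts to a per-token reweighting of the standard distillation loss. A minimal sketch of that idea, assuming a precomputed `reasoning_mask` marking reasoning tokens and a linear decay of the amplification from 2.25× to 1.1× over training (the mask construction and decay shape are assumptions, not the post's exact recipe):

```python
import numpy as np

def proof_weighted_kd_loss(teacher_logits, student_logits, reasoning_mask,
                           step, total_steps, w_start=2.25, w_end=1.1):
    """Token-level KD loss with amplified weight on reasoning tokens.

    teacher_logits, student_logits: (seq_len, vocab) arrays.
    reasoning_mask: (seq_len,) bool array, True on reasoning tokens
    (how tokens get tagged as "reasoning" is an assumption here).
    """
    def softmax(x):
        x = x - x.max(axis=-1, keepdims=True)
        e = np.exp(x)
        return e / e.sum(axis=-1, keepdims=True)

    p = softmax(teacher_logits)                          # teacher distribution
    log_q = np.log(softmax(student_logits) + 1e-12)      # student log-probs
    kl = (p * (np.log(p + 1e-12) - log_q)).sum(axis=-1)  # per-token KL(p || q)

    # Amplification on reasoning tokens decays linearly 2.25x -> 1.1x.
    frac = step / max(total_steps, 1)
    w_reason = w_start + (w_end - w_start) * frac
    weights = np.where(reasoning_mask, w_reason, 1.0)
    return (weights * kl).mean()
```

The decay means early training pushes the student hard toward the teacher on reasoning spans, then relaxes toward near-uniform weighting as the student converges.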
reacted to their post with 👍 about 3 hours ago
We present a methodology for training small language models on CPU at FP32 precision that achieves capability-per-dollar efficiency orders of magnitude beyond GPU-based training. Across 15 models spanning four novel architecture families — Mixture of Attentions (MoA), cross-architecture fusion (Qemma), swarm intelligence (SAGI), and metric-space causal language models (DiscoverLM) — total compute cost was $24 on a single AMD EPYC 9454P processor. We introduce seven methodological pillars: (1) FP32 precision preservation, with experiments demonstrating 5,810× single-operation error and 23,225× compounding error ratio for FP16 at network depth; (2) sparse cognitive architectures where 0.02–7% of parameters activate per token, matching CPU branching rather than GPU SIMD; (3) developmental curriculum training progressing from language to logic to transfer to depth; (4) continuous belt-fed data ingestion eliminating truncation waste; (5) hardware-native optimization for AMD Zen 4 via AOCL/OpenMP/NUMA-aware allocation; (6) self-regulating thermodynamic governance with emergent temperature measurement grounded in L2-star discrepancy; and (7) open-standard compute (AVX2 SIMD at FP32) free of proprietary vendor dependency. We argue that transformers were designed for GPU hardware rather than mathematical optimality, and that architectures designed for geometric correctness — metric-space attention, triangle inequality enforcement, sparse expert routing — naturally favor CPU execution. For sub-2B parameter models, CPU training produces more capable models at a fraction of the cost.
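The compounding-error claim in pillar (1) is easy to probe qualitatively: push the same input through a stack of linear layers at FP64 (reference), FP32, and FP16 and compare the drift. A minimal sketch, where the depth, width, and error metric are illustrative assumptions rather than the paper's experimental setup:

```python
import numpy as np

def relative_error_at_depth(depth=32, dim=64, seed=0):
    """Propagate a vector through `depth` random tanh layers at three
    precisions; return relative error of FP32 and FP16 vs the FP64 reference."""
    rng = np.random.default_rng(seed)
    x64 = rng.normal(size=dim).astype(np.float64)
    x32 = x64.astype(np.float32)
    x16 = x64.astype(np.float16)
    for _ in range(depth):
        # Fresh random weights each layer, scaled to keep activations bounded.
        w = rng.normal(scale=dim ** -0.5, size=(dim, dim))
        x64 = np.tanh(w @ x64)
        x32 = np.tanh(w.astype(np.float32) @ x32)
        x16 = np.tanh(w.astype(np.float16) @ x16)
    ref = np.linalg.norm(x64)
    err32 = np.linalg.norm(x64 - x32.astype(np.float64)) / ref
    err16 = np.linalg.norm(x64 - x16.astype(np.float64)) / ref
    return err32, err16
```

Running this shows the FP16 trajectory diverging from the reference far faster than FP32 as depth grows, which is the qualitative effect the abstract quantifies.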
updated a collection about 4 hours ago
DualMind

Organizations

Convergent Intelligence LLC