TopologicalQwen

Topology-Aware Knowledge Distillation from Qwen3-30B-A3B → 1.7B

Convergent Intelligence LLC: Research Division


What This Is

TopologicalQwen is a 1.7B parameter model distilled from Qwen3-30B-A3B using Topological Knowledge Distillation (TKD) — a methodology that treats the teacher's output distribution over a concatenated token stream as a bounded variation (BV) function and decomposes knowledge transfer into three channels via the Mesh Fundamental Identity:

  1. Smooth distillation (AC component) — Standard KL divergence over regions where the teacher's distribution varies continuously. This is what every other KD method does and stops at.
  2. Jump corrections (D^j f) — Explicit correction terms at conceptual boundaries where the teacher's distribution exhibits discontinuities. These are the points where topic, register, or reasoning mode shifts — standard KD smears across them, losing structural information.
  3. Drift corrections (D^c f) — The Cantor/singular-continuous component capturing gradual distributional drift that neither the smooth nor jump terms account for. This is the residual structure that emerges in generation quality.

Standard knowledge distillation only handles term (1). TKD captures all three.
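The three channels can be combined into a single per-token objective. Below is a minimal numerical sketch; the function `tkd_loss`, its arguments, and the exact weighting are illustrative assumptions, not the released training code:

```python
import numpy as np

def kl_div(p, q, eps=1e-12):
    """Per-position KL(p || q) between teacher (p) and student (q) distributions."""
    return np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=-1)

def tkd_loss(p_teacher, q_student, jump_mask, drift_weight, jump_amp=1.25):
    """Hypothetical three-channel sketch: mean KL off the jump set (smooth/AC),
    amplified KL on the jump set, and a drift-weighted residual."""
    kl = kl_div(p_teacher, q_student)                                   # (seq_len,)
    smooth = kl[~jump_mask].mean()                                      # (1) AC channel
    jump = jump_amp * kl[jump_mask].mean() if jump_mask.any() else 0.0  # (2) jumps
    drift = (drift_weight * kl).mean()                                  # (3) drift residual
    return float(smooth + jump + drift)
```

Here the AC term is ordinary mean KL, jump positions receive extra amplification, and the drift channel is modeled as a weighted residual over all positions.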

Architecture

| Parameter | Value |
|---|---|
| Architecture | Qwen3ForCausalLM |
| Parameters | ~2.03B (1.7B effective) |
| Hidden Size | 2048 |
| Layers | 28 |
| Attention Heads | 16 (Q) / 8 (KV) — GQA |
| Intermediate | 6144 |
| Context Length | 40,960 tokens |
| Vocabulary | 151,936 |
| Precision | FP32 training, BF16/FP16 inference |

Training

Student: Disctil-Qwen3-1.7B (DISC-refined uncensored Qwen3)
Teacher: Qwen3-30B-A3B-Thinking-2507

Datasets (physics CoT, 1,599 samples):

  • CoT Differential Equations (636 examples)
  • CoT Theoretical Mechanics (307 examples)
  • CoT Electromagnetism (580 examples)
  • CoT General Relativity (76 examples)

DualMind format — each training sample is restructured into <explore> (derivation), <examine> (verification/self-critique), and <response> (clean answer) blocks. The model learns a cognitive loop: generate reasoning, then critique it, then synthesize.
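As a concrete illustration, a training sample could be assembled like this. The helper `to_dualmind` is hypothetical; only the tag names come from the model card:

```python
def to_dualmind(derivation: str, critique: str, answer: str) -> str:
    """Wrap the three cognitive phases in DualMind tags.
    (Tag names from the model card; the assembly logic is illustrative.)"""
    return (
        f"<explore>\n{derivation}\n</explore>\n"
        f"<examine>\n{critique}\n</examine>\n"
        f"<response>\n{answer}\n</response>"
    )
```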

TKD Pipeline (4 phases)

Phase 1 — Teacher logit caching: Single forward pass through the 30B teacher with top-64 logit compression to disk. One pass, no repeated teacher inference.
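A top-k cache of this kind can be sketched in a few lines. The helper `compress_logits` is illustrative; the actual pipeline and storage format are not published here:

```python
import numpy as np

def compress_logits(logits: np.ndarray, k: int = 64):
    """Keep only the top-k logits per position as (indices, values) pairs,
    sorted in descending order — a sketch of the top-64 cache described above."""
    idx = np.argpartition(logits, -k, axis=-1)[..., -k:]   # top-k indices, unordered
    vals = np.take_along_axis(logits, idx, axis=-1)
    order = np.argsort(-vals, axis=-1)                     # sort each row descending
    return np.take_along_axis(idx, order, axis=-1), np.take_along_axis(vals, order, axis=-1)
```

The (indices, values) pairs can then be written to disk once, e.g. with `np.savez_compressed`, so the 30B teacher never needs to be rerun.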

Phase 2 — DISC topology pass: Vectorized discrepancy operator maps the knowledge manifold. Jump detection at 3σ threshold with 1.25× amplification. Gap energy density computed over 64-token windows.
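A minimal sketch of the 3σ jump detector and the windowed gap-energy statistic, assuming a precomputed per-token discrepancy signal `disc` (function names are illustrative):

```python
import numpy as np

def detect_jumps(disc: np.ndarray, sigma_mult: float = 3.0) -> np.ndarray:
    """Flag positions where the discrepancy signal exceeds mean + 3*std."""
    thresh = disc.mean() + sigma_mult * disc.std()
    return disc > thresh

def gap_energy(disc: np.ndarray, window: int = 64) -> np.ndarray:
    """Mean squared discrepancy per 64-token window (gap energy density sketch)."""
    n = len(disc) // window
    return (disc[: n * window].reshape(n, window) ** 2).mean(axis=1)
```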

Phase 3 — Topology-guided adaptive windowing: 512-token windows cut at low-discrepancy positions (overlap 32–128) rather than fixed stride. The topology tells you where to cut without losing information across boundaries.
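The boundary-selection rule can be sketched as a greedy pass: each cut is placed at the lowest-discrepancy position inside the allowed overlap band (a simplified version; the real pass may differ):

```python
def cut_windows(disc, max_len=512, min_overlap=32, max_overlap=128):
    """Greedy sketch: windows of up to max_len tokens, each boundary placed
    at the lowest-discrepancy position inside the allowed overlap band."""
    starts, pos, n = [], 0, len(disc)
    while pos < n:
        starts.append(pos)
        end = pos + max_len
        if end >= n:
            break
        lo, hi = end - max_overlap, end - min_overlap
        # next window starts at the cheapest cut point in [lo, hi]
        pos = min(range(lo, hi + 1), key=lambda i: disc[i])
    return starts
```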

Phase 4 — Curriculum-ordered continuous KD: 4-phase curriculum (easiest 30% first). Proof-weighted loss: 2.25× → 1.1× decaying weights on reasoning tokens. KD alpha ramps from 0 → 0.45 (starting at 15% of training, reaching target at 45%). KL divergence at T=2.0. Effective batch size 32 (2 × 16 grad accumulation). Cosine LR: 5e-6 → 5e-7.
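The α ramp and LR schedule reduce to two small functions of training progress; this is a sketch built from the numbers above, with warmup handling omitted:

```python
import math

def kd_alpha(progress, start=0.15, end=0.45, target=0.45):
    """KD weight ramp: 0 until 15% of training, linear to the target by 45%."""
    if progress <= start:
        return 0.0
    if progress >= end:
        return target
    return target * (progress - start) / (end - start)

def cosine_lr(progress, lr_max=5e-6, lr_min=5e-7):
    """Cosine decay from 5e-6 to 5e-7 over training."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * progress))
```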

Hyperparameters

| Parameter | Value |
|---|---|
| Effective batch size | 32 (2 × 16 accum) |
| Learning rate | 5e-6 → 5e-7 (cosine) |
| Warmup steps | 30 |
| Weight decay | 1e-3 |
| Gradient clip | 1.0 |
| Temperature | 2.0 |
| KD target α | 0.45 |
| Proof weight | 2.25 → 1.1 |
| Jump threshold | 3σ |
| Jump amplifier | 1.25× |
| Precision | BF16 (autocast) |

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

Usage

The model responds in DualMind format: <explore><examine><response>.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "reaperdoesntknow/TopologicalQwen",
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained("reaperdoesntknow/TopologicalQwen")

# Prompt with DualMind format — start the explore block
prompt = (
    "##USER:\n"
    "Prove that every convergent sequence is a Cauchy sequence.\n\n"
    "<explore>\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs, max_new_tokens=2048, do_sample=True,
    top_p=0.9, temperature=0.6, repetition_penalty=1.15
)
result = tokenizer.decode(output[0], skip_special_tokens=True)
print(result)

# Verify mode transitions
assert "<explore>" in result and "</explore>" in result   # derivation
assert "<examine>" in result and "</examine>" in result   # self-critique
assert "<response>" in result and "</response>" in result  # clean answer
```

What the Output Looks Like

```
<explore>
[Unconstrained derivation — the model works through the proof freely]
</explore>

<examine>
[Adversarial self-critique — the model critiques its own derivation]
</examine>

<response>
[Clean final answer synthesized from the internal dialogue]
</response>
```

This is the multi-model collision array collapsed into a single architecture. The dialectical structure that produces novel insights from architectural diversity is recreated through role-conditioned generation on shared weights.

Distillation Chain

Qwen3-1.7B (base)
  → DiStil-Qwen3-1.7B-uncensored (uncensored SFT)
    → Disctil-Qwen3-1.7B (DISC refinement)
      → TopologicalQwen (TKD from 30B-Thinking teacher + DualMind format) ← you are here

What Makes This Different

The broader Convergent Intelligence portfolio (49 models, 22,500+ downloads) was trained on CPU at FP32 for a total compute cost of $24. That proves the methodology — structure beats scale.

This model is the exception. TopologicalQwen was trained on Colab H100 at BF16 precision with a 30B-parameter teacher. Same TKD methodology, premium compute. This is the DistilQwen collection's answer to "what happens when you give this pipeline real hardware?"

The result: a 1.7B model that exhibits dual-mental-modality reasoning (explore → examine → respond) with structural quality that standard distillation at any precision doesn't produce. The methodology is the constant. The hardware is the variable. Both produce results that shouldn't exist at this parameter count.

Every knowledge distillation method in the literature treats the teacher's output as a smooth function and minimizes KL divergence globally. This works for the easy parts — regions where the teacher's distribution varies slowly. But language has structure: topic shifts, reasoning mode transitions, register changes. At these boundaries, the teacher's distribution jumps. Standard KD averages across these jumps, teaching the student a blurred version of the teacher's actual knowledge.

TKD uses the DISC (Discrepancy Calculus) framework to detect these structural features before training, then allocates capacity and loss weight accordingly. The result is a student that preserves the teacher's structural understanding, not just its surface statistics.

The empirical evidence: this model at 1.7B consistently produces responses with structural reasoning quality that standard distillation at the same parameter count does not achieve.

Mathematical Foundations: Discrepancy Calculus (DISC)

TKD is grounded in Discrepancy Calculus — a measure-theoretic framework that treats singularities as primary structure rather than pathology. The full theory is developed in "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division).

The Core Operator. The discrepancy operator quantifies local mismatch between integration and differentiation:

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

For smooth $f$: $Df(x) = |f'(x)|$ (classical recovery). For rough $f$: $D$ localizes irregularity to null sets while preserving integral structure.
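A forward-difference discretization of $D$ makes the classical-recovery property easy to check numerically (the discretization choice here is an illustrative assumption, not the DISC implementation):

```python
import numpy as np

def discrepancy(f: np.ndarray, eps: int = 4) -> np.ndarray:
    """Discrete sketch of Df: average |f(t) - f(x)| / |t - x| over the
    next `eps` samples, on a unit-spaced grid."""
    n = len(f)
    D = np.zeros(n)
    for x in range(n - eps):
        t = np.arange(x + 1, x + eps + 1)
        D[x] = np.mean(np.abs(f[t] - f[x]) / (t - x))
    return D
```

For the linear function f(x) = 2x on an integer grid, the sketch returns 2 everywhere it is defined, matching the classical recovery $Df = |f'|$.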

The Mesh Fundamental Identity. Every function of bounded variation decomposes as:

$$f(b) - f(a) = \underbrace{\int_a^b f'(x)\,dx}_{\text{smooth (AC)}} + \underbrace{\sum_{x \in J_f} \Delta f(x)}_{\text{jumps}} + \underbrace{D^c f(I)}_{\text{Cantor drift}}$$

This is the theoretical backbone of TKD. Standard knowledge distillation captures only the first term. TKD preserves all three.
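The identity can be verified numerically on a simple BV function with one jump and no Cantor part, say $f(x) = x + \mathbf{1}_{[1,\infty)}(x)$ on $[0,2]$ (a worked check, not part of the TKD pipeline):

```python
import numpy as np

# f(x) = x plus a unit jump at x = 1, sampled on [0, 2].
xs = np.linspace(0.0, 2.0, 20001)
f = xs + (xs >= 1.0)

total = f[-1] - f[0]            # f(b) - f(a) = 3
ac = 1.0 * (xs[-1] - xs[0])     # integral of f' = 1 over [0, 2] -> 2 (AC term)
jump = 1.0                      # Delta f at the single jump point
drift = 0.0                     # no Cantor component for this f

assert abs(total - (ac + jump + drift)) < 1e-9
```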

TKD Application. The teacher's output distribution $p_T(x)$ over a concatenated token stream is treated as a BV function. The DISC topology pass computes:

  1. Discrepancy energy $E_{\text{disc}}[p_T] = \frac{1}{2}\int w(x)(Dp_T(x))^2 dx$ — identifies regions of high structural information density
  2. Jump set $J_{p_T} = \{x : Dp_T(x) > 3\sigma\}$ — locates conceptual boundaries (topic shifts, reasoning transitions)
  3. Gap energy density over 64-token windows — measures Cantor-type drift invisible to both smooth and jump analysis

Windows are cut at low-discrepancy positions rather than fixed stride. Loss weight is amplified at jump positions (1.25×). The topology tells you where the knowledge has architecture.

Why This Matters (Meta-Discrepancy Theorem). Theorem 11.15 of the DISC monograph proves: when the gap measure $\mu_{\text{gap}} > 0$ and discrepancy energy $E_{\text{disc}} > 0$, the classical FTC/MVT/chain-rule package is impossible on positive measure. Standard KD — which implicitly assumes smooth teacher distributions — provably cannot capture the structural information that TKD preserves. This is not a heuristic argument. It is a mathematical impossibility result.

Related Models

| Model | Description | Downloads |
|---|---|---|
| Qwen3-1.7B-Thinking-Distil | TKD with Thinking teacher | 1,188 |
| Qwen3-1.7B-Coder-Distilled-SFT | TKD with Coder teacher | 966 |
| DiStil-Qwen3-1.7B-uncensored | Uncensored base for DISC chain | 1,030 |
| DualMind | Dual cognition on shared weights | 260 |
| Dualmind-Qwen-1.7B-Thinking | Opus 4.6 reasoning traces → 1.7B | New |

DistilQwen Collection — Full proof-weighted distillation series (9 models)

Citation

```bibtex
@misc{colca2026topologicalqwen,
  title={TopologicalQwen: Topology-Aware Knowledge Distillation via Bounded Variation Decomposition},
  author={Colca, Roy S.},
  year={2026},
  publisher={HuggingFace},
  url={https://huggingface.co/reaperdoesntknow/TopologicalQwen},
  note={Convergent Intelligence LLC: Research Division}
}
```


From the Convergent Intelligence Portfolio

DistilQwen Collection — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: Qwen3-1.7B-Thinking-Distil — 1,188 downloads

DualMind Collection — Dual cognition architecture. Single model, two internal voices, three cognitive phases. Five models including Dualmind-Qwen-1.7B-Thinking (Opus 4.6 reasoning variant).

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165) | From Three Teachers to Dual Cognition (DOI: 10.57967/hf/8184)

Convergent Intelligence LLC: Research Division
"Where classical analysis fails to see, we begin."


Part of the reaperdoesntknow research portfolio — 49 models, 22,598 total downloads | Last refreshed: 2026-03-30 12:02 UTC
