Model Card for Qemma-Q14B

Gap Envelope Integral

  • My mathematical formulation, which uses space projections to measure the jump between points of discontinuity in non-differentiable functions.

Redux

  • This model underwent an additional merge between Qemma-redux and Qwen3-14B, with RoPE scaling added.

Additionally

  • Fusion logic was updated to support per-layer fusion and post-fusion embedding alignment.
  • Qemma is a HuggingFace-native hybrid model that merges Gemma-3 (1B) and Qwen-3 (14B) at the weight level (no adapters).
  • Design: Gemma MLP/body + Qwen attention/head, projected and aligned to Gemma’s hidden size. The model is then SFT-tuned for stepwise reasoning.
  • This variant uses YaRN-based RoPE scaling with a 1:* ratio from max_position_embeddings = 524288.
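The YaRN RoPE-scaling setup above can be sketched as a `rope_scaling` override for Transformers (a minimal sketch, not the card's exact configuration: the `factor` value is a placeholder assumption, since the card only states a 1:* ratio):

```python
# Sketch of a YaRN RoPE-scaling override, e.g. passed to
# AutoModelForCausalLM.from_pretrained(model_id, rope_scaling=rope_scaling).
rope_scaling = {
    "rope_type": "yarn",
    "factor": 4.0,  # placeholder assumption; the card gives the ratio as 1:*
    "original_max_position_embeddings": 524288,
}
print(rope_scaling["rope_type"])
```

Adjust `factor` to the context-length extension you actually need; the dictionary shape follows Transformers' standard `rope_scaling` convention.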

Quick start

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "reaperdoesntknow/Qemma-Q14B"
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16).eval()

text = (
    "<|user|>"
    "What makes the sky blue?."
    "<|assistant|>"
    "<think><reasoning_step>"
)

inputs = tokenizer(text, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=256, do_sample=True, min_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

What’s inside

  • Architecture:
  • Gemma-3 backbone (26 layers, hidden size 1152, MLP size 6912)
  • Qwen-style attention regrouped onto Gemma’s 4×256 heads (head_dim=128, hidden_size=5120, intermediate_size=17408, num_attention_heads=40, num_key_value_heads=8, num_hidden_layers=40)
  • Tokenizer: Gemma-3 tokenizer and chat template (see chat_template.jinja).
  • Training: SFT for instruction following and stepwise reasoning.

Intended use & limitations

Use: research, instruction following, code/help, analysis, further SFT/RLHF. Limits: may hallucinate; not for safety-critical, medical, legal, or financial decisions. Follow dataset/model licenses.

Training procedure

  • ~512 warm-start steps on HuggingFaceH4/ultrachat_200k, followed by a small post-fusion round (8 steps) to encourage embedding realignment.
  • ~256 SFT steps on TIGER-Lab/MathInstruct + HuggingFaceH4/ultrachat_200k.

Framework versions

  • TRL: 0.25.0
  • Transformers: 4.57.1
  • PyTorch: 2.8.0+cpu
  • Datasets: 4.4.1
  • Tokenizers: 0.22.1

Mathematical Foundations: Discrepancy Calculus (DISC)

Qemma's cross-architecture fusion is built on the Gap Envelope Integral (GEI) — Chapter 7 of the DISC monograph. The discrepancy operator:

$$Df(x) = \lim_{\varepsilon \downarrow 0} \frac{1}{\varepsilon} \int_x^{x+\varepsilon} \frac{|f(t) - f(x)|}{|t - x|}\, dt$$

When two architectures (Gemma, Qwen) are fused at the weight level, the resulting function is not smooth — it has jump discontinuities at layer boundaries where the two architectures' representations meet. The Mesh Fundamental Identity decomposes this:

$$f(b) - f(a) = \int_a^b f'(x)\,dx + \sum_{x \in J_f} \Delta f(x) + D^c f(I)$$

The GEI specifically measures the jump component $\sum_{x \in J_f} |\Delta f(x)|$: the total mismatch at fusion boundaries. The Gap Envelope Integral is the mathematical tool for quantifying what happens when two different weight geometries are projected into a shared space.
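The decomposition can be checked numerically. The sketch below (my own illustration, not from the DISC monograph) uses a piecewise-linear f with a single jump of height 2 at x = 0, for which the singular-continuous term $D^c f$ vanishes, and recovers both the smooth integral and the jump sum from finite differences:

```python
import numpy as np

def f(x):
    # f(x) = x for x < 0, x + 2 for x >= 0: one jump of height 2 at x = 0
    return np.where(x < 0.0, x, x + 2.0)

a, b, n = -1.0, 1.0, 200_000
xs = np.linspace(a, b, n + 1)
ys = f(xs)
diffs = np.diff(ys)      # per-cell increments of f
h = (b - a) / n          # grid spacing

jump_cells = np.abs(diffs) > 10 * h     # cells whose increment dwarfs a smooth step
smooth_part = diffs[~jump_cells].sum()  # approximates the integral of f'(x) dx
jump_part = diffs[jump_cells].sum()     # approximates the sum of jumps Delta f(x)

total = float(ys[-1] - ys[0])           # f(b) - f(a) = 4
print(smooth_part, jump_part, total)    # smooth ~ 2, jump ~ 2, total = 4
```

The two recovered pieces sum to f(b) - f(a), matching the Mesh Fundamental Identity term by term for this f.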

Full theory: "On the Formal Analysis of Discrepancy Calculus" (Colca, 2026; Convergent Intelligence LLC: Research Division).

Citations

Cite TRL as:

@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}

Convergent Intelligence Portfolio

By Convergent Intelligence LLC: Research Division

Top Models from Our Lab

Total Portfolio: 49 models, 22,598 total downloads

Last updated: 2026-03-28 12:57 UTC


From the Convergent Intelligence Portfolio

DistilQwen Collection — Our only BF16 series. Proof-weighted distillation from Qwen3-30B-A3B → 1.7B and 0.6B on H100. Three teacher variants (Instruct, Thinking, Coder), nine models, 2,788 combined downloads. The rest of the portfolio proves structure beats scale on CPU. This collection shows what happens when you give the methodology real hardware.

Top model: Qwen3-1.7B-Coder-Distilled-SFT — 508 downloads

Full methodology: Structure Over Scale (DOI: 10.57967/hf/8165)

