Spartacus-1B-Instruct -- Causal Monoid Language Model
A 1.3B parameter language model that replaces softmax attention with causal monoid state compression, achieving O(1) time per token and O(1) memory at inference, regardless of sequence length.
Fine-tuned for enhanced reasoning with structured chain-of-thought data.
Monoid Attention -- Internal Structure

    MonoidAttention (per layer, per head)

    x_t in R^{2048}
      |
      +--> q_proj --> RMSNorm ------------> q_t     in R^{d}   (query)
      |
      +--> k_proj --> RMSNorm --> SiLU ---> k_t     in R^{d}   (key, >= 0)
      |
      +--> v_proj ------------------------> v_t     in R^{d}   (value)
      |
      +--> decay_proj --> sigmoid --------> alpha_t in (0,1)   (decay gate)

    State update (State Matrix S_t in R^{d x d}, "compressed causal history"):

        S_t = alpha_t * S_{t-1} + k_t (x) v_t

    Readout:

        o_t = q_t . S_t --> o_proj --> output
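For concreteness, here is a minimal PyTorch sketch of a single per-token, per-head step matching the diagram above. The class name, bias settings, norm placement, and the scalar-per-head decay gate are assumptions made for illustration; the authoritative implementation lives in MonoidForCausalLM.py.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MonoidHeadStep(nn.Module):
    """One head of monoid attention, advanced one token at a time (illustrative sketch)."""

    def __init__(self, d_model: int = 2048, d_head: int = 64):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_head, bias=False)
        self.k_proj = nn.Linear(d_model, d_head, bias=False)
        self.v_proj = nn.Linear(d_model, d_head, bias=False)
        self.decay_proj = nn.Linear(d_model, 1)      # assumed: one decay scalar per head
        self.q_norm = nn.RMSNorm(d_head)             # requires PyTorch >= 2.4
        self.k_norm = nn.RMSNorm(d_head)

    def forward(self, x_t, S_prev):
        # x_t: (batch, d_model), S_prev: (batch, d_head, d_head)
        q_t = self.q_norm(self.q_proj(x_t))              # query
        k_t = F.silu(self.k_norm(self.k_proj(x_t)))      # key (SiLU-activated)
        v_t = self.v_proj(x_t)                           # value
        alpha_t = torch.sigmoid(self.decay_proj(x_t))    # decay gate in (0, 1)

        # S_t = alpha_t * S_{t-1} + k_t (x) v_t  -- fixed-size state update
        S_t = alpha_t.unsqueeze(-1) * S_prev + k_t.unsqueeze(-1) * v_t.unsqueeze(-2)
        # o_t = q_t . S_t  -- state readout (contract over the key dimension)
        o_t = torch.einsum("bk,bkv->bv", q_t, S_t)
        return o_t, S_t
```

Only S_t (one d x d matrix per head) is carried between generation steps, which is what makes per-token time and memory constant.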
Monoid State Diagonal -- O(1) Compression Contour
The state matrix S_t accumulates causal history along its diagonal. Each head maintains an independent d x d state that compresses ALL past tokens into a fixed footprint:
    State Matrix S_t in R^{64 x 64} (one per head, 32 heads per layer)

                     k-dim -->
          0   8   16  24  32  40  48  56  63
        +---+---+---+---+---+---+---+---+   0
        |***|** |*  |   |   |   |   |   |      v-dim
        |***|** |*  |.  |   |   |   |   |        |
        +---+---+---+---+---+---+---+---+   8    |
        |** |***|** |*  |.  |   |   |   |        v
        |*  |***|** |*  |.  |   |   |   |
        +---+---+---+---+---+---+---+---+  16
        |*  |** |***|** |*  |.  |   |   |
        |.  |*  |***|** |*  |.  |   |   |
        +---+---+---+---+---+---+---+---+  24
        |   |.  |** |***|** |*  |.  |   |
        |   |   |*  |***|** |*  |.  |   |
        +---+---+---+---+---+---+---+---+  32
        |   |   |.  |** |***|** |*  |.  |
        |   |   |   |*  |***|** |*  |.  |
        +---+---+---+---+---+---+---+---+  40
        |   |   |   |.  |** |***|** |*  |
        |   |   |   |   |*  |***|** |*  |
        +---+---+---+---+---+---+---+---+  48
        |   |   |   |   |.  |** |***|** |
        |   |   |   |   |   |*  |***|** |
        +---+---+---+---+---+---+---+---+  56
        |   |   |   |   |   |.  |** |***|
        |   |   |   |   |   |   |*  |***|
        +---+---+---+---+---+---+---+---+  63

    Legend: *** = high activation (recent tokens, alpha^0 ~ alpha^2)
            **  = medium           (alpha^3 ~ alpha^5)
            *   = fading           (alpha^6 ~ alpha^10)
            .   = near-zero        (alpha^11+, effectively forgotten)
                = zero             (never reached or fully decayed)
The diagonal band emerges because S_t = SUM_{i<=t} (PROD_{i<j<=t} alpha_j) * k_i (x) v_i, which reduces to alpha^{t-i} weighting when the decay gate is roughly constant. Recent outer products dominate near the diagonal; older ones decay exponentially via alpha, creating this characteristic contour.
Key Properties
| Property | Transformer (Llama) | Spartacus (Monoid) |
|---|---|---|
| Inference time per token | O(T) -- scans full KV-cache | O(1) -- single state update |
| Inference memory per layer | O(T) -- stores all past K,V | O(1) -- fixed d x d state matrix |
| Sequence length extrapolation | Degrades beyond training length | Unlimited -- state size is constant |
| Causality | Imposed via attention mask | Built into the recurrence |
| Training complexity | O(T^2) | O(T) via parallel prefix scan |
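As a back-of-envelope illustration of the memory row (my numbers, not the repo's; the Transformer figure assumes a full multi-head KV-cache, whereas Llama 3.2 uses grouped-query attention and therefore stores less, though still O(T)):

```python
BYTES = 2  # bfloat16

def kv_cache_bytes_per_layer(seq_len, n_heads=32, d_head=64):
    # Transformer: keep K and V vectors for every past token
    return seq_len * n_heads * d_head * 2 * BYTES

def monoid_state_bytes_per_layer(n_heads=32, d_head=64):
    # Spartacus: one fixed d_head x d_head state per head, independent of seq_len
    return n_heads * d_head * d_head * BYTES

for T in (2_048, 32_768, 1_000_000):
    kv = kv_cache_bytes_per_layer(T) / 2**20
    st = monoid_state_bytes_per_layer() / 2**20
    print(f"T={T:>9,}  KV-cache/layer: {kv:9.1f} MiB   monoid state/layer: {st:.2f} MiB")
```

The monoid state stays at 0.25 MiB per layer (4 MiB across 16 layers) no matter how long the prompt grows.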
The Monoid Recurrence
Standard attention computes:
o_t = sum_{i<=t} softmax(q_t . k_i) v_i -- requires O(T) KV-cache
Monoid attention compresses the entire causal history into a fixed-size state matrix S_t per head:
S_t = alpha_t * S_{t-1} + k_t (x) v_t -- explicit causal recurrence
o_t = q_t . S_t -- state readout
where alpha_t = sigmoid(decay_proj(x_t)) is a learned, content-dependent decay gate that controls how fast past information fades.
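A quick numerical check (illustrative, not taken from the repo) that unrolling the recurrence reproduces a decayed sum of key-value outer products, i.e. the entire causal history compressed into one d x d matrix:

```python
import torch

torch.manual_seed(0)
T, d = 16, 8
k = torch.rand(T, d)                          # keys (non-negative, loosely mimicking SiLU keys)
v = torch.randn(T, d)                         # values
alpha = torch.sigmoid(torch.randn(T, 1, 1))   # content-dependent decay gates in (0, 1)

# Recurrent form: S_t = alpha_t * S_{t-1} + k_t (x) v_t
S = torch.zeros(d, d)
for t in range(T):
    S = alpha[t] * S + torch.outer(k[t], v[t])

# Closed form: S_T = sum_i (prod_{j > i} alpha_j) * k_i (x) v_i
S_closed = torch.zeros(d, d)
for i in range(T):
    w = alpha[i + 1:].prod()                  # empty product = 1 for the most recent token
    S_closed += w * torch.outer(k[i], v[i])

print(torch.allclose(S, S_closed, atol=1e-5))  # True
```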
Explicit Causal Modeling
Unlike Transformers where causality is a constraint imposed by masking, Spartacus makes causality a first-class citizen:
- The decay gate `alpha_t` explicitly controls per-head information retention at every timestep (see the retention sketch after this list)
- The model learns when to forget rather than encoding where tokens are (no positional encoding needed)
- No attention mask required -- causality is structural, not enforced
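To make the retention behaviour concrete (hypothetical gate values, not measured from the trained model): if a head's gate hovers around a value alpha, a token's contribution to the state halves roughly every ln(0.5)/ln(alpha) steps.

```python
import math

for alpha in (0.99, 0.9, 0.5):
    half_life = math.log(0.5) / math.log(alpha)
    print(f"alpha = {alpha}: contribution halves every ~{half_life:.1f} steps")
# alpha = 0.99 -> ~69 steps, alpha = 0.9 -> ~6.6 steps, alpha = 0.5 -> 1 step
```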
Design Choices
- SiLU-activated keys: `k = SiLU(k_proj(x))` ensures non-negative keys, making the state matrix `S` positive semi-definite (PSD). This prevents "feature erasure", where one token's contribution cancels another's
- Log-space decay: working in log-space with `log(alpha)` avoids numerical underflow when `alpha^T -> 0` for long sequences (see the sketch after this list)
- Learnable h0: the initial state `S_0 = h0` is a learnable parameter (zero-initialized), acting as a compressed "system prompt"
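A small illustration of the log-space decay point (a sketch, not the repo's kernel): a running product of gates underflows to exactly zero for long sequences, while the same quantity kept as a running sum of log(alpha) stays finite and relative weights remain recoverable.

```python
import torch

T = 100_000
log_alpha = torch.full((T,), -0.01)       # i.e. alpha ~= 0.990, stored as log(alpha)

# Direct running product of alpha underflows to 0 long before t = T ...
running = log_alpha.exp().cumprod(dim=0)
print(running[-1])                        # tensor(0.)

# ... while in log-space it is just a running sum, which never underflows
log_running = log_alpha.cumsum(dim=0)
print(log_running[-1])                    # tensor(-1000.)

# Relative decay over, say, the last 50 steps is still recoverable:
print((log_running[-1] - log_running[-51]).exp())   # ~= alpha^50 ~= 0.61
```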
Model Details
| Parameter | Value |
|---|---|
| Model | NoesisLab/Spartacus-1B-Instruct |
| Architecture | MonoidForCausalLM |
| Parameters | ~1.34B (tied embeddings) |
| Hidden size | 2048 |
| Intermediate size (MLP) | 8192 |
| Layers | 16 |
| Attention heads | 32 |
| Head dimension | 64 |
| State matrix per head | 64 x 64 = 4096 floats |
| Vocabulary | 128,256 (Llama-3.2 tokenizer) |
| Precision | bfloat16 |
Benchmarks (0-shot)
| Task | Metric | Value | Stderr |
|---|---|---|---|
| ARC-Challenge | acc_norm | 0.3063 | ±0.0135 |
| ARC-Easy | acc | 0.5518 | ±0.0102 |
| HellaSwag | acc_norm | 0.4610 | ±0.0050 |
| PIQA | acc_norm | 0.6915 | ±0.0108 |
| WinoGrande | acc | 0.5225 | ±0.0140 |
Comparison with ~1B Baselines (acc_norm, 0-shot)
| Task | Spartacus-1B-Instruct | TinyLlama-1.1B | Llama 3.2-1B | Mamba-1.4B | RWKV-6-1.6B |
|---|---|---|---|---|---|
| ARC-C | 0.3063 | 0.3268 | ~0.359 | 0.284 | ~0.301 |
| ARC-E | 0.5518 | 0.5547 | ~0.752 | 0.512 | ~0.530 |
| HellaSwag | 0.4610 | 0.4670 | ~0.546 | 0.435 | ~0.450 |
| PIQA | 0.6915 | 0.7210 | ~0.740 | 0.655 | ~0.670 |
| WinoGrande | 0.5225 | 0.5040 | ~0.592 | 0.510 | ~0.515 |
Spartacus achieves competitive performance with sub-quadratic models (Mamba, RWKV) while maintaining O(1) inference time and memory per token. Scores marked with ~ are approximate community-reported values.
Training
Stage 1: General SFT
- Base weights: Transferred from Llama-3.2-1B-Instruct (embeddings, MLP, norms)
- Data: Capybara + smol-smoltalk (general conversation)
- Training: Full-parameter SFT
Stage 2: Reasoning Enhancement
- Data mix: 60% Qwen3-Short-Reasoning + 20% Capybara + 20% smol-smoltalk
- Steps: 2,000
- Learning rate: 2e-5 (cosine schedule, 50 warmup steps)
- Batch size: 8
- Sequence length: 2,048
- Precision: bfloat16
- Optimizer: AdamW (weight decay 0.01, max grad norm 1.0)
The reasoning data uses structured "Thought + Solution" format to strengthen chain-of-thought capabilities while the general data prevents catastrophic forgetting.
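The training scripts are not shipped with the model, but for orientation the Stage 2 hyperparameters above map onto Hugging Face TrainingArguments roughly as follows (a sketch only; dataset mixing, chat formatting, and packing to 2,048 tokens belong to the data pipeline and are omitted):

```python
from transformers import TrainingArguments

# Illustrative mapping of the Stage 2 hyperparameters; not the authors' actual script.
training_args = TrainingArguments(
    output_dir="spartacus-1b-stage2",   # hypothetical output path
    max_steps=2_000,
    per_device_train_batch_size=8,
    learning_rate=2e-5,
    lr_scheduler_type="cosine",
    warmup_steps=50,
    weight_decay=0.01,
    max_grad_norm=1.0,
    bf16=True,
    optim="adamw_torch",
    logging_steps=10,                   # assumed; not specified in the card
)
```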
Parallel Scan Implementation
The monoid_scan_cuda.py module provides a Triton JIT-compiled parallel prefix scan:
- Forward: Sequential scan along T, parallelized across B x H x D on GPU via Triton kernels
- Backward: Reverse-order adjoint scan computes gradients for both values and log-decay gates
- Fallback: Pure PyTorch sequential scan for CPU/MPS
- Auto-dispatch: CUDA -> Triton kernel, otherwise -> PyTorch fallback (see the reference sketch below)
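The recurrence parallelizes because each step is an element (alpha_t, k_t (x) v_t) of a monoid acting on the state as S -> alpha_t * S + k_t (x) v_t, and two steps compose associatively: combining (a1, B1) then (a2, B2) gives (a1*a2, a2*B1 + B2). The sketch below (pure PyTorch, illustrative and independent of the Triton kernels) checks that a tree-shaped combination matches the sequential scan, which is exactly the property a parallel prefix scan exploits:

```python
import torch

def combine(left, right):
    """Associative monoid combine: 'left' is applied to the state first, then 'right'."""
    a1, B1 = left
    a2, B2 = right
    return a1 * a2, a2 * B1 + B2

torch.manual_seed(0)
T, d = 64, 8
alpha = torch.sigmoid(torch.randn(T, 1, 1))        # decay gates
kv = torch.randn(T, d, 1) * torch.randn(T, 1, d)   # k_t (x) v_t outer products

# Sequential scan: the O(T) recurrence run step by step (S_0 = 0)
S_seq = torch.zeros(d, d)
for t in range(T):
    S_seq = alpha[t] * S_seq + kv[t]

# Tree-shaped evaluation: fold halves independently, then merge with `combine`
def fold(lo, hi):
    if hi - lo == 1:
        return alpha[lo], kv[lo]
    mid = (lo + hi) // 2
    return combine(fold(lo, mid), fold(mid, hi))

_, S_tree = fold(0, T)
print(torch.allclose(S_seq, S_tree, atol=1e-4))    # True
```

A real prefix scan additionally returns every intermediate S_t, but it relies on the same associativity, which is what brings training down to O(T) work with logarithmic depth.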
Usage
    from transformers import AutoModelForCausalLM, AutoTokenizer

    # trust_remote_code is required: the MonoidForCausalLM architecture ships with the repo
    model = AutoModelForCausalLM.from_pretrained(
        "NoesisLab/Spartacus-1B-Instruct",
        trust_remote_code=True,
        torch_dtype="bfloat16",
        device_map="auto",
    )
    tokenizer = AutoTokenizer.from_pretrained("NoesisLab/Spartacus-1B-Instruct")

    # Build a prompt with the Llama-3.2 chat template
    messages = [{"role": "user", "content": "Hello!"}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors="pt").to(model.device)

    outputs = model.generate(**inputs, max_new_tokens=512)
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
File Structure
    MonoidForCausalLM.py    # Model architecture (MonoidConfig, MonoidAttention, MonoidForCausalLM)
    monoid_scan_cuda.py     # Triton JIT parallel prefix scan + PyTorch fallback
    model.safetensors       # Model weights (bfloat16)
    config.json             # Model configuration
    tokenizer.json          # Llama-3.2 tokenizer
Citation
    @software{spartacus2025,
      title={Spartacus: Causal Monoid Language Model with O(1) Inference},
      author={NoesisLab},
      year={2025},
      url={https://huggingface.co/NoesisLab/Spartacus-1B-Instruct},
      note={Replaces softmax attention with monoid state compression for constant-time, constant-memory autoregressive generation}
    }
License
Apache 2.0