A prototype model assembled as a mirrored transformer architecture with nested gating (an extra weight in the FFN) and morphological position encoding. The premise: the model architecture creates scaffolding, and different scaffolding leads to different training regimens and capabilities.
Prisma is only viable because it piggybacks on pre-trained tokenizers and their weight-tied embeddings. It decomposes the transformer architecture into symmetric expand and compress phases that share structural weights, connected by a small number of unique middle layers. Information expands from tokens to semantics, then compresses back — like light through a prism.
Key Concepts
Mirrored layers. Each expand layer shares W1 (projection) and W2 (output) weights with its corresponding compress layer. The architecture gets 2N virtual layers of processing from N unique parameter sets. At 357M parameters, Prisma runs 41 virtual layers from ~20 unique weight sets + 1 middle layer.
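The virtual-to-physical mapping can be sketched as follows — a hypothetical helper (not Prisma's actual API) showing how 2N + M virtual layers reuse N mirrored parameter sets plus M unique middle sets:

```python
# Hypothetical sketch of the layer schedule implied by mirroring; the
# function name and (param_set, phase) representation are illustrative.
def layer_schedule(n_mirrored: int, n_middle: int):
    """Return (param_set_index, phase) for each virtual layer."""
    expand = [(i, "expand") for i in range(n_mirrored)]
    middle = [(n_mirrored + i, "middle") for i in range(n_middle)]
    # The compress phase reuses the expand parameter sets in reverse order.
    compress = [(i, "compress") for i in reversed(range(n_mirrored))]
    return expand + middle + compress

schedule = layer_schedule(n_mirrored=20, n_middle=1)
print(len(schedule))                  # 41 virtual layers
print(len({p for p, _ in schedule}))  # 21 unique parameter sets
```

With `n_mirrored=20` and `n_middle=1` this reproduces the 357M configuration: 41 virtual layers from 21 unique weight sets.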
G²LU — Gated-Gated Linear Unit. The gate is itself gated:
Where typical gated transformers have y = W2 @ (W1 @ x * silu(W3 @ x)), Prisma has:
```python
g4 = silu(W4 @ x)        # inner gate
g3 = silu(W3 @ x * g4)   # outer gate, modulated by inner
y  = W2 @ (W1 @ x * g3)  # gated output
```
One gate is a function of the other. This creates quadratic (saddle-surface) decision boundaries instead of linear hyperplanes — each neuron computes a conjunction ("feature A AND feature B") rather than a single threshold. The result is narrow, separated activation channels that resist memorization and tolerate significantly higher learning rates. Part of the parameters saved by mirroring is redistributed as W4.
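A minimal runnable sketch of a G²LU feed-forward block, following the formulation above; the class name, dimensions, and bias-free layout are illustrative assumptions, not Prisma's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative G²LU block: four projections, with the inner gate (w4)
# modulating the outer gate (w3) before the value path (w1) is gated.
class G2LU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # outer gate
        self.w4 = nn.Linear(dim, hidden, bias=False)  # inner gate
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g4 = F.silu(self.w4(x))          # inner gate
        g3 = F.silu(self.w3(x) * g4)     # outer gate, modulated by inner
        return self.w2(self.w1(x) * g3)  # gated output

ffn = G2LU(dim=64, hidden=256)
out = ffn(torch.randn(2, 8, 64))
print(out.shape)  # torch.Size([2, 8, 64])
```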
WoRPE — Word-position Rotary Position Embedding. Dedicates a small subspace of each attention head to encode position within a word (0 = prefix, 1 = second subword, ...). The information is already in the BPE tokenizer's word-boundary markers — WoRPE surfaces it geometrically so the model doesn't have to rediscover it. No new tokenizer required.
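The word-internal position signal WoRPE encodes can be extracted directly from GPT-2-style BPE boundary markers — a hypothetical helper, not Prisma's code:

```python
# Sketch of deriving word-internal positions from "Ġ" word-boundary
# markers (GPT-2-style BPE); SentencePiece uses "▁" analogously.
def word_positions(tokens):
    """0 for a word-initial subword, 1 for the second subword, and so on."""
    positions = []
    pos = 0
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith("Ġ"):  # boundary marker starts a new word
            pos = 0
        else:
            pos += 1
        positions.append(pos)
    return positions

print(word_positions(["Ġun", "believ", "able", "Ġtokens"]))  # [0, 1, 2, 0]
```

WoRPE would then rotate a small dedicated subspace of each attention head by an angle derived from these indices, analogous to standard RoPE over sequence positions.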
Auxiliary skip prediction. An optional second head predicts the token K steps ahead, providing gradient signal that rewards structural representations over local memorization. At K=1, it functions as a dual-supervision regularizer through an untied projection.
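A sketch of how the auxiliary loss could be combined with the main objective, assuming a second head whose logits predict the token K steps ahead; the function and weighting follow the `--aux-skip` / `--aux-weight` flags but are otherwise illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative combined objective: main head predicts t+1, auxiliary
# head predicts t+k, weighted by aux_weight.
def combined_loss(main_logits, aux_logits, tokens, k=2, aux_weight=0.1):
    vocab = main_logits.size(-1)
    main = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1)
    )
    aux = F.cross_entropy(
        aux_logits[:, :-k].reshape(-1, vocab), tokens[:, k:].reshape(-1)
    )
    return main + aux_weight * aux

B, T, V = 2, 16, 100
loss = combined_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                     torch.randint(0, V, (B, T)))
print(loss.item())
```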
Results
~50M scale prototype (WikiText-103, 4 epochs)
| Model | Params | LR | WikiText PPL | LAMBADA |
|---|---|---|---|---|
| Standard SwiGLU | 51M | 1e-4 | 4125 | 0.002 |
| Prisma (G²LU) | 47M | 1e-4 | 2914 | 0.001 |
| Prisma (G²LU + WoRPE) | 51M | 1e-2 | 921 | 0.082 |
The standard model trained for 10 epochs; Prisma (G²LU + WoRPE) is shown at 1 epoch — the point is LR tolerance, not an epoch-matched comparison.
The regularization stack (mirroring + G²LU + WoRPE) enables training at 100x the standard learning rate without instability.
~350M scale prototype — comparison with published models
Prisma 357M trained on ~30B tokens (OpenWebText 20% + FineWeb-Edu 10BT continued training), compared against published models at similar scale.
| Model | Params | Train Data | ARC-C* | ARC-E* | BoolQ | HellaSwag* | LAMBADA | PIQA* | WikiText** | WinoGrande |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 medium | 355M | 40B | 0.250 | 0.436 | 0.586 | 0.394 | 0.430 | 0.664 | 26.75 | 0.531 |
| Baguettotron | 321M | 200B | 0.302 | 0.506 | 0.589 | 0.354 | 0.294 | 0.618 | 30.93 | 0.530 |
| SmolLM-360M | 360M | 600B | 0.359 | 0.640 | 0.550 | 0.536 | 0.455 | 0.715 | 19.49 | 0.570 |
| SmolLM2-360M | 360M | 4000B | 0.381 | 0.681 | 0.617 | 0.431 | 0.532 | 0.718 | 15.67 | 0.586 |
| LFM2-350M | 350M | 10000B | 0.393 | 0.662 | 0.642 | 0.489 | 0.399 | 0.698 | 25.68 | 0.558 |
| Prisma | 357M | 30B | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.40 | 0.506 |
* normalized accuracy · ** word perplexity
Key findings:
- Beats GPT-2 medium on 5/8 benchmarks (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA) with 25% less training data.
- Beats Baguettotron (200B) on 6/8 benchmarks — including PPL — with 7x less data.
- BoolQ 0.620 exceeds all models except LFM2 (10000B) and SmolLM2 (4000B). The anti-memorization properties of G²LU force genuine comprehension instead of statistical shortcuts.
- ARC-Easy 0.548 — the largest absolute gain over GPT-2 medium (+11.2pp). FineWeb-Edu knowledge absorbed efficiently through G²LU's relational features.
- Prisma wins on reasoning benchmarks (ARC, HellaSwag, PIQA, BoolQ). Models trained on 20-300x more data win on content prediction (LAMBADA, PPL). The architecture trades raw memorization for data-efficient knowledge application.
Training progression (~350M)
| Stage | LR | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | PPL |
|---|---|---|---|---|---|---|---|---|
| Standard 336M baseline | 1e-4 | 0.228 | 0.341 | 0.618 | 0.280 | 0.226 | 0.574 | 77.2 |
| Prisma 41L (OWT 20%) | 5e-4 | 0.238 | 0.394 | 0.585 | 0.317 | 0.313 | 0.614 | 44.8 |
| + WoRPE (OWT 20%) | 1e-3 | 0.247 | 0.397 | 0.595 | 0.331 | 0.333 | 0.614 | 43.5 |
| + continued (FineWeb c1) | 1e-3 | 0.249 | 0.434 | 0.601 | 0.333 | 0.312 | 0.626 | 34.7 |
| + continued (FineWeb c2) | 1e-3 | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.4 |
Prisma 357M detailed training charts
Quick Start
Load from HuggingFace
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("y3i12/Prisma", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("y3i12/Prisma", use_fast=False)
```
Note: `use_fast=False` is required. The fast tokenizer for MobileLLM is broken upstream and returns a `bool` instead of a tokenizer object.
Install (for training / development)
```shell
pip install -r Prisma/requirements.txt
```
Train
```shell
# Small Prisma (~47M) on WikiText-103
python -m Prisma.train \
  --arch mirrored --dims 384 --heads 6 --kv-heads 2 --layers 57 --n-middle 1 \
  --tokenizer facebook/MobileLLM-125M \
  --word-rope-dims 8 --word-rope-base 10.0 \
  --data hf:wikitext:wikitext-103-raw-v1:train \
  --epochs 4 --batch-size 32 --context-length 512 \
  --lr 1e-2 --warmup-steps 500 --bf16 --gpu 0
```
```shell
# 324M Prisma on OpenWebText
python -m Prisma.train \
  --gpu 0 --compile --bf16 --arch mirrored \
  --dims 1024 --heads 16 --kv-heads 4 --layers 41 --n-middle 1 \
  --word-rope-dims 8 --word-rope-base 10.0 \
  --tokenizer facebook/MobileLLM-125M \
  --data hf:Bingsu/openwebtext_20p --text-column text \
  --epochs 4 --batch-size 12 --context-length 1024 --grad-accum 42 \
  --lr 1e-3 --warmup-steps 500 \
  --log-every 5 --val-every 1000 --save-every 1000 \
  --checkpoint-dir path/to/checkpoints/your_model/
```
Generate
```shell
python -m Prisma.generate \
  --checkpoint path/to/checkpoints/your_model/checkpoint.pt \
  --prompt "A thought observing itself discovers that it" \
  --max-tokens 256 --temperature 0.8 --top-p 0.9
```
With HuggingFace SafeTensor model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('y3i12/Prisma', trust_remote_code=True)
print(f'Params: {sum(p.numel() for p in model.parameters()):,}')
tokenizer = AutoTokenizer.from_pretrained('y3i12/Prisma', use_fast=False)

model = model.to('cuda').half()
inputs = tokenizer('A thought observing itself discovers that it', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(out[0]))
```
Benchmark
```shell
# Single model
python -m Prisma.bench \
  --checkpoint path/to/checkpoints/your_model/best.pt \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,lambada_openai,piqa,wikitext,winogrande \
  --gpu 0
```
With HuggingFace SafeTensor model and lm_eval:
```shell
lm_eval --model hf \
  --model_args pretrained=y3i12/Prisma,trust_remote_code=True,use_fast_tokenizer=False \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,lambada_openai,piqa,wikitext,winogrande
```
CLI Reference
Architecture
| Flag | Default | Description |
|---|---|---|
| `--arch` | `mirrored` | Architecture: `standard`, `mirrored`, `graft_g2lu` (experimental) |
| `--dims` | 512 | Hidden dimension |
| `--heads` | 8 | Number of attention heads |
| `--kv-heads` | — | KV heads for GQA (omit = MHA) |
| `--layers` | 12 | Total virtual layers (expand + middle + compress) |
| `--n-middle` | 2 | Unique (non-mirrored) middle layers |
Prisma-Specific
| Flag | Default | Description |
|---|---|---|
| `--word-rope-dims` | 0 | Head dims for WoRPE (0 = disabled, try 8) |
| `--word-rope-base` | 10.0 | WoRPE frequency base |
| `--aux-skip` | 0 | Skip-ahead prediction distance (0 = disabled) |
| `--aux-weight` | 0.1 | Weight for auxiliary loss |
| `--no-g2lu` | — | Disable G²LU, use standard SwiGLU in mirrored arch |
Training
| Flag | Default | Description |
|---|---|---|
| `--lr` | 3e-4 | Peak learning rate |
| `--min-lr` | 0.0 | LR floor for cosine schedule |
| `--warmup-steps` | 100 | LR warmup steps |
| `--epochs` | 10 | Training epochs |
| `--batch-size` | 32 | Micro-batch size |
| `--grad-accum` | 1 | Gradient accumulation steps |
| `--context-length` | 512 | Sequence length |
| `--bf16` / `--fp16` | — | Mixed precision |
| `--compile` | — | `torch.compile` the model |
Data
| Flag | Default | Description |
|---|---|---|
| `--data` | — | Path or `hf:dataset_name` |
| `--text-column` | `text` | Column name for HF datasets |
| `--tokenizer` | `gpt2` | Tokenizer name or path |
| `--num-samples` | — | Limit dataset size |
Architecture Details
Full spectral and representation analysis with all plots: ANALYSIS.md
Why Mirroring Works
Mirroring only works due to the additional gate. W3 and W4 specialize to serve different roles despite sharing weights — spectral analysis confirms the gates swap their stable-rank profiles at the architectural midpoint. The order of mirror layers may be rearrangeable, as the gates adapt to whatever representations flow through them.
Why G²LU Works
Standard SwiGLU creates hyperplane decision boundaries — broad, overlapping activation regions. G²LU's nested gate creates saddle surfaces — narrow activation bands with isolation gaps (like a spectral comb filter). This has three effects:
- Anti-memorization. The gate geometry cannot form sharp, input-specific activations. The model is forced toward broad, relational features.
- Higher LR tolerance. Narrow activation bands leave headroom between features. Large gradient updates shift features within their bands without colliding.
- Compositional detection. Each neuron natively computes conjunctions (A AND B), not just thresholds. Might be useful for morphology, syntax, and structural reasoning.
G²LU can be seen as occupying a point between standard GLU (fixed activation, fixed gate) and KAN (fully learned activations): the activation function is fixed (silu), but its effective shape adapts per-input through the nested gate.
Why WoRPE Works
BPE tokenizers already mark word boundaries (Ġ for GPT-2, ▁ for SentencePiece). WoRPE surfaces this information geometrically in a dedicated subspace of the rotary embedding, so the model gets word-internal position for free instead of rediscovering it from attention patterns. Requires G²LU to exploit effectively — the saddle surfaces compute morphological conjunctions ("position-0 AND prefix-pattern") that single gates cannot.
Why Everything Works Together
The optimization landscape of this architecture is substantially more complex than a standard transformer — shared weights must serve both directions, nested gates must coordinate, and the hourglass bottleneck constrains information flow. This appears to be only tractable when anchored by pre-trained, weight-tied embeddings that provide a stable coordinate system. The frozen embeddings give the model fixed reference geometry, allowing convergence despite the architectural complexity.
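The anchoring described above can be sketched as weight-tied input/output embeddings, initialized from a pretrained model and then frozen; sizes here are illustrative, and in practice the weights would be copied from the facebook/MobileLLM-125M embedding table before freezing:

```python
import torch.nn as nn

# Illustrative weight tying + freezing: the embedding table and the LM
# head share one tensor, which is excluded from gradient updates so it
# serves as a fixed coordinate system.
vocab_size, dim = 32_000, 1024
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight       # weight tying: one shared tensor
embed.weight.requires_grad_(False)  # frozen reference geometry
```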
File Map
```
Prisma/
  config.py          — CLI arguments, presets, CircuitConfig
  layers.py          — RMSNorm, RoPE, WoRPE, CausalAttention, SwiGLU
  model.py           — CircuitTransformer (standard baseline)
  mirrored.py        — MirroredTransformer, G²LU, MirroredBlock
  train.py           — Training loop, LR schedule, checkpointing
  data.py            — MemmapDataset, parallel tokenization, HF/text loading
  generate.py        — Text generation with KV caching
  bench.py           — Benchmark runner and comparison tables
  lm_eval_wrapper.py — EleutherAI lm-eval harness integration
  graft_g2lu.py      — Surgical G²LU upgrade for pretrained models (experimental/untested)
  scripts/           — Analysis scripts
```
Origin
Prisma grew from interpretability research on layer grafting in Llama 3.2 (write-up in progress). That work suggests one way transformers may self-organize to process language: a mirrored structure that expands from tokens to semantics, then compresses back — inviting the analogy of a biconvex lens with fractures or polarizing filters within its body. If the two halves are structurally symmetric, they can share weights. The gate (the fractures/polarizing filters) then becomes the minimum surgical unit for changing behavior. With shared weights, a single gate weightset is insufficient, which raised the question of how to make two gates collaborate efficiently.
G²LU emerged from the observation that for a pair of gates to be expressive and atomic, one gate needs to be a function of the other.
WoRPE emerged from noticing that tokenizers already carry word structure while positional encodings ignore it — surfacing this hint to the model speeds convergence during training.
The architecture is a processing engine that plugs into pretrained tokenizer embeddings. The tokenizer is load-bearing infrastructure — Prisma operates within a pre-existing coordinate system.
Developer Notes
This model is the outcome of a proof of concept by a single individual with limited resources; further investigation, training, and testing are being conducted slowly as time and conditions allow.
The architecture was only fully trained on top of the facebook/MobileLLM-125M tokenizer and its weight-tied embeddings. It may not work as expected with untied embeddings, and it is highly likely that training a model with this architecture without a pre-trained tokenizer is impossible.
Different arrangements of the architecture (varying middle layer count, mirror depth, width) would likely produce different results. Only this setup — with 1 middle layer — was tested, as a validation of whether the architecture works at all. The extreme case was chosen deliberately: if the bottleneck configuration most prone to failure still produces competitive results, less constrained configurations should too.
Factorized embedding dimensions and an intermediate down-projection before the output head were attempted; neither produced useful results.
It is completely unknown whether the architecture benefits larger models (1B+) — observations suggest it might.
Training
- Architecture:
  - 41 virtual layers: 20 mirrored weight sets (shared W1 and W2) + 1 unique middle layer
  - 1024 dims
  - 16 GQA heads, 4 KV heads (4:1)
  - vocab size 32k
  - RoPE + WoRPE + G²LU
- Pretraining tokens: 30B
- Precision: bfloat16
- Tokenizer/Embeddings: facebook/MobileLLM-125M
- Hardware: 1× H100
Disclaimer
This is a research model; it has not been thoroughly tested for synthesis and coherence quality, as its size is somewhat limiting. Use it at your own risk.
Citation
```bibtex
@misc{ivatchkovitch2026prisma,
  title={Prisma: Interpretability-Inspired Mirrored Transformer Architecture},
  author={Yuri Ivatchkovitch},
  year={2026},
  howpublished={\url{https://huggingface.co/y3i12/Prisma}},
}
```