A prototype model assembled as a mirrored transformer architecture with nested gating (an extra weight in the FFN) and morphological position encoding. The premise: the model architecture creates scaffolding, and different scaffolding leads to different training regimens and capabilities.
Prisma is only viable because it piggybacks on pre-trained tokenizers and their weight-tied embeddings. It decomposes the transformer architecture into symmetric expand and compress phases that share structural weights, connected by a small number of unique middle layers. Information expands from tokens to semantics, then compresses back — like light through a prism.
Key Concepts
Mirrored layers. Each expand layer shares W1 (projection) and W2 (output) weights with its corresponding compress layer. The architecture gets 2N virtual layers of processing from N unique parameter sets. At 357M parameters, Prisma runs 41 virtual layers from ~20 unique weight sets + 1 middle layer.
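The virtual-to-physical mapping can be sketched as follows — a hypothetical helper (not Prisma's actual API) showing how 2N + M virtual layers reuse N mirrored parameter sets plus M unique middle sets:

```python
# Hypothetical sketch of the layer schedule implied by mirroring; the
# function name and (param_set, phase) representation are illustrative.
def layer_schedule(n_mirrored: int, n_middle: int):
    """Return (param_set_index, phase) for each virtual layer."""
    expand = [(i, "expand") for i in range(n_mirrored)]
    middle = [(n_mirrored + i, "middle") for i in range(n_middle)]
    # The compress phase reuses the expand parameter sets in reverse order.
    compress = [(i, "compress") for i in reversed(range(n_mirrored))]
    return expand + middle + compress

schedule = layer_schedule(n_mirrored=20, n_middle=1)
print(len(schedule))                  # 41 virtual layers
print(len({p for p, _ in schedule}))  # 21 unique parameter sets
```

With `n_mirrored=20` and `n_middle=1` this reproduces the 357M configuration: 41 virtual layers from 21 unique weight sets.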
G²LU — Gated-Gated Linear Unit. The gate is itself gated:
Where typical gated transformers have y = W2 @ (W1 @ x * silu(W3 @ x)), Prisma has:
```python
g4 = silu(W4 @ x)        # inner gate
g3 = silu(W3 @ x * g4)   # outer gate, modulated by inner
y  = W2 @ (W1 @ x * g3)  # gated output
```
One gate is a function of the other. This creates quadratic (saddle-surface) decision boundaries instead of linear hyperplanes — each neuron computes a conjunction ("feature A AND feature B") rather than a single threshold. The result is narrow, separated activation channels that resist memorization and tolerate significantly higher learning rates. Part of the parameters saved by mirroring is redistributed as W4.
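A minimal runnable sketch of a G²LU feed-forward block, following the formulation above; the class name, dimensions, and bias-free layout are illustrative assumptions, not Prisma's actual implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative G²LU block: four projections, with the inner gate (w4)
# modulating the outer gate (w3) before the value path (w1) is gated.
class G2LU(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden, bias=False)  # value projection
        self.w3 = nn.Linear(dim, hidden, bias=False)  # outer gate
        self.w4 = nn.Linear(dim, hidden, bias=False)  # inner gate
        self.w2 = nn.Linear(hidden, dim, bias=False)  # output projection

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        g4 = F.silu(self.w4(x))          # inner gate
        g3 = F.silu(self.w3(x) * g4)     # outer gate, modulated by inner
        return self.w2(self.w1(x) * g3)  # gated output

ffn = G2LU(dim=64, hidden=256)
out = ffn(torch.randn(2, 8, 64))
print(out.shape)  # torch.Size([2, 8, 64])
```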
WoRPE — Word-position Rotary Position Embedding. Dedicates a small subspace of each attention head to encode position within a word (0 = prefix, 1 = second subword, ...). The information is already in the BPE tokenizer's word-boundary markers — WoRPE surfaces it geometrically so the model doesn't have to rediscover it. No new tokenizer required.
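The word-internal position signal WoRPE encodes can be extracted directly from GPT-2-style BPE boundary markers — a hypothetical helper, not Prisma's code:

```python
# Sketch of deriving word-internal positions from "Ġ" word-boundary
# markers (GPT-2-style BPE); SentencePiece uses "▁" analogously.
def word_positions(tokens):
    """0 for a word-initial subword, 1 for the second subword, and so on."""
    positions = []
    pos = 0
    for i, tok in enumerate(tokens):
        if i == 0 or tok.startswith("Ġ"):  # boundary marker starts a new word
            pos = 0
        else:
            pos += 1
        positions.append(pos)
    return positions

print(word_positions(["Ġun", "believ", "able", "Ġtokens"]))  # [0, 1, 2, 0]
```

WoRPE would then rotate a small dedicated subspace of each attention head by an angle derived from these indices, analogous to standard RoPE over sequence positions.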
Auxiliary skip prediction. An optional second head predicts the token K steps ahead, providing gradient signal that rewards structural representations over local memorization. At K=1, it functions as a dual-supervision regularizer through an untied projection.
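A sketch of how the auxiliary loss could be combined with the main objective, assuming a second head whose logits predict the token K steps ahead; the function and weighting follow the `--aux-skip` / `--aux-weight` flags but are otherwise illustrative:

```python
import torch
import torch.nn.functional as F

# Illustrative combined objective: main head predicts t+1, auxiliary
# head predicts t+k, weighted by aux_weight.
def combined_loss(main_logits, aux_logits, tokens, k=2, aux_weight=0.1):
    vocab = main_logits.size(-1)
    main = F.cross_entropy(
        main_logits[:, :-1].reshape(-1, vocab), tokens[:, 1:].reshape(-1)
    )
    aux = F.cross_entropy(
        aux_logits[:, :-k].reshape(-1, vocab), tokens[:, k:].reshape(-1)
    )
    return main + aux_weight * aux

B, T, V = 2, 16, 100
loss = combined_loss(torch.randn(B, T, V), torch.randn(B, T, V),
                     torch.randint(0, V, (B, T)))
print(loss.item())
```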
Results
~50M scale prototype (WikiText-103, 4 epochs)
| Model | Params | LR | WikiText PPL | LAMBADA |
|---|---|---|---|---|
| Standard SwiGLU | 51M | 1e-4 | 4125 | 0.002 |
| Prisma (G²LU) | 47M | 1e-4 | 2914 | 0.001 |
| Prisma (G²LU + WoRPE) | 51M | 1e-2 | 921 | 0.082 |
The standard model trained for 10 epochs; Prisma (G²LU + WoRPE) is shown at 1 epoch — the point is LR tolerance, not an epoch-matched comparison.
The regularization stack (mirroring + G²LU + WoRPE) enables training at 100x the standard learning rate without instability.
~350M scale prototype — comparison with published models
Prisma 357M trained on ~30B tokens (OpenWebText 20% + FineWeb-Edu 10BT continued training), compared against published models at similar scale.
| Model | Params | Train Data | ARC-C* | ARC-E* | BoolQ | HellaSwag* | LAMBADA | PIQA* | WikiText** | WinoGrande |
|---|---|---|---|---|---|---|---|---|---|---|
| GPT-2 medium | 355M | 40B | 0.250 | 0.436 | 0.586 | 0.394 | 0.430 | 0.664 | 26.75 | 0.531 |
| Baguettotron | 321M | 200B | 0.302 | 0.506 | 0.589 | 0.354 | 0.294 | 0.618 | 30.93 | 0.530 |
| SmolLM-360M | 360M | 600B | 0.359 | 0.640 | 0.550 | 0.536 | 0.455 | 0.715 | 19.49 | 0.570 |
| SmolLM2-360M | 360M | 4000B | 0.381 | 0.681 | 0.617 | 0.431 | 0.532 | 0.718 | 15.67 | 0.586 |
| LFM2-350M | 350M | 10000B | 0.393 | 0.662 | 0.642 | 0.489 | 0.399 | 0.698 | 25.68 | 0.558 |
| Prisma | 357M | 30B | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.40 | 0.506 |
* normalized accuracy · ** word perplexity
Key findings:
- Beats GPT-2 medium on 5/8 benchmarks (ARC-C, ARC-E, BoolQ, HellaSwag, PIQA) with 25% less training data.
- Beats Baguettotron (200B) on 6/8 benchmarks — including PPL — with 7x less data.
- BoolQ 0.620 exceeds all models except LFM2 (10000B) and SmolLM2 (4000B). The anti-memorization properties of G²LU force genuine comprehension instead of statistical shortcuts.
- ARC-Easy 0.548 — the largest absolute gain over GPT-2 medium (+11.2pp). FineWeb-Edu knowledge absorbed efficiently through G²LU's relational features.
- Prisma wins on reasoning benchmarks (ARC, HellaSwag, PIQA, BoolQ). Models trained on 20-300x more data win on content prediction (LAMBADA, PPL). The architecture trades raw memorization for data-efficient knowledge application.
Training progression (~350M)
| Stage | LR | ARC-C | ARC-E | BoolQ | HellaSwag | LAMBADA | PIQA | PPL |
|---|---|---|---|---|---|---|---|---|
| Standard 336M baseline | 1e-4 | 0.228 | 0.341 | 0.618 | 0.280 | 0.226 | 0.574 | 77.2 |
| Prisma 41L (OWT 20%) | 5e-4 | 0.238 | 0.394 | 0.585 | 0.317 | 0.313 | 0.614 | 44.8 |
| + WoRPE (OWT 20%) | 1e-3 | 0.247 | 0.397 | 0.595 | 0.331 | 0.333 | 0.614 | 43.5 |
| + continued (FineWeb c1) | 1e-3 | 0.249 | 0.434 | 0.601 | 0.333 | 0.312 | 0.626 | 34.7 |
| + continued (FineWeb c2) | 1e-3 | 0.290 | 0.548 | 0.620 | 0.427 | 0.362 | 0.670 | 27.4 |
Prisma 357M detailed training charts
Quick Start
Load from HuggingFace
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("y3i12/Prisma", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained("y3i12/Prisma", use_fast=False)
```
Note: `use_fast=False` is required. The fast tokenizer for MobileLLM is broken upstream and returns a `bool` instead of a tokenizer object.
Install (for training / development)
```shell
pip install -r Prisma/requirements.txt
```
Train
```shell
# Small Prisma (~47M) on WikiText-103
python -m Prisma.train \
  --arch mirrored --dims 384 --heads 6 --kv-heads 2 --layers 57 --n-middle 1 \
  --tokenizer facebook/MobileLLM-125M \
  --word-rope-dims 8 --word-rope-base 10.0 \
  --data hf:wikitext:wikitext-103-raw-v1:train \
  --epochs 4 --batch-size 32 --context-length 512 \
  --lr 1e-2 --warmup-steps 500 --bf16 --gpu 0
```
```shell
# 324M Prisma on OpenWebText
python -m Prisma.train \
  --gpu 0 --compile --bf16 --arch mirrored \
  --dims 1024 --heads 16 --kv-heads 4 --layers 41 --n-middle 1 \
  --word-rope-dims 8 --word-rope-base 10.0 \
  --tokenizer facebook/MobileLLM-125M \
  --data hf:Bingsu/openwebtext_20p --text-column text \
  --epochs 4 --batch-size 12 --context-length 1024 --grad-accum 42 \
  --lr 1e-3 --warmup-steps 500 \
  --log-every 5 --val-every 1000 --save-every 1000 \
  --checkpoint-dir path/to/checkpoints/your_model/
```
Generate
```shell
python -m Prisma.generate \
  --checkpoint path/to/checkpoints/your_model/checkpoint.pt \
  --prompt "A thought observing itself discovers that it" \
  --max-tokens 256 --temperature 0.8 --top-p 0.9
```
With HuggingFace SafeTensor model:
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained('y3i12/Prisma', trust_remote_code=True)
print(f'Params: {sum(p.numel() for p in model.parameters()):,}')
tokenizer = AutoTokenizer.from_pretrained('y3i12/Prisma', use_fast=False)

model = model.to('cuda').half()
inputs = tokenizer('A thought observing itself discovers that it', return_tensors='pt').to('cuda')
with torch.no_grad():
    out = model.generate(inputs.input_ids, max_new_tokens=100, do_sample=True, temperature=0.8, top_p=0.9)
print(tokenizer.decode(out[0]))
```
Benchmark
```shell
# Single model
python -m Prisma.bench \
  --checkpoint path/to/checkpoints/your_model/best.pt \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,lambada_openai,piqa,wikitext,winogrande \
  --gpu 0
```
With HuggingFace SafeTensor model and lm_eval:
```shell
lm_eval --model hf \
  --model_args pretrained=y3i12/Prisma,trust_remote_code=True,use_fast_tokenizer=False \
  --tasks arc_challenge,arc_easy,boolq,hellaswag,lambada_openai,piqa,wikitext,winogrande
```
CLI Reference
Architecture
| Flag | Default | Description |
|---|---|---|
| `--arch` | `mirrored` | Architecture: `standard`, `mirrored`, `graft_g2lu` (experimental) |
| `--dims` | 512 | Hidden dimension |
| `--heads` | 8 | Number of attention heads |
| `--kv-heads` | — | KV heads for GQA (omit = MHA) |
| `--layers` | 12 | Total virtual layers (expand + middle + compress) |
| `--n-middle` | 2 | Unique (non-mirrored) middle layers |
Prisma-Specific
| Flag | Default | Description |
|---|---|---|
| `--word-rope-dims` | 0 | Head dims for WoRPE (0 = disabled, try 8) |
| `--word-rope-base` | 10.0 | WoRPE frequency base |
| `--aux-skip` | 0 | Skip-ahead prediction distance (0 = disabled) |
| `--aux-weight` | 0.1 | Weight for auxiliary loss |
| `--no-g2lu` | — | Disable G²LU, use standard SwiGLU in mirrored arch |
Training
| Flag | Default | Description |
|---|---|---|
| `--lr` | 3e-4 | Peak learning rate |
| `--min-lr` | 0.0 | LR floor for cosine schedule |
| `--warmup-steps` | 100 | LR warmup steps |
| `--epochs` | 10 | Training epochs |
| `--batch-size` | 32 | Micro-batch size |
| `--grad-accum` | 1 | Gradient accumulation steps |
| `--context-length` | 512 | Sequence length |
| `--bf16` / `--fp16` | — | Mixed precision |
| `--compile` | — | `torch.compile` the model |
Data
| Flag | Default | Description |
|---|---|---|
| `--data` | — | Path or `hf:dataset_name` |
| `--text-column` | `text` | Column name for HF datasets |
| `--tokenizer` | `gpt2` | Tokenizer name or path |
| `--num-samples` | — | Limit dataset size |
Architecture Details
Full spectral and representation analysis with all plots: ANALYSIS.md
Why Mirroring Works
Mirroring only works due to the additional gate. W3 and W4 specialize to serve different roles despite sharing weights — spectral analysis confirms the gates swap their stable-rank profiles at the architectural midpoint. The order of mirror layers may be rearrangeable, as the gates adapt to whatever representations flow through them.
Why G²LU Works
Standard SwiGLU creates hyperplane decision boundaries — broad, overlapping activation regions. G²LU's nested gate creates saddle surfaces — narrow activation bands with isolation gaps (like a spectral comb filter). This has three effects:
- Anti-memorization. The gate geometry cannot form sharp, input-specific activations. The model is forced toward broad, relational features.
- Higher LR tolerance. Narrow activation bands leave headroom between features. Large gradient updates shift features within their bands without colliding.
- Compositional detection. Each neuron natively computes conjunctions (A AND B), not just thresholds. Might be useful for morphology, syntax, and structural reasoning.
G²LU can be seen as occupying a point between standard GLU (fixed activation, fixed gate) and KAN (fully learned activations): the activation function is fixed (silu), but its effective shape adapts per-input through the nested gate.
Why WoRPE Works
BPE tokenizers already mark word boundaries (Ġ for GPT-2, ▁ for SentencePiece). WoRPE surfaces this information geometrically in a dedicated subspace of the rotary embedding, so the model gets word-internal position for free instead of rediscovering it from attention patterns. Requires G²LU to exploit effectively — the saddle surfaces compute morphological conjunctions ("position-0 AND prefix-pattern") that single gates cannot.
Why Everything Works Together
The optimization landscape of this architecture is substantially more complex than a standard transformer — shared weights must serve both directions, nested gates must coordinate, and the hourglass bottleneck constrains information flow. This appears to be only tractable when anchored by pre-trained, weight-tied embeddings that provide a stable coordinate system. The frozen embeddings give the model fixed reference geometry, allowing convergence despite the architectural complexity.
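The anchoring described above can be sketched as weight-tied input/output embeddings, initialized from a pretrained model and then frozen; sizes here are illustrative, and in practice the weights would be copied from the facebook/MobileLLM-125M embedding table before freezing:

```python
import torch.nn as nn

# Illustrative weight tying + freezing: the embedding table and the LM
# head share one tensor, which is excluded from gradient updates so it
# serves as a fixed coordinate system.
vocab_size, dim = 32_000, 1024
embed = nn.Embedding(vocab_size, dim)
lm_head = nn.Linear(dim, vocab_size, bias=False)
lm_head.weight = embed.weight       # weight tying: one shared tensor
embed.weight.requires_grad_(False)  # frozen reference geometry
```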
File Map
```
Prisma/
  config.py          — CLI arguments, presets, CircuitConfig
  layers.py          — RMSNorm, RoPE, WoRPE, CausalAttention, SwiGLU
  model.py           — CircuitTransformer (standard baseline)
  mirrored.py        — MirroredTransformer, G²LU, MirroredBlock
  train.py           — Training loop, LR schedule, checkpointing
  data.py            — MemmapDataset, parallel tokenization, HF/text loading
  generate.py        — Text generation with KV caching
  bench.py           — Benchmark runner and comparison tables
  lm_eval_wrapper.py — EleutherAI lm-eval harness integration
  graft_g2lu.py      — Surgical G²LU upgrade for pretrained models (experimental/untested)
  scripts/           — Analysis scripts
```
Origin
Prisma grew from interpretability research on layer grafting in Llama 3.2 (write-up in progress). That work suggests one way transformers may self-organize to process language: a mirrored structure that expands from tokens to semantics, then compresses back — inviting the analogy of a biconvex lens with fractures or polarizing filters within its body. If the two halves are structurally symmetric, they can share weights. The gate (the fractures/polarizing filters) then becomes the minimum surgical unit for changing behavior. With shared weights, a single gate weightset is insufficient, which raised the question of how to make two gates collaborate efficiently.
G²LU emerged from the observation that for a pair of gates to be expressive and atomic, one gate needs to be a function of the other.
WoRPE emerged from noticing that tokenizers already carry word structure while positional encodings ignore it — surfacing this hint to the model speeds convergence during training.
The architecture is a processing engine that plugs into pretrained tokenizer embeddings. The tokenizer is load-bearing infrastructure — Prisma operates within a pre-existing coordinate system.
Developer Notes
This model is the outcome of a proof of concept by a single individual with limited resources; further investigation, training, and testing are being conducted slowly as time and conditions allow.
The architecture was only fully trained on top of the facebook/MobileLLM-125M tokenizer and its weight-tied embeddings. It may not work as expected with untied embeddings, and it is highly likely that training a model with this architecture without a pre-trained tokenizer is impossible.
Different arrangements of the architecture (varying middle layer count, mirror depth, width) would likely produce different results. Only this setup — with 1 middle layer — was tested, as a validation of whether the architecture works at all. The extreme case was chosen deliberately: if the bottleneck configuration most prone to failure still produces competitive results, less constrained configurations should too.
Factorized embedding dimensions and an intermediate down-projection before the output head were attempted; neither produced useful results.
It is completely unknown whether the architecture benefits larger models (1B+) — observations suggest it might.
Training
- Architecture:
  - 41 virtual layers: 20 mirrored weight sets (shared W1 and W2) + 1 unique middle layer
  - 1024 dims
  - 16 GQA heads, 4 KV heads (4:1)
  - vocab size 32k
  - RoPE + WoRPE + G²LU
- Pretraining tokens: 30B
- Precision: bfloat16
- Tokenizer/Embeddings: facebook/MobileLLM-125M
- Hardware: 1× H100
Disclaimer
This is a research model; it has not been thoroughly tested for synthesis and coherence quality, as its size is somewhat limiting. Use it at your own risk.
Citation
```bibtex
@misc{ivatchkovitch2026prisma,
  title={Prisma: Interpretability-Inspired Mirrored Transformer Architecture},
  author={Yuri Ivatchkovitch},
  year={2026},
  howpublished={\url{https://huggingface.co/y3i12/Prisma}},
}
```