Training on RTX 5090

A ~1M Parameter Model
with 2K Context

TinyMemoryLM is a Mythos-style recurrent-depth transformer trained on an RTX 5090. It features a Prelude → shared recurrent block → Coda architecture with depth-wise LoRA, SleepGate periodic memory consolidation, and latent thinking layers on the larger tiers. Small model, surprisingly stubborn.

Want to help test new architectures before they ship? AIExpermentLab is our public playground where anyone can contribute — open an issue to suggest a new architecture or open a PR to help build it. All experiments feed into what eventually becomes a released model.

Contribute on GitHub →

In Partnership With

Our Competitors

These are the labs we're watching. The ones pushing the envelope. The ones we hope never exceed our compute budget.

CromIA · Watching
LH Tech AI · Watching
Harley ML · Watching
SupraLabs · Watching

Know another competitor we should be monitoring? Open a community discussion.

TRAINING THREE MODEL TIERS

Glint ~1M params · Live | Shard ~50M params · In Training | Prism ~100M params · In Training

Download CompactAI Studio

Run our AI models locally on your machine. Chat with models, browse available models, and download them for offline use.

Built with Electron.

~1M
Parameters
2K
Context Length
4+1+4
Prelude+Recurrent+Coda
4
Attention Heads
128
Model Dimension
160
FFN Dimension

Architecture Features

A fresh take on the transformer architecture.

Mythos Recurrent-Depth Architecture

Prelude layers → a single shared recurrent block (looped N times with depth-wise LoRA) → Coda layers. Glint runs 4 loops; Shard and Prism run 8. LoRA adapters inject per-loop variation into the shared block so each pass sees different weights without adding full layer parameters.
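
Below is a minimal PyTorch sketch of the looping idea, under stated assumptions: attention is omitted, the norm is a parameter-free RMSNorm stand-in, and class names like `RecurrentDepthBlock` and `DepthLoRA` are illustrative rather than the project's real API.

```python
import torch
import torch.nn as nn

def rms_norm(x, eps=1e-6):
    # parameter-free RMSNorm stand-in for the model's real norm layers
    return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + eps)

class DepthLoRA(nn.Module):
    """Rank-8 adapter that injects a per-loop delta on top of the shared weights."""
    def __init__(self, dim, rank=8):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)
        self.up = nn.Linear(rank, dim, bias=False)
        nn.init.zeros_(self.up.weight)  # zero-init: every loop starts out as the plain shared block

    def forward(self, x):
        return self.up(self.down(x))

class RecurrentDepthBlock(nn.Module):
    """One shared block applied n_loops times; only the LoRA adapters differ per loop."""
    def __init__(self, dim=128, ffn_dim=160, n_loops=4, rank=8):
        super().__init__()
        # the shared FFN is enough to show the weight sharing; attention is left out for brevity
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.SiLU(), nn.Linear(ffn_dim, dim))
        self.loras = nn.ModuleList(DepthLoRA(dim, rank) for _ in range(n_loops))

    def forward(self, x):
        for lora in self.loras:  # Glint: 4 loops, Shard/Prism: 8
            h = rms_norm(x)
            x = x + self.ffn(h) + lora(h)
        return x

print(RecurrentDepthBlock()(torch.randn(2, 16, 128)).shape)  # torch.Size([2, 16, 128])
```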

SleepGate (FANT-style Memory)

A fixed-capacity slot memory buffer with a zero-initialized output gate. Every N steps the model consolidates memory: similar entries merge (threshold 0.99 for Glint), stale entries are evicted by age. Glint uses 64 memory slots and 4 heads. TRIM-KV retention eviction is disabled at this scale; age-based circular overwrite is sufficient.
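
A rough sketch of what a SleepGate-style buffer could look like, assuming the behavior described above: a fixed number of slots, a zero-initialized output gate so memory contributes nothing until trained, and a periodic consolidation pass that merges near-duplicate slots and evicts stale ones by age. Names (`SleepGateMemory`, `consolidate`, `max_age`) are assumptions, not the project's actual API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SleepGateMemory(nn.Module):
    def __init__(self, dim=128, slots=64, merge_threshold=0.99):
        super().__init__()
        self.register_buffer("mem", torch.zeros(slots, dim))
        self.register_buffer("age", torch.zeros(slots, dtype=torch.long))
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized output gate
        self.merge_threshold = merge_threshold

    def read(self, x):
        # x: (batch, seq, dim); attend over memory slots, then gate the readout
        attn = torch.softmax(x @ self.mem.T / x.size(-1) ** 0.5, dim=-1)
        return x + torch.tanh(self.gate) * (attn @ self.mem)

    @torch.no_grad()
    def consolidate(self, max_age=1024):
        # called every N training steps
        self.age += 1
        m = F.normalize(self.mem, dim=-1)
        sim = m @ m.T
        sim.fill_diagonal_(0)
        # merge any slot pair whose cosine similarity exceeds the threshold (0.99 for Glint)
        i, j = (sim > self.merge_threshold).nonzero(as_tuple=True)
        for a, b in zip(i[i < j].tolist(), j[i < j].tolist()):
            self.mem[a] = 0.5 * (self.mem[a] + self.mem[b])
            self.mem[b].zero_()  # freed slot, open for overwrite
            self.age[b] = 0
        # age-based eviction: clear slots that have gone stale
        self.mem[self.age > max_age] = 0
```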

Latent Thinking Layers

COCONUT-style continuous chain-of-thought: dedicated global-attention transformer blocks that run after the main stack. Shard gets 6 think layers; Prism gets 7. A depth incentive penalizes high cosine similarity between think-block input and output, pressuring the model to actually transform its representations rather than pass through lazily.
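
A hedged sketch of that depth incentive, assuming it is simply an extra loss term proportional to the input/output cosine similarity of a think block (the function name and the 0.05 weight are illustrative, not confirmed values):

```python
import torch
import torch.nn.functional as F

def depth_incentive(x_in: torch.Tensor, x_out: torch.Tensor, weight: float = 0.05):
    # x_in, x_out: (batch, seq, dim) hidden states before and after a think block.
    # High similarity means the block barely transformed anything, so it costs more.
    cos = F.cosine_similarity(x_in, x_out, dim=-1)  # (batch, seq)
    return weight * cos.mean()

# illustrative use inside a training step:
# h_in = hidden
# hidden = think_block(hidden)
# loss = ce_loss + depth_incentive(h_in, hidden)
```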

RTX 5090 Optimized

Tuned for RTX 5090 with flash attention, bf16 mixed precision, and auto batch tuning targeting 95% VRAM utilization. NVFP4 block-scaling quantization via TransformerEngine for Shard and Prism. Glint runs without quantization or gradient checkpointing — stability over speed.
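
As a sketch of what "auto batch tuning targeting 95% VRAM utilization" could mean in practice, here is a simple doubling search over batch sizes on one GPU; the search strategy and function names are assumptions, not the project's actual tuner.

```python
import torch

def find_batch_size(step_fn, start=1, max_bs=4096, target_util=0.95):
    """Double the batch size until peak VRAM use reaches the target (or OOM), then back off."""
    total = torch.cuda.get_device_properties(0).total_memory
    bs, best = start, start
    while bs <= max_bs:
        try:
            torch.cuda.empty_cache()
            torch.cuda.reset_peak_memory_stats()
            step_fn(bs)  # one forward/backward pass at this batch size,
                         # typically wrapped in torch.autocast("cuda", dtype=torch.bfloat16)
            best = bs
            if torch.cuda.max_memory_allocated() / total >= target_util:
                break  # close enough to the 95% target
            bs *= 2
        except RuntimeError as e:  # CUDA OOM surfaces as a RuntimeError
            if "out of memory" not in str(e):
                raise
            torch.cuda.empty_cache()
            break
    return best
```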

Looping Regularization

OpenMythos-style protection against loop collapse. A regularization loss penalizes cosine similarity between the prelude and coda block outputs, preventing the recurrent block from degenerating to a no-op. Lambda 0.05 for Glint, 0.08 for Shard, 0.1 for Prism — scaled to roughly 2% of typical CE loss.
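
A minimal sketch of how this regularizer could fold into the training loss, using the quoted lambdas (0.05 / 0.08 / 0.1); the function and variable names are illustrative.

```python
import torch.nn.functional as F

LOOP_LAMBDA = {"glint": 0.05, "shard": 0.08, "prism": 0.1}

def training_loss(logits, targets, h_prelude, h_coda, tier="glint"):
    # logits: (batch, seq, vocab), targets: (batch, seq)
    ce = F.cross_entropy(logits.flatten(0, 1), targets.flatten())
    # penalize high prelude/coda similarity so the looped block cannot collapse to a no-op
    loop_reg = F.cosine_similarity(h_prelude, h_coda, dim=-1).mean()
    return ce + LOOP_LAMBDA[tier] * loop_reg
```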

The Architecture

Prelude processes input. Shared recurrent block loops with LoRA adapters. Coda cleans up. SleepGate remembers the important parts.

Input Embedding: token + positional (RoPE)
Prelude ×4: RMSNorm, GQA, SwiGLU FFN
Recurrent Block ×4 loops: shared weights + depth-wise LoRA (rank 8)
Coda ×4: RMSNorm, GQA, SwiGLU FFN
Output Head: tied embeddings
d_model: 128
heads / kv: 4 / 2
ffn_dim: 160
recurrent_loops: 4
sleep_gate_cap: 64 slots
seq_len: 2048
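
For reference, the Glint hyperparameters above collected into a single config; the field names are illustrative, only the values come from this page (the learning rate is from the Glint tier card below).

```python
from dataclasses import dataclass

@dataclass
class GlintConfig:
    d_model: int = 128
    n_heads: int = 4
    n_kv_heads: int = 2
    ffn_dim: int = 160
    prelude_layers: int = 4
    recurrent_loops: int = 4
    coda_layers: int = 4
    lora_rank: int = 8
    sleep_gate_slots: int = 64
    seq_len: int = 2048
    lr: float = 3e-4
```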

Model Series

Three tiers. Different sizes, same architecture family. We named them ourselves and we are quite pleased about it.

Glint ~1M params

Lightweight and experimental. Updated frequently. The scrappy underdog with SleepGate memory.

dim 128 · arch 4+1+4 · recurrent loops 4 · ffn_dim 160 · context 2,048 · lr 3e-4
Shard ~50M params

Balanced and stable. 6 latent thinking layers. The responsible middle child who actually thinks.

dim 512 · logical layers 18 · loops 8 · ffn_dim 1,536 · think layers 6 · lr 1e-4
Prism ~100M params

Maximum quality. 7 latent thinking layers. The overachiever who never sleeps and actually does the work.

dim 768 · logical layers 16 · loops 8 · ffn_dim 1,920 · think layers 7 · lr 1.6e-4

Flare-TTS

A text-to-speech model built and trained by our partner LH Tech AI. 28M parameters. Available now on Hugging Face.

Flare-TTS-28M

A compact TTS model trained from scratch by LH Tech AI, released under the CompactAI-O org. 28M parameters. Built to be small, fast, and usable on consumer hardware. Credit goes entirely to LH Tech AI for the build and training.

Built by LH Tech AI · Hosted under CompactAI-O · 28M parameters

Sample Output

What happens when you train on English text and hope for the best.

Glint — live sample
You: Write a haiku about neural networks
AI: It seems like there are no challenges when you need information about this magical resource.
You: What is the meaning of life?
AI: The meaning of life is described as a positive or subjectivous structure.

AIFinder

A tool that snitches on AI models. Every AI has a writing accent. AIFinder detects it.

🔍

Which AI Wrote This?

Paste any AI-generated text and AIFinder will guess which lab made it. Google, Anthropic, OpenAI, DeepSeek, xAI, and more. It learns from corrections. The more you use it, the smarter it gets.

Anthropic DeepSeek Google OpenAI xAI Mistral MiniMax +4 more

Free API available | 60 requests/min | No API key required

YES WE KNOW IT SUCKS

The tool guesses wrong sometimes. It confuses Anthropic with OpenAI. It confidently identifies Google as DeepSeek. It is basically a parrot with an opinion.

Pro tip: Ask it math and reasoning questions. We trained it on huge amounts of TeichAI datasets (check them out at huggingface.co/TeichAI). It is noticeably better at detecting which math-happy lab produced the output.

That said, I have an AI working on fixing it. I could not be bothered to do it manually.

7+ hours

The AI is trying its best. Poor thing.