TMLM Haiku Series
Early training stages of something small, yet so big.
⚠️ IMPORTANT NOTICE
- The model is really dumb. This is a sub-1M parameter research model designed for experimentation, not production use.
- Do not expect it to answer any questions. It is prone to repetition, hallucination, and format collapse.
TinyMemoryLM is an ultra-lightweight language model optimized for edge cases and architectural experimentation. Despite its small footprint, it incorporates several novel training innovations aimed at stabilizing tiny model convergence, including hybrid tokenization, loss boosting strategies, and context-aware relevance modeling.
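The hybrid word/character tokenization mentioned above is not specified in detail here. A minimal sketch of one common approach, assuming that in-vocabulary words become single tokens and everything else falls back to per-character tokens (the actual `tokenizer.json` may use a different fallback rule):

```python
# Hypothetical sketch of a hybrid word/character tokenizer:
# known words become single tokens; out-of-vocabulary words
# fall back to per-character tokens.

def build_vocab(words, charset):
    vocab = {}
    for ch in charset:          # character-level fallback entries
        vocab[ch] = len(vocab)
    for w in words:             # whole-word entries
        vocab[w] = len(vocab)
    return vocab

def encode(text, vocab):
    ids = []
    for word in text.split():
        if word in vocab:
            ids.append(vocab[word])             # one token per known word
        else:
            ids.extend(vocab[c] for c in word)  # char fallback for OOV words
    return ids

vocab = build_vocab(["hello", "world"], "abcdefghijklmnopqrstuvwxyz")
print(encode("hello there", vocab))  # "hello" -> 1 token, "there" -> 5 char tokens
```

This kind of scheme keeps the vocabulary tiny (here ~2,133 entries) while guaranteeing that any input can still be encoded via the character fallback.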
This release includes both Pretrained Weights (base language modeling, completion) and Instruction Weights (fine-tuned for chat).
| File | Description |
|---|---|
| `tokenizer.json` | Hybrid word/character tokenizer vocabulary (2,133 tokens). |
| `pretrain.pt` | Base pretrained checkpoint (language modeling). |
| `model.pt` | Instruction-tuned checkpoint (SFT/chat). |
| `samples.jsonl` | Sample generations with NLL/PPL metrics at checkpoints. |
| `loss_curve.png` | Training loss progression across all phases. |
| Parameter | Value |
|---|---|
| Architecture | Transformer Decoder (GQA) |
| Parameters | ~700K |
| Context Length | 2,048 tokens |
| Sliding Window | 512 tokens |
| Dimensions | d_model=128, unique_layers=8, logical_layers=16, heads=4, kv_heads=2, ffn=224 |
| Vocabulary | ~2,133 tokens (Hybrid Char + Word) |
| Normalization | RMSNorm |
| Embeddings | Rotary Embeddings (RoPE, 25% fraction) |
| Activation | SwiGLU |
| Multi-Token Prediction | Horizons at 2, 3, 4 |
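The `unique_layers=8` / `logical_layers=16` split implies weight sharing: 8 parameter sets are reused across a 16-layer forward pass. The reuse schedule is not stated; the sketch below assumes a simple cyclic pattern (the actual model may interleave layers differently):

```python
# Hypothetical layer-reuse schedule: unique_layers=8 parameter sets
# applied across logical_layers=16 forward-pass positions. A cyclic
# pattern (0..7, 0..7) is assumed here.
UNIQUE_LAYERS = 8
LOGICAL_LAYERS = 16

def layer_schedule(unique, logical):
    # Map each logical depth to the index of the unique layer whose
    # weights are applied at that depth.
    return [depth % unique for depth in range(logical)]

def forward(x, layers, schedule):
    # Apply the shared layers in scheduled order.
    for idx in schedule:
        x = layers[idx](x)
    return x

schedule = layer_schedule(UNIQUE_LAYERS, LOGICAL_LAYERS)
print(schedule)  # each of the 8 unique layers is applied twice
```

Weight sharing of this kind doubles effective depth without adding parameters, which is one way a ~700K-parameter model can afford 16 layers of computation.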
TinyMemoryLM implements several research-focused modifications to standard transformer architectures:
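One modification listed in the architecture table is multi-token prediction with horizons at 2, 3, and 4: in addition to the next token, auxiliary heads predict tokens several positions ahead. The target construction can be sketched as follows (the head architecture itself is not described here and is an assumption):

```python
# Hypothetical construction of multi-token-prediction targets.
# For each horizon h, position i is trained to predict token i + h.
HORIZONS = (2, 3, 4)

def mtp_targets(token_ids, horizons=HORIZONS):
    targets = {}
    for h in horizons:
        # Positions near the end of the sequence have no target
        # at this horizon, so the target list is h shorter.
        targets[h] = token_ids[h:]
    return targets

seq = [10, 11, 12, 13, 14, 15]
for h, tgt in mtp_targets(seq).items():
    print(h, tgt)
```

Per the training table below, the auxiliary losses over these targets would be added to the main next-token loss with a weight of 0.3.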
| Parameter | Value |
|---|---|
| Batch Size | 48 |
| Pretrain LR | 8e-4 (min 1e-5) |
| SFT LR | 2e-4 (min 1e-5) |
| Warmup | 300 steps |
| Weight Decay | 0.02 |
| Max Grad Norm | 1.0 |
| MTP Weight | 0.3 |
| Word Token Loss Boost | 3.0x |
| Response-Start Boost | 3.0x (first 20 tokens) |
| Checkpointing | Every 1,000 steps |
| Sampling | Every 5,000 steps |
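The two loss boosts in the table amount to per-token weights on the cross-entropy loss: word-level tokens are weighted 3.0x, and the first 20 tokens of the response are weighted 3.0x. How the two boosts compose when both apply (multiply vs. take the maximum) is not stated; the sketch below assumes `max`:

```python
# Hypothetical per-token loss weighting combining the boosts from the
# training table: word tokens x3.0, first 20 response tokens x3.0.
# Composition via max() is an assumption.
WORD_BOOST = 3.0
RESPONSE_START_BOOST = 3.0
RESPONSE_START_SPAN = 20

def token_weights(is_word_token, response_start):
    # is_word_token: per-position flags (word token vs. character token).
    # response_start: index where the assistant response begins.
    weights = []
    for i, is_word in enumerate(is_word_token):
        w = 1.0
        if is_word:
            w = max(w, WORD_BOOST)
        if response_start <= i < response_start + RESPONSE_START_SPAN:
            w = max(w, RESPONSE_START_BOOST)
        weights.append(w)
    return weights

flags = [False, True, False, True]
print(token_weights(flags, response_start=2))  # -> [1.0, 3.0, 3.0, 3.0]
```

Up-weighting word tokens and the opening of the response is a plausible way to counter the repetition and format collapse noted above, by concentrating gradient signal where tiny models fail first.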
Training loss progression across the pretrain and SFT phases is plotted in `loss_curve.png`.
Please manage your expectations when using TinyMemoryLM: it is a sub-1M parameter research model, and the limitations in the notice above apply to all use cases.
Generated for research purposes. Use responsibly.