Mixture of Recursion Language Model - 198M (Adaptive Computation)

A novel 198M parameter conversational language model featuring adaptive recursive computation through perplexity-based dynamic routing. Achieves better performance than GPT-2 Medium (345M) with 57% fewer parameters.

🏆 Key Achievement

Perplexity: 15.37 - Outperforms GPT-2 Medium (22 perplexity) despite being significantly smaller.

🔥 Innovation

This model introduces a self-supervised curriculum learning approach where the model learns to allocate computation based on sample difficulty without any manual labeling.

Novel Architecture: Mixture of Recursion

Instead of applying uniform computation to all inputs, this model features:

Perplexity-Based Router: Neural classifier that learns sample difficulty from the model's own confidence
Adaptive Recursion: Dynamically allocates 1, 3, or 5 recursive transformer passes based on input complexity
Self-Supervised Learning: No manual labels - the model learns what's "hard" from its own perplexity signals
NaN-Safe Attention: Uses -1e4 masking instead of -inf for stable fp16 training

How It Works

High Perplexity (>50) → Model struggling → Use 5 recursion steps
Medium Perplexity (20-50) → Moderate difficulty → Use 3 steps
Low Perplexity (<20) → Model confident → Use 1 step (efficient!)

This enables intelligent compute allocation - simple inputs get fast processing, complex inputs get deeper reasoning.

📊 Performance

Benchmark Comparison

Model	Parameters	Perplexity	Notes
This Model	198M	15.37	Novel adaptive architecture
GPT-2 Small	117M	~29	Smaller baseline
GPT-2 Medium	345M	~22	1.7× larger
GPT-2 Large	774M	~18	3.9× larger
Random Baseline	-	~50,000	Theoretical worst

Training Metrics

📈 Training Progression:
Epoch 1: 21.75 perplexity  (Val loss: 3.0798)
Epoch 2: 15.37 perplexity  (Val loss: 2.7326) ← BEST

📉 Loss Reduction:
Train loss: 4.5081 → 3.3068 (27% reduction)
Val loss:   3.0798 → 2.7326 (11% reduction)
nan_batches: 0 (stable training throughout)

Performance Highlights

✅ 15.37 perplexity on validation set (2,000 samples)
✅ Better than GPT-2 Medium with 57% fewer parameters
✅ Efficient inference - adaptive computation saves resources
✅ Novel architecture - not just fine-tuning
✅ Stable training - 0 NaN batches across 150K steps

🎯 Model Architecture

Specifications

Component	Configuration
Total Parameters	~198M
Vocabulary Size	50,260 tokens (GPT-2 BPE + special tokens)
Embedding Dimension	768
Base Transformer Layers	16
Attention Heads	12 heads per layer
Head Dimension	64 (768 ÷ 12)
FFN Intermediate Size	3072
Max Sequence Length	512 tokens
Positional Encoding	Rotary Positional Embeddings (RoPE)
Dropout Rate	0.1 (hidden & attention)

Recursion Configuration

Complexity Level	Perplexity Range	Steps	Use Case
Simple	< 20	1 step	Model is confident
Medium	20-50	3 steps	Moderate difficulty
Complex	> 50	5 steps	Model struggling, needs deep reasoning

Architecture Components

Token Embedding Layer (50,260 × 768)
- Special tokens: <|user|>, <|assistant|>, <|pad|>, <|endoftext|>
Base Transformer Stack (16 layers)
- Multi-head self-attention with RoPE
- Feed-forward networks (768 → 3072 → 768)
- Pre-normalization with LayerNorm
- Residual connections
- NaN-safe attention masking (-1e4)
Perplexity-Based Router (~1.2M params)
- Attention-weighted sequence pooling
- 2-layer MLP classifier (768 → 384 → 3)
- Trained on pseudo-labels from sample perplexity
- Outputs: complexity class (0=simple, 1=medium, 2=complex)
Recursive Refinement Layer (~7M params)
- Transformer block applied 1-5 times adaptively
- Same architecture as base layers
- Reused weights for parameter efficiency
Output Projection
- Final LayerNorm
- Linear layer (768 → 50,260)

🚀 Quick Start

Installation

pip install transformers torch

Basic Usage

from mixture_of_recursion import RecursiveLanguageModel
from transformers import AutoTokenizer
import torch

# Load model
model = RecursiveLanguageModel.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(
    "Girinath11/recursive-language-model-198m",
    trust_remote_code=True
)

device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

print(f"✅ Model loaded on {device}")
print(f"📊 Parameters: {sum(p.numel() for p in model.parameters()):,}")

Conversational Format

def chat(question, max_new_tokens=150, temperature=0.7, top_p=0.9):
    # Must use chat format — model trained on this
    prompt = f"<|user|>\n{question}\n<|assistant|>\n"

    inputs = tokenizer(
        prompt,
        return_tensors="pt",
        add_special_tokens=False
    ).to(device)

    with torch.no_grad():
        outputs = model.generate(
            inputs['input_ids'],
            max_new_tokens=max_new_tokens,
            temperature=temperature,
            top_p=top_p,
            do_sample=True,
        )

    full_text = tokenizer.decode(outputs[0], skip_special_tokens=True)

    if "<|assistant|>" in full_text:
        return full_text.split("<|assistant|>")[-1].strip()
    return full_text

# Example
print(chat("What is machine learning?"))
print(chat("Explain neural networks simply"))
print(chat("What is Python programming?"))

Generation Parameters

# Creative writing (high temperature)
response = chat("Write a story about AI", temperature=1.0, top_p=0.95)

# Focused/deterministic (low temperature)
response = chat("Explain quantum computing", temperature=0.5, top_p=0.9)

📚 Training Details

Dataset

Total Training Samples: 150,000 (high-quality conversational data)

Dataset	Samples	Percentage	Description
Anthropic HH-RLHF	80,000	53%	Helpful & harmless responses
UltraChat	50,000	33%	GPT-4 generated multi-turn dialogues
Alpaca-GPT4	20,000	14%	Instruction following data
Validation	2,000	-	Held-out evaluation samples

Data Format:

<|user|>
User message here
<|assistant|>
Assistant response here
<|endoftext|>

Training Configuration

Hardware:
  GPU: NVIDIA Tesla T4 (15.6 GB)
  Platform: Kaggle
  Framework: PyTorch + HuggingFace Transformers
  Mixed Precision: FP16 (AMP)

Hyperparameters:
  Batch Size: 2
  Gradient Accumulation: 32
  Effective Batch Size: 64
  Max Sequence Length: 512
  Learning Rate: 1e-4
  Optimizer: AdamW (betas: 0.9, 0.95)
  Weight Decay: 0.01
  LR Schedule: LinearWarmup + CosineAnnealing
  Warmup Steps: 500
  Gradient Clipping: 1.0
  Total Epochs: 2
  Total Steps: 150,000

Loss Function:
  Language Modeling: CrossEntropyLoss
  Router Loss: CrossEntropyLoss (weight: 0.1)
  Total: LM Loss + 0.1 × Router Loss

Key Fix:
  Attention Mask: -1e4 (NOT -inf) → prevents NaN in fp16
  Pad Token: Separate from EOS token

Training Progression

Epoch	Train Loss	Val Loss	Perplexity	NaN Batches
1	4.5081	3.0798	21.75	0
2	3.3068	2.7326	15.37 🔥	0

Key Observations:

Loss decreased steadily across all epochs
Zero NaN batches — completely stable training
29% perplexity improvement from Epoch 1 to 2
Training time: ~9h 12m on single T4 GPU

💡 Technical Innovation

Perplexity-Based Routing (Novel Contribution)

Traditional transformers apply the same computational depth to all inputs. This model recognizes that:

Simple inputs need minimal processing
Complex inputs benefit from deeper iterative refinement

Key Innovation: The model learns complexity from its own performance:

# During training (self-supervised):
sample_perplexity = exp(sample_loss)

if sample_perplexity < 20:
    label = 0  # Simple — model is confident
elif sample_perplexity < 50:
    label = 1  # Medium — model is uncertain
else:
    label = 2  # Complex — model struggling

# Router learns to predict this from input features
router_loss = CrossEntropyLoss(router_logits, label)
total_loss = lm_loss + 0.1 * router_loss

Benefits:

✅ No manual labeling — fully self-supervised
✅ Adapts as model learns — natural curriculum
✅ Objective measure — based on actual model performance
✅ Efficient — simple inputs use 1/5 the compute of complex ones

NaN-Safe Attention Masking

Critical fix for stable fp16 training:

# Unstable (causes NaN in fp16):
mask.fill_(float('-inf'))

# Stable ✅:
mask.fill_(-1e4)
# Then clamp before softmax:
scores = scores.clamp(min=-1e4, max=1e4)
attn = F.softmax(scores, dim=-1)
attn = torch.nan_to_num(attn, nan=0.0)

Self-Supervised Curriculum Learning

Early training: Most samples are "complex" (high perplexity)
Mid training: Distribution shifts as model learns
Late training: More samples become "simple" (low perplexity)

Natural curriculum without human intervention.

🎯 Use Cases

✅ Recommended

Research — Experimenting with adaptive computation
Educational demos — Teaching language model concepts
Prototyping — Conversational AI applications
Creative writing — With human review

⚠️ Not Recommended

Production chatbots without fine-tuning
Medical, legal, or financial advice
Safety-critical systems

⚠️ Limitations

Context Window: 512 tokens maximum
Training Data: 150K samples — smaller than commercial models
Single Language: Primarily English
Repetition: May repeat phrases in very long generations (>200 tokens)
Factual Accuracy: May generate plausible but incorrect information

📁 Repository Structure

recursive-language-model-198m/
├── config.json                 # Model configuration
├── model.safetensors           # Model weights (794 MB)
├── tokenizer.json              # Tokenizer vocabulary
├── tokenizer_config.json       # Tokenizer settings
├── special_tokens_map.json     # Special token mappings
├── vocab.json                  # GPT-2 vocabulary
├── merges.txt                  # BPE merges
├── mixture_of_recursion.py     # Model architecture code
├── model_usage.py              # Usage examples
└── README.md                   # This file

📄 Citation

@misc{mixture-of-recursion-198m-2026,
  author = {Girinath11},
  title = {Mixture of Recursion Language Model with Perplexity-Based Dynamic Routing},
  year = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Girinath11/recursive-language-model-198m}},
  note = {198M parameter model with adaptive computation achieving 15.37 perplexity}
}

🙏 Acknowledgments

Anthropic for HH-RLHF dataset
Tsinghua University for UltraChat dataset
Vicgalle for Alpaca-GPT4 dataset
Hugging Face for Transformers library
Kaggle for free GPU access

📝 Version History

v1.0 (March 2026)

Initial release
198M parameters
Perplexity: 15.37
Trained on 150K samples
2 epochs, 9h 12m training time
NaN-safe fp16 training

📜 License

MIT License — Free to use with attribution

Model Status: ✅ Ready for Research & Prototyping
Last Updated: March 2026

Downloads last month: 476

Safetensors

Model size

0.2B params

Tensor type

I64

F32

Model tree for Girinath11/recursive-language-model-198m

Base model

openai-community/gpt2

Finetuned

(2096)

this model

Girinath11
/

recursive-language-model-198m