# Kazakh MoE 200M (Active 50M)
A Mixture-of-Experts language model for Kazakh, created via sparse upcycling from a dense Llama model.
## Model Description
This model was created by converting a trained dense Llama model (kazakh-llama-50m-v2) into a Mixtral-style MoE architecture. Each MLP layer was duplicated into 8 experts with top-2 routing, and the resulting model was then fine-tuned for 1000 steps to train the router.
| Parameter | Value |
|---|---|
| Architecture | Mixtral (MoE) |
| Total parameters | 166M |
| Active parameters | ~50M per token |
| Experts | 8 per layer, top-2 routing |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Expert intermediate size | 1344 |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Context length | 1024 |
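For reference, the table above corresponds to a Mixtral configuration along these lines (a minimal sketch; the listed values come from the table, all other fields are left at transformers defaults):

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Sketch of a config matching the table above; unset fields keep
# the transformers defaults.
config = MixtralConfig(
    vocab_size=50257,              # GPT-2 BPE tokenizer
    hidden_size=512,
    intermediate_size=1344,        # per-expert MLP width
    num_hidden_layers=8,
    num_attention_heads=8,
    num_local_experts=8,           # experts per layer
    num_experts_per_tok=2,         # top-2 routing
    max_position_embeddings=1024,  # context length
)
model = MixtralForCausalLM(config)
```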
## Sparse Upcycling
Sparse upcycling converts a trained dense model into an MoE model by:
- Copying all attention weights and layer norms 1-to-1
- Duplicating each MLP (gate_proj, up_proj, down_proj) into N identical experts
- Randomly initializing the router (gate) weights
- Fine-tuning for a short period so the router learns to distribute tokens across experts
This gives the MoE model a strong initialization from the dense model, requiring far less training than training from scratch.
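A simplified sketch of the conversion, assuming a Llama-style dense checkpoint and a Mixtral target with matching dimensions (key names follow the transformers implementations of both architectures; the actual conversion script is not published, so this is illustrative):

```python
from transformers import AutoModelForCausalLM, MixtralForCausalLM

dense = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-v2")
moe = MixtralForCausalLM(config)  # config as sketched above

# Llama MLP projections map onto Mixtral expert weights:
# gate_proj -> w1, down_proj -> w2, up_proj -> w3
proj_map = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}

moe_sd = moe.state_dict()  # router ("gate") weights stay randomly initialized
for name, tensor in dense.state_dict().items():
    if ".mlp." in name:
        # Duplicate each dense MLP into N identical experts.
        prefix, rest = name.split(".mlp.")
        proj, suffix = rest.split(".", 1)  # e.g. "gate_proj", "weight"
        for e in range(moe.config.num_local_experts):
            key = f"{prefix}.block_sparse_moe.experts.{e}.{proj_map[proj]}.{suffix}"
            moe_sd[key] = tensor.clone()
    else:
        # Attention weights, layer norms, and embeddings copy over 1-to-1.
        moe_sd[name] = tensor

moe.load_state_dict(moe_sd)
```

After this copy, the only weights without a trained initialization are the per-layer routers, which is why a short fine-tune is enough to make the routing useful.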
## Training
- Base model: stukenov/kazakh-llama-50m-v2 (eval_loss: 3.675)
- Dataset: stukenov/kazakh-clean-pretrain-v2 (~1B tokens)
- Fine-tuning: 1000 steps, LR 1e-4, cosine schedule, batch size 8 with 4 gradient-accumulation steps (effective batch size 32)
- Hardware: 1x RTX 4090 (vast.ai), ~12 min training time
- Tokenizer: stukenov/kazakh-gpt2-50k
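The hyperparameters above translate roughly to the following TrainingArguments (a hedged sketch; the output directory name is an assumption, and eval/logging settings are omitted):

```python
from transformers import TrainingArguments

# Sketch of the listed fine-tuning hyperparameters.
args = TrainingArguments(
    output_dir="kazakh-moe-upcycled",  # illustrative name
    max_steps=1000,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # effective batch size 32
)
```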
### Eval Loss Progression
| Step | eval_loss |
|---|---|
| 200 | 3.687 |
| 400 | 3.680 |
| 600 | 3.675 |
| 800 | 3.674 |
| 1000 | 3.674 |
Final eval_loss: 3.674, matching and slightly improving on the dense baseline of 3.675.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-moe-200M-A50M")
tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-moe-200M-A50M")

# Sample a continuation for a Kazakh prompt ("Kazakhstan —")
input_ids = tokenizer("Қазақстан —", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
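Because this is a Mixtral-style model, per-layer router logits can also be inspected at inference time via the standard transformers interface (a small illustrative snippet, reusing `model` and `input_ids` from above):

```python
# Inspect the top-2 routing decisions for the prompt tokens.
outputs = model(input_ids, output_router_logits=True)
first_layer_logits = outputs.router_logits[0]      # (num_tokens, 8 experts)
top2 = first_layer_logits.topk(2, dim=-1).indices  # experts chosen per token
print(top2)
```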
## Part of Kazakh SLM Collection
This model is part of an ongoing project to build small, efficient language models for the Kazakh language. See also:
- kazakh-llama-50m-v2 — dense base model
- kazakh-gpt2-50k — tokenizer