
Kazakh MoE 200M (Active 50M)

A Mixture-of-Experts language model for Kazakh, created via sparse upcycling from a dense Llama model.

Model Description

This model was created by converting a trained dense Llama model (kazakh-llama-50m-v2) into a Mixtral-style MoE architecture. Each MLP layer was duplicated into 8 identical experts with top-2 routing, and the resulting model was then fine-tuned for 1000 steps to train the router.
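For illustration, the sketch below shows how top-2 routing combines expert outputs, mirroring the Mixtral routing logic (softmax over all router logits, select the top 2, renormalize, take the weighted sum). The tensor and module names here are illustrative, not the checkpoint's internals, and the experts are stand-in linear layers rather than full SwiGLU MLPs.

import torch
import torch.nn.functional as F

hidden_size, num_experts, top_k = 512, 8, 2
gate = torch.nn.Linear(hidden_size, num_experts, bias=False)  # the router
experts = [torch.nn.Linear(hidden_size, hidden_size) for _ in range(num_experts)]

tokens = torch.randn(4, hidden_size)                   # 4 token hidden states
routing_weights = F.softmax(gate(tokens), dim=-1)      # (4, 8) expert probabilities
weights, chosen = torch.topk(routing_weights, top_k)   # top-2 experts per token
weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize over the 2

out = torch.zeros_like(tokens)
for i in range(top_k):                                 # weighted sum of expert outputs
    for t in range(tokens.size(0)):
        out[t] += weights[t, i] * experts[chosen[t, i]](tokens[t])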

Parameter                 Value
Architecture              Mixtral (MoE)
Total parameters          166M
Active parameters         ~50M per token
Experts                   8 per layer, top-2 routing
Hidden size               512
Layers                    8
Attention heads           8
Expert intermediate size  1344
Vocabulary                50,257 (GPT-2 BPE)
Context length            1024
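In configuration terms, the table above maps onto a Mixtral config roughly as sketched below. The field names follow transformers' MixtralConfig; the exact config shipped with the checkpoint may differ in detail.

from transformers import MixtralConfig

config = MixtralConfig(
    vocab_size=50257,            # GPT-2 BPE
    hidden_size=512,
    intermediate_size=1344,      # per-expert MLP width
    num_hidden_layers=8,
    num_attention_heads=8,
    num_local_experts=8,         # experts per layer
    num_experts_per_tok=2,       # top-2 routing
    max_position_embeddings=1024,
)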

Sparse Upcycling

Sparse upcycling converts a trained dense model into an MoE model by:

  1. Copying all attention weights and layer norms 1-to-1
  2. Duplicating each MLP (gate_proj, up_proj, down_proj) into N identical experts
  3. Randomly initializing the router (gate) weights
  4. Fine-tuning for a short period so the router learns to distribute tokens across experts (a sketch of steps 1-3 follows below)
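A minimal sketch of steps 1-3, under stated assumptions: `dense` is the trained LlamaForCausalLM and `moe` a freshly initialized MixtralForCausalLM of matching shape; module names follow the transformers implementations of both architectures.

import torch

with torch.no_grad():
    moe.model.embed_tokens.load_state_dict(dense.model.embed_tokens.state_dict())
    moe.model.norm.load_state_dict(dense.model.norm.state_dict())
    moe.lm_head.load_state_dict(dense.lm_head.state_dict())
    for d_layer, m_layer in zip(dense.model.layers, moe.model.layers):
        # 1. attention weights and layer norms copy over 1-to-1
        m_layer.self_attn.load_state_dict(d_layer.self_attn.state_dict())
        m_layer.input_layernorm.load_state_dict(d_layer.input_layernorm.state_dict())
        m_layer.post_attention_layernorm.load_state_dict(
            d_layer.post_attention_layernorm.state_dict()
        )
        # 2. every expert starts as an identical copy of the dense MLP
        for expert in m_layer.block_sparse_moe.experts:
            expert.w1.weight.copy_(d_layer.mlp.gate_proj.weight)  # w1 <- gate_proj
            expert.w3.weight.copy_(d_layer.mlp.up_proj.weight)    # w3 <- up_proj
            expert.w2.weight.copy_(d_layer.mlp.down_proj.weight)  # w2 <- down_proj
        # 3. the router (m_layer.block_sparse_moe.gate) keeps its random init;
        #    step 4, the brief fine-tune, is what trains it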

This gives the MoE model a strong initialization from the dense model, so it needs far less training than an MoE trained from scratch.

Training

Eval Loss Progression

Step   eval_loss
 200   3.687
 400   3.680
 600   3.675
 800   3.674
1000   3.674

Final eval_loss: 3.674, matching and slightly improving on the dense baseline of 3.675.

Usage

from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-moe-200M-A50M")
tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-moe-200M-A50M")

# Prompt: "Қазақстан —" ("Kazakhstan —"), continued by sampling
input_ids = tokenizer("Қазақстан —", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))

Part of Kazakh SLM Collection

This model is part of an ongoing project to build small, efficient language models for the Kazakh language.
