# Kazakh MoE 200M (Active 50M)
A Mixture-of-Experts language model for Kazakh, created via sparse upcycling from a dense Llama model.
## Model Description
This model was created by converting a trained dense Llama model (kazakh-llama-50m-v2) into a Mixtral-style MoE architecture. Each MLP layer was duplicated into 8 experts with top-2 routing, and the resulting model was then fine-tuned for 1000 steps to train the router.
| Parameter | Value |
|---|---|
| Architecture | Mixtral (MoE) |
| Total parameters | 166M |
| Active parameters | ~50M per token |
| Experts | 8 per layer, top-2 routing |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Expert intermediate size | 1344 |
| Vocabulary | 50,257 (GPT-2 BPE) |
| Context length | 1024 |
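For reference, the table above corresponds to a Mixtral configuration along these lines (a minimal sketch; the listed values come from the table, all other fields are left at transformers defaults):

```python
from transformers import MixtralConfig, MixtralForCausalLM

# Sketch of a config matching the table above; unset fields keep
# the transformers defaults.
config = MixtralConfig(
    vocab_size=50257,              # GPT-2 BPE tokenizer
    hidden_size=512,
    intermediate_size=1344,        # per-expert MLP width
    num_hidden_layers=8,
    num_attention_heads=8,
    num_local_experts=8,           # experts per layer
    num_experts_per_tok=2,         # top-2 routing
    max_position_embeddings=1024,  # context length
)
model = MixtralForCausalLM(config)
```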
## Sparse Upcycling
Sparse upcycling converts a trained dense model into an MoE model by:
- Copying all attention weights and layer norms 1-to-1
- Duplicating each MLP (gate_proj, up_proj, down_proj) into N identical experts
- Randomly initializing the router (gate) weights
- Fine-tuning for a short period so the router learns to distribute tokens across experts
This gives the MoE model a strong initialization from the dense model, requiring far less training than training from scratch.
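A simplified sketch of the conversion, assuming a Llama-style dense checkpoint and a Mixtral target with matching dimensions (key names follow the transformers implementations of both architectures; the actual conversion script is not published, so this is illustrative):

```python
from transformers import AutoModelForCausalLM, MixtralForCausalLM

dense = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-llama-50m-v2")
moe = MixtralForCausalLM(config)  # config as sketched above

# Llama MLP projections map onto Mixtral expert weights:
# gate_proj -> w1, down_proj -> w2, up_proj -> w3
proj_map = {"gate_proj": "w1", "down_proj": "w2", "up_proj": "w3"}

moe_sd = moe.state_dict()  # router ("gate") weights stay randomly initialized
for name, tensor in dense.state_dict().items():
    if ".mlp." in name:
        # Duplicate each dense MLP into N identical experts.
        prefix, rest = name.split(".mlp.")
        proj, suffix = rest.split(".", 1)  # e.g. "gate_proj", "weight"
        for e in range(moe.config.num_local_experts):
            key = f"{prefix}.block_sparse_moe.experts.{e}.{proj_map[proj]}.{suffix}"
            moe_sd[key] = tensor.clone()
    else:
        # Attention weights, layer norms, and embeddings copy over 1-to-1.
        moe_sd[name] = tensor

moe.load_state_dict(moe_sd)
```

After this copy, the only weights without a trained initialization are the per-layer routers, which is why a short fine-tune is enough to make the routing useful.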
## Training
- Base model: stukenov/kazakh-llama-50m-v2 (eval_loss: 3.675)
- Dataset: stukenov/kazakh-clean-pretrain-v2 (~1B tokens)
- Fine-tuning: 1000 steps, LR 1e-4, cosine schedule, batch size 8 with 4 gradient-accumulation steps (effective batch size 32)
- Hardware: 1x RTX 4090 (vast.ai), ~12 min training time
- Tokenizer: stukenov/kazakh-gpt2-50k
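The hyperparameters above translate roughly to the following TrainingArguments (a hedged sketch; the output directory name is an assumption, and eval/logging settings are omitted):

```python
from transformers import TrainingArguments

# Sketch of the listed fine-tuning hyperparameters.
args = TrainingArguments(
    output_dir="kazakh-moe-upcycled",  # illustrative name
    max_steps=1000,
    learning_rate=1e-4,
    lr_scheduler_type="cosine",
    per_device_train_batch_size=8,
    gradient_accumulation_steps=4,     # effective batch size 32
)
```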
### Eval Loss Progression
| Step | eval_loss |
|---|---|
| 200 | 3.687 |
| 400 | 3.680 |
| 600 | 3.675 |
| 800 | 3.674 |
| 1000 | 3.674 |
Final eval_loss: 3.674, matching and slightly improving on the dense baseline of 3.675.
## Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("stukenov/kazakh-moe-200M-A50M")
tokenizer = AutoTokenizer.from_pretrained("stukenov/kazakh-moe-200M-A50M")

# Sample a continuation for a Kazakh prompt ("Kazakhstan —")
input_ids = tokenizer("Қазақстан —", return_tensors="pt").input_ids
output = model.generate(input_ids, max_new_tokens=100, do_sample=True, temperature=0.8)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```
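Because this is a Mixtral-style model, per-layer router logits can also be inspected at inference time via the standard transformers interface (a small illustrative snippet, reusing `model` and `input_ids` from above):

```python
# Inspect the top-2 routing decisions for the prompt tokens.
outputs = model(input_ids, output_router_logits=True)
first_layer_logits = outputs.router_logits[0]      # (num_tokens, 8 experts)
top2 = first_layer_logits.topk(2, dim=-1).indices  # experts chosen per token
print(top2)
```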
## Part of Kazakh SLM Collection
This model is part of an ongoing project to build small, efficient language models for the Kazakh language. See also:
- kazakh-llama-50m-v2 — dense base model
- kazakh-gpt2-50k — tokenizer