# SozKZ GEC Morphology 50M
A 50M-parameter Kazakh grammatical error correction (GEC) model specialized in morphological errors — incorrect suffixes, word endings, and vowel harmony violations.
Fine-tuned from stukenov/kazakh-gec-50m (general GEC model) on a filtered subset of morphology-only errors.
## How It Works

The model uses a decoder-only (causal LM) approach with special tokens that frame GEC as a sequence-to-sequence task:

`<TASK_FIX><SRC>{erroneous text}<SEP>{corrected text}<EOS>`

Loss is computed only on tokens after `<SEP>`, so the model learns to generate corrections conditioned on the noisy input.
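The loss masking described above can be sketched as follows. This is an assumed reconstruction (the training script is not published on this card): labels up to and including `<SEP>` are set to `-100`, the index PyTorch's cross-entropy loss ignores, so only the correction tokens contribute to the loss.

```python
import torch

def mask_labels(input_ids: torch.Tensor, sep_token_id: int) -> torch.Tensor:
    """Return labels where everything up to and including <SEP> is -100."""
    labels = input_ids.clone()
    # position of the first <SEP> token in the sequence
    sep_pos = (input_ids == sep_token_id).nonzero(as_tuple=True)[0][0].item()
    labels[: sep_pos + 1] = -100  # ignored by torch cross-entropy
    return labels

# toy ids: <TASK_FIX>=1, <SRC>=2, source tokens, <SEP>=3, target tokens, <EOS>=0
ids = torch.tensor([1, 2, 101, 102, 3, 201, 202, 0])
print(mask_labels(ids, sep_token_id=3).tolist())
# → [-100, -100, -100, -100, -100, 201, 202, 0]
```

The token ids here are placeholders; the real special-token ids come from the model's tokenizer.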
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stukenov/kazakh-gec-morphology-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()


def correct(text: str) -> str:
    prompt = f"<TASK_FIX><SRC>{text}<SEP>"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    # decode only the tokens generated after the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

### Examples

```python
print(correct("Алтай тілі 1948 жылға дайын ойрат тілі деп аталған."))
# → Алтай тілі 1948 жылға дейін ойрат тілі деп аталған.

print(correct("Біз үшін Джаконда мәңгілік Леонардоның шедеврі боліп қалады."))
# → Біз үшін Джаконда мәңгілік Леонардоның шедеврі болып қалады.
```
## Evaluation
Evaluated on 20 morphology examples from the validation set (greedy decoding):
| Metric | Value |
|---|---|
| Exact match accuracy | 85.0% (17/20) |
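The metric above can be sketched as plain exact-match: a greedy prediction counts as correct only if it equals the reference string exactly after trimming whitespace. The lists below are illustrative stand-ins, not the actual 20 validation examples.

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# 17 of 20 matches reproduces the reported score
print(exact_match(["a"] * 17 + ["x"] * 3, ["a"] * 20))
# → 0.85
```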
### Example Corrections
| Input (erroneous) | Expected | Model Output | Result |
|---|---|---|---|
| жылға дайын ойрат тілі | жылға дейін ойрат тілі | жылға дейін ойрат тілі | ✓ |
| Қазақстаннің экономикалық | Қазақстанның экономикалық | Қазақстанның экономикалық | ✓ |
| шедеврі боліп қалады | шедеврі болып қалады | шедеврі болып қалады | ✓ |
| өткізуга арналған | өткізуге арналған | өткізуге арналған | ✓ |
| болғандіқтен | болғандықтан | болғандықтан | ✓ |
| Гуманизм турелі пікір | Гуманизм туралы пікір | Гуманизм туралы пікір | ✓ |
| жиі орындаады | жиі орындайды | жиі орындайды | ✓ |
The model handles suffix corrections, vowel harmony fixes, and possessive ending repairs well. Failure cases involve semantic substitutions or complex case agreement.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,260 |
| Context length | 1024 |
| Special tokens | `<TASK_FIX>`, `<SRC>`, `<SEP>` |
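The dimensions in the table map directly onto a transformers `LlamaConfig`. This is a sketch only: the MLP `intermediate_size` and weight-tying choice are not stated on the card, so they are left at library defaults and the resulting parameter count will not match ~50M exactly.

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=50_260,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=1024,  # context length
)
# per-head dimension: 512 / 8 heads
print(config.hidden_size // config.num_attention_heads)
# → 64
```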
## Training
| Hyperparameter | Value |
|---|---|
| Base model | stukenov/kazakh-gec-50m |
| Dataset | stukenov/kazakh-gec-morphology-tokenized |
| Train examples | 638,469 (morphology-filtered + 30% identity) |
| Validation examples | 26,321 |
| Epochs | 3 |
| Batch size | 8 per device × 4 GPUs × 4 grad accum = 128 effective |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 4× NVIDIA RTX 4090 (vast.ai) |
| Training time | ~65 minutes |
| Final train loss | 0.055 |
### Training Pipeline
- Raw data: stukenov/kazakh-synthetic-gec-datasets (10 sub-datasets)
- Error classification: each (input, target) pair classified as `morphology`, `word_order`, `orthography`, `mixed`, or `identity`
- Morphology filter: same word count, ≤3 character differences per changed word
- Identity augmentation: 30% of filtered examples duplicated as (target → target) for robustness
- Pre-tokenized and uploaded to HuggingFace for fast DDP training
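The morphology filter in the pipeline above can be sketched as follows. This is a hypothetical implementation of the stated rule (same word count, ≤3 character edits per changed word), not the project's actual filtering code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_morphology_pair(src: str, tgt: str) -> bool:
    sw, tw = src.split(), tgt.split()
    if len(sw) != len(tw):
        return False  # insertions/deletions/word-order errors are excluded
    return all(edit_distance(a, b) <= 3 for a, b in zip(sw, tw) if a != b)

print(is_morphology_pair("шедеврі боліп қалады", "шедеврі болып қалады"))
# → True (single-character suffix fix)
print(is_morphology_pair("ойрат тілі деп аталған", "ойрат тілі деп аталған еді"))
# → False (word counts differ)
```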
## Limitations
- Specialized for morphological errors only; may not handle word order or spelling errors well
- 50M parameters limits capacity for complex corrections
- Trained on synthetic data — may not generalize perfectly to all real-world errors
- Best suited for single-error sentences; multi-error correction is less reliable
## Part of the SozKZ Project
This model is part of the SLM (Small Language Model for Kazakh) project, building compact, efficient NLP tools for the Kazakh language.
Related models:
- stukenov/kazakh-llama-50m-v2 — base pretrained model
- stukenov/kazakh-gec-50m — general GEC model (parent)