# SozKZ GEC Morphology 50M
A 50M-parameter Kazakh grammatical error correction (GEC) model specialized in morphological errors — incorrect suffixes, word endings, and vowel harmony violations.
Fine-tuned from stukenov/kazakh-gec-50m (general GEC model) on a filtered subset of morphology-only errors.
## How It Works

The model uses a decoder-only (causal LM) approach with special tokens that frame GEC as a sequence-to-sequence task:

`<TASK_FIX><SRC>{erroneous text}<SEP>{corrected text}<EOS>`

Loss is computed only on tokens after `<SEP>`, so the model learns to generate corrections conditioned on the noisy input.
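The loss masking described above can be sketched as follows. This is an assumed reconstruction (the training script is not published on this card): labels up to and including `<SEP>` are set to `-100`, the index PyTorch's cross-entropy loss ignores, so only the correction tokens contribute to the loss.

```python
import torch

def mask_labels(input_ids: torch.Tensor, sep_token_id: int) -> torch.Tensor:
    """Return labels where everything up to and including <SEP> is -100."""
    labels = input_ids.clone()
    # position of the first <SEP> token in the sequence
    sep_pos = (input_ids == sep_token_id).nonzero(as_tuple=True)[0][0].item()
    labels[: sep_pos + 1] = -100  # ignored by torch cross-entropy
    return labels

# toy ids: <TASK_FIX>=1, <SRC>=2, source tokens, <SEP>=3, target tokens, <EOS>=0
ids = torch.tensor([1, 2, 101, 102, 3, 201, 202, 0])
print(mask_labels(ids, sep_token_id=3).tolist())
# → [-100, -100, -100, -100, -100, 201, 202, 0]
```

The token ids here are placeholders; the real special-token ids come from the model's tokenizer.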
## Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stukenov/kazakh-gec-morphology-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()


def correct(text: str) -> str:
    prompt = f"<TASK_FIX><SRC>{text}<SEP>"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    # decode only the tokens generated after the prompt
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
```

### Examples

```python
print(correct("Алтай тілі 1948 жылға дайын ойрат тілі деп аталған."))
# → Алтай тілі 1948 жылға дейін ойрат тілі деп аталған.

print(correct("Біз үшін Джаконда мәңгілік Леонардоның шедеврі боліп қалады."))
# → Біз үшін Джаконда мәңгілік Леонардоның шедеврі болып қалады.
```
## Evaluation
Evaluated on 20 morphology examples from the validation set (greedy decoding):
| Metric | Value |
|---|---|
| Exact match accuracy | 85.0% (17/20) |
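The metric above can be sketched as plain exact-match: a greedy prediction counts as correct only if it equals the reference string exactly after trimming whitespace. The lists below are illustrative stand-ins, not the actual 20 validation examples.

```python
def exact_match(predictions, references):
    """Fraction of predictions that equal their reference exactly."""
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)

# 17 of 20 matches reproduces the reported score
print(exact_match(["a"] * 17 + ["x"] * 3, ["a"] * 20))
# → 0.85
```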
### Example Corrections
| Input (erroneous) | Expected | Model Output | Result |
|---|---|---|---|
| жылға дайын ойрат тілі | жылға дейін ойрат тілі | жылға дейін ойрат тілі | ✓ |
| Қазақстаннің экономикалық | Қазақстанның экономикалық | Қазақстанның экономикалық | ✓ |
| шедеврі боліп қалады | шедеврі болып қалады | шедеврі болып қалады | ✓ |
| өткізуга арналған | өткізуге арналған | өткізуге арналған | ✓ |
| болғандіқтен | болғандықтан | болғандықтан | ✓ |
| Гуманизм турелі пікір | Гуманизм туралы пікір | Гуманизм туралы пікір | ✓ |
| жиі орындаады | жиі орындайды | жиі орындайды | ✓ |
The model handles suffix corrections, vowel harmony fixes, and possessive ending repairs well. Failure cases involve semantic substitutions or complex case agreement.
## Architecture
| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,260 |
| Context length | 1024 |
| Special tokens | `<TASK_FIX>`, `<SRC>`, `<SEP>` |
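The dimensions in the table map directly onto a transformers `LlamaConfig`. This is a sketch only: the MLP `intermediate_size` and weight-tying choice are not stated on the card, so they are left at library defaults and the resulting parameter count will not match ~50M exactly.

```python
from transformers import LlamaConfig

config = LlamaConfig(
    vocab_size=50_260,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    max_position_embeddings=1024,  # context length
)
# per-head dimension: 512 / 8 heads
print(config.hidden_size // config.num_attention_heads)
# → 64
```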
## Training
| Hyperparameter | Value |
|---|---|
| Base model | stukenov/kazakh-gec-50m |
| Dataset | stukenov/kazakh-gec-morphology-tokenized |
| Train examples | 638,469 (morphology-filtered + 30% identity) |
| Validation examples | 26,321 |
| Epochs | 3 |
| Batch size | 8 per device × 4 GPUs × 4 grad accum = 128 effective |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 4× NVIDIA RTX 4090 (vast.ai) |
| Training time | ~65 minutes |
| Final train loss | 0.055 |
### Training Pipeline
- Raw data: stukenov/kazakh-synthetic-gec-datasets (10 sub-datasets)
- Error classification: each (input, target) pair classified as `morphology`, `word_order`, `orthography`, `mixed`, or `identity`
- Morphology filter: same word count, ≤3 character differences per changed word
- Identity augmentation: 30% of filtered examples duplicated as (target → target) for robustness
- Pre-tokenized and uploaded to HuggingFace for fast DDP training
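The morphology filter in the pipeline above can be sketched as follows. This is a hypothetical implementation of the stated rule (same word count, ≤3 character edits per changed word), not the project's actual filtering code:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the standard dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[-1] + 1,            # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def is_morphology_pair(src: str, tgt: str) -> bool:
    sw, tw = src.split(), tgt.split()
    if len(sw) != len(tw):
        return False  # insertions/deletions/word-order errors are excluded
    return all(edit_distance(a, b) <= 3 for a, b in zip(sw, tw) if a != b)

print(is_morphology_pair("шедеврі боліп қалады", "шедеврі болып қалады"))
# → True (single-character suffix fix)
print(is_morphology_pair("ойрат тілі деп аталған", "ойрат тілі деп аталған еді"))
# → False (word counts differ)
```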
## Limitations
- Specialized for morphological errors only; may not handle word order or spelling errors well
- 50M parameters limits capacity for complex corrections
- Trained on synthetic data — may not generalize perfectly to all real-world errors
- Best suited for single-error sentences; multi-error correction is less reliable
## Part of the SozKZ Project
This model is part of the SLM (Small Language Model for Kazakh) project, building compact, efficient NLP tools for the Kazakh language.
Related models:
- stukenov/kazakh-llama-50m-v2 — base pretrained model
- stukenov/kazakh-gec-50m — general GEC model (parent)