
SozKZ GEC Morphology 50M

A 50M-parameter Kazakh grammatical error correction (GEC) model specialized in morphological errors — incorrect suffixes, word endings, and vowel harmony violations.

Fine-tuned from stukenov/kazakh-gec-50m (general GEC model) on a filtered subset of morphology-only errors.

How It Works

The model uses a decoder-only (causal LM) architecture with special tokens that frame GEC as a sequence-to-sequence task:

```
<TASK_FIX><SRC>{erroneous text}<SEP>{corrected text}<EOS>
```

Loss is computed only on tokens after <SEP>, so the model learns to generate corrections given noisy input.
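The training code is not included in this card, but the masking described above can be sketched as follows. This is a minimal illustration, not the project's actual script; `sep_token_id` is a placeholder and the sketch assumes every row contains `<SEP>`:

```python
import torch

def build_labels(input_ids: torch.Tensor, sep_token_id: int) -> torch.Tensor:
    """Copy the input IDs and mask everything up to and including <SEP>
    with -100, so CrossEntropyLoss only scores the correction tokens."""
    labels = input_ids.clone()
    # Index of the first <SEP> in each row (assumes one is always present).
    sep_positions = (input_ids == sep_token_id).int().argmax(dim=-1)
    for i, pos in enumerate(sep_positions):
        labels[i, : pos + 1] = -100  # -100 is ignored by the loss
    return labels
```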

Usage

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "stukenov/kazakh-gec-morphology-50m"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, dtype=torch.bfloat16)
model.eval()

def correct(text: str) -> str:
    prompt = f"<TASK_FIX><SRC>{text}<SEP>"
    inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False)
    with torch.no_grad():
        out = model.generate(
            **inputs,
            max_new_tokens=256,
            do_sample=False,
            eos_token_id=tokenizer.eos_token_id,
            pad_token_id=tokenizer.pad_token_id,
        )
    return tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

# Examples
print(correct("Алтай тілі 1948 жылға дайын ойрат тілі деп аталған."))
# → Алтай тілі 1948 жылға дейін ойрат тілі деп аталған.

print(correct("Біз үшін Джаконда мәңгілік Леонардоның шедеврі боліп қалады."))
# → Біз үшін Джаконда мәңгілік Леонардоның шедеврі болып қалады.
```

Evaluation

Evaluated on 20 morphology examples from the validation set (greedy decoding):

| Metric | Value |
|---|---|
| Exact match accuracy | 85.0% (17/20) |
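The evaluation script is not part of this card, but an exact-match metric consistent with the number above can be sketched as:

```python
def exact_match(predictions: list[str], references: list[str]) -> float:
    """Fraction of predictions that equal their reference exactly,
    after trimming surrounding whitespace."""
    assert len(predictions) == len(references)
    hits = sum(p.strip() == r.strip() for p, r in zip(predictions, references))
    return hits / len(references)
```

With the reported 17 correct predictions out of 20, this returns 0.85.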

Example Corrections

| Input (erroneous) | Expected | Model Output |
|---|---|---|
| жылға дайын ойрат тілі | жылға дейін ойрат тілі | жылға дейін ойрат тілі |
| Қазақстаннің экономикалық | Қазақстанның экономикалық | Қазақстанның экономикалық |
| шедеврі боліп қалады | шедеврі болып қалады | шедеврі болып қалады |
| өткізуга арналған | өткізуге арналған | өткізуге арналған |
| болғандіқтен | болғандықтан | болғандықтан |
| Гуманизм турелі пікір | Гуманизм туралы пікір | Гуманизм туралы пікір |
| жиі орындаады | жиі орындайды | жиі орындайды |

The model handles suffix corrections, vowel harmony fixes, and possessive ending repairs well. Failure cases involve semantic substitutions or complex case agreement.

Architecture

| Parameter | Value |
|---|---|
| Architecture | LlamaForCausalLM |
| Parameters | ~50M |
| Hidden size | 512 |
| Layers | 8 |
| Attention heads | 8 |
| Vocab size | 50,260 |
| Context length | 1024 |
| Special tokens | `<TASK_FIX>`, `<SRC>`, `<SEP>` |
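The table above maps directly onto a `LlamaConfig`. The sketch below fills in the listed values; `intermediate_size` is an assumption (it is not stated in this card) set to the common 4× hidden size:

```python
from transformers import LlamaConfig

# Values from the architecture table; intermediate_size is assumed.
config = LlamaConfig(
    vocab_size=50_260,
    hidden_size=512,
    num_hidden_layers=8,
    num_attention_heads=8,
    intermediate_size=2048,        # assumption: not listed in the card
    max_position_embeddings=1024,  # context length
)
```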

Training

| Hyperparameter | Value |
|---|---|
| Base model | stukenov/kazakh-gec-50m |
| Dataset | stukenov/kazakh-gec-morphology-tokenized |
| Train examples | 638,469 (morphology-filtered + 30% identity) |
| Validation examples | 26,321 |
| Epochs | 3 |
| Batch size | 8 per device × 4 GPUs × 4 grad accum = 128 effective |
| Learning rate | 1e-5 (cosine schedule) |
| Warmup | 5% |
| Max grad norm | 1.0 |
| Precision | bf16 |
| Hardware | 4× NVIDIA RTX 4090 (vast.ai) |
| Training time | ~65 minutes |
| Final train loss | 0.055 |
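As a rough guide, the hyperparameters above map onto `transformers.TrainingArguments` like this. The argument names are the library's; `output_dir` is a placeholder and anything not in the table is an assumption:

```python
from transformers import TrainingArguments

args = TrainingArguments(
    output_dir="kazakh-gec-morphology-50m",  # placeholder
    num_train_epochs=3,
    per_device_train_batch_size=8,  # × 4 GPUs × 4 accum = 128 effective
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    lr_scheduler_type="cosine",
    warmup_ratio=0.05,
    max_grad_norm=1.0,
    bf16=True,
)
```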

Training Pipeline

  1. Raw data: stukenov/kazakh-synthetic-gec-datasets (10 sub-datasets)
  2. Error classification: each (input, target) pair classified as morphology, word_order, orthography, mixed, or identity
  3. Morphology filter: same word count, ≤3 character differences per changed word
  4. Identity augmentation: 30% of filtered examples duplicated as (target → target) for robustness
  5. Pre-tokenized and uploaded to HuggingFace for fast DDP training
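The morphology filter in step 3 can be sketched as below. This is a plausible reimplementation, not the project's actual script; it reads "≤3 character differences per changed word" as a Levenshtein edit distance of at most 3:

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via the classic rolling-row DP."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def is_morphology_pair(source: str, target: str, max_diff: int = 3) -> bool:
    """Step 3 heuristic: same word count, and every changed word is
    within `max_diff` edits of its aligned counterpart."""
    src_words, tgt_words = source.split(), target.split()
    if len(src_words) != len(tgt_words):
        return False
    return all(s == t or edit_distance(s, t) <= max_diff
               for s, t in zip(src_words, tgt_words))
```

For example, the pair ("жылға дайын ойрат тілі", "жылға дейін ойрат тілі") passes: the word counts match and the only changed word, дайын → дейін, is 2 edits apart.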

Limitations

  • Specialized for morphological errors only; may not handle word order or spelling errors well
  • 50M parameters limits capacity for complex corrections
  • Trained on synthetic data — may not generalize perfectly to all real-world errors
  • Best suited for single-error sentences; multi-error correction is less reliable

Part of the SozKZ Project

This model is part of SozKZ, a small language model (SLM) project building compact, efficient NLP tools for the Kazakh language.
