---
language:
- uz
- en
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
- uzbek
- qwen3
- language-model
- text-generation
- nlp
- central-asia
- low-resource
- tokenizer-optimization
datasets:
- behbudiy/alpaca-cleaned-uz
- NeuronUz/uzbek-spelling-mcq
pipeline_tag: text-generation
model-index:
- name: NeuronAI-Uzbek
  results:
  - task:
      type: text-generation
      name: Uzbek Language Understanding
    dataset:
      name: UzLiB Benchmark
      type: uzlib
    metrics:
    - type: accuracy
      value: 0.662
      name: Overall Accuracy
---
<div align="center">

# 🇺🇿 NeuronAI-Uzbek

### The Most Advanced Open-Source Language Model for Uzbek

[Model on Hugging Face](https://huggingface.co/NeuronUz/NeuronAI-Uzbek)
[License: Apache 2.0](https://opensource.org/licenses/Apache-2.0)
[Base model: Qwen3-4B](https://huggingface.co/Qwen/Qwen3-4B)

**4th Place Globally | 1st Place in Uzbekistan on UzLiB Benchmark**

*Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on Uzbek language tasks*

</div>
|
|
---

## Key Results

<div align="center">
|
|
| | Achievement | Value | |
| |-------------|-------| |
| | **UzLiB Overall Score** | **0.662** | |
| | **Global Ranking** | **#4** | |
| | **Regional Ranking** | **#1 in Uzbekistan** | |
| | **Tokenizer Efficiency Improvement** | **+22.5%** vs Qwen3-4B | |
|
|
</div>

---

## UzLiB Benchmark Performance

NeuronAI-Uzbek scores 0.662 overall on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), a comprehensive evaluation suite for Uzbek language understanding.

### Leaderboard Position

[UzLiB leaderboard](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)

> **Note**: NeuronAI-Uzbek is the **smallest model** in the top 10, with only **4B parameters**, while competing against models with 100B+ parameters.

### Performance Comparison vs Original Qwen3-4B
|
|
| | Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement | |
| |--------|:-------------------:|:--------------:|:-----------:| |
| | **Overall (All)** | 0.345 | **0.662** | **+91.9%** | |
| | Correct Word | 0.351 | 0.718 | +104.6% | |
| | Meaning | 0.309 | 0.466 | +50.8% | |
| | Meaning in Context | 0.347 | 0.333 | -4.0% | |
| | Fill-in | 0.327 | 0.385 | +17.7% | |
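
The Improvement column is the relative change over the base model, that is, (new - old) / old. The sketch below recomputes it from the accuracy values in the table; the function name `relative_change` is illustrative and not part of any released tooling.

```python
def relative_change(baseline: float, new: float) -> float:
    """Return the percentage change of `new` relative to `baseline`."""
    return (new - baseline) / baseline * 100.0


# Accuracy pairs copied from the table above: (Qwen3-4B, NeuronAI-Uzbek).
scores = {
    "Overall (All)": (0.345, 0.662),
    "Correct Word": (0.351, 0.718),
    "Meaning": (0.309, 0.466),
    "Meaning in Context": (0.347, 0.333),
    "Fill-in": (0.327, 0.385),
}

changes = {
    name: round(relative_change(old, new), 1)
    for name, (old, new) in scores.items()
}

for name, pct in changes.items():
    print(f"{name}: {pct:+.1f}%")
```

Meaning in Context is the only category where the fine-tuned model scores below the base model.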
|
|
---

## Tokenizer Efficiency

We extended the tokenizer for Uzbek, lowering its fertility rate from 3.44 to 2.67 tokens per word. Fewer tokens per word means shorter sequences, faster inference, and lower cost.

### Fertility Rate Comparison
|
|
| | Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 | |
| |-------|:--------------:|:-------:|:----------:|:--------------------:| |
| **NeuronAI-Uzbek (Ours)** | **2.67** | 0.15 | 180,000 | **+22.5%** |
| | Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% | |
| | LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% | |
| | DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% | |
| | Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - | |
|
|
> **Fertility Rate**: Average number of tokens per word. Lower is better for efficiency.
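
As a rough illustration of the metric, the sketch below divides the total token count by the total word count. Two things here are assumptions: words are counted by whitespace splitting (the card does not say how words were counted), and `toy_tokenize` is a stand-in that only exists to make the example runnable.

```python
from typing import Callable, Iterable, List


def fertility_rate(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words to measure")
    return total_tokens / total_words


def toy_tokenize(text: str) -> List[str]:
    """Stand-in tokenizer: splits each word into chunks of at most 4 characters."""
    return [word[i:i + 4] for word in text.split() for i in range(0, len(word), 4)]


sample = ["Salom dunyo", "O'zbekiston"]
rate = fertility_rate(sample, toy_tokenize)
print(f"{rate:.2f} tokens per word")
```

To measure a real tokenizer, pass the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained(...)` in place of `toy_tokenize`.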
|
|
<div align="center">
<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
</div>

### What This Means

- **22.5% fewer tokens** needed to represent Uzbek text
- **Faster inference** due to shorter sequences
- **Lower API costs** when deployed
- **Better context utilization** - fit more content in the same context window
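
To put the context-window point in numbers, here is a back-of-the-envelope estimate. It assumes the fertility rate stays constant over long texts and ignores chat-template overhead; the 32,768-token window is the context length listed under Model Details. With the rounded fertility values from the table the token saving comes out near 22.4%, close to the reported 22.5%.

```python
CONTEXT_LENGTH = 32_768  # tokens, from the Model Details table


def words_that_fit(context_tokens: int, fertility: float) -> int:
    """Approximate number of words that fit in a window at a given tokens-per-word rate."""
    return int(context_tokens / fertility)


base_words = words_that_fit(CONTEXT_LENGTH, 3.44)  # Qwen3-4B tokenizer
ours_words = words_that_fit(CONTEXT_LENGTH, 2.67)  # NeuronAI-Uzbek tokenizer

# Fewer tokens for the same text, and more words in the same window.
token_saving = (3.44 - 2.67) / 3.44 * 100
extra_words = (ours_words - base_words) / base_words * 100

print(base_words, ours_words)
print(f"{token_saving:.1f}% fewer tokens, {extra_words:.1f}% more words per window")
```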
|
|
---

## 🛠️ Model Details

### Architecture
|
|
| | Property | Value | |
| |----------|-------| |
| | **Base Model** | Qwen3-4B | |
| | **Parameters** | 4 Billion | |
| | **Vocabulary Size** | 180,000 tokens | |
| | **Context Length** | 32,768 tokens | |
| | **Architecture** | Transformer (Decoder-only) | |
| | **Precision** | BFloat16 | |
|
|
### Training Methodology

1. **Tokenizer Surgery**: Extended the vocabulary with 40,000 Uzbek-optimized tokens
2. **Embedding Initialization**: Initialized the new token embeddings semantically, using subword composition
3. **Continual Pretraining**: Trained on a 2B-token corpus of Uzbek and English text
4. **Instruction Fine-tuning**: Aligned using Uzbek and English instruction datasets
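
The card does not spell out how step 2 was implemented. One common form of subword-composition initialization sets each new token's embedding to the mean of the embeddings of the pieces the original tokenizer would have produced for it. The sketch below shows that idea with toy data; the function name, the vectors, and the `["kit", "ob"]` split are all hypothetical.

```python
from typing import Dict, List


def init_new_embedding(
    new_token: str,
    old_pieces: List[str],
    old_embeddings: Dict[str, List[float]],
) -> List[float]:
    """Average the old embeddings of the pieces that made up `new_token`."""
    if not old_pieces:
        raise ValueError(f"no old pieces given for {new_token!r}")
    dim = len(old_embeddings[old_pieces[0]])
    summed = [0.0] * dim
    for piece in old_pieces:
        for i, value in enumerate(old_embeddings[piece]):
            summed[i] += value
    return [value / len(old_pieces) for value in summed]


# Toy 3-dimensional embeddings for pieces of a hypothetical original vocabulary.
old_embeddings = {
    "kit": [1.0, 0.0, 2.0],
    "ob": [3.0, 4.0, 0.0],
}

# "kitob" (book) is assumed to be split as ["kit", "ob"] by the original tokenizer.
new_vector = init_new_embedding("kitob", ["kit", "ob"], old_embeddings)
print(new_vector)
```

Starting new tokens near the meaning of their former pieces usually gives continual pretraining a better starting point than random initialization.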
|
|
### Training Data
|
|
| | Dataset | Type | Purpose | |
| |---------|------|---------| |
| | Uzbek Web Corpus | Pretraining | Language modeling | |
| | behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions | |
| | NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training | |
| | vicgalle/alpaca-gpt4 | SFT | English capability retention | |
|
|
---

## Quick Start

### Installation

```bash
# accelerate is needed for device_map="auto"; Qwen3-based models need transformers >= 4.51.0
pip install "transformers>=4.51.0" torch accelerate
```

### Basic Usage
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "NeuronUz/NeuronAI-Uzbek"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
    trust_remote_code=True
)

# "Give a brief overview of Uzbekistan."
prompt = "O'zbekiston haqida qisqacha ma'lumot bering."

messages = [
    {"role": "user", "content": prompt}
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    temperature=0.7,
    top_p=0.9,
    do_sample=True
)

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response)
```
|
|
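### With Thinking Mode (Chain-of-Thought)

```python
# Reuses `tokenizer` and `model` from the Basic Usage example.
# Prompt: "Find 5 natural numbers less than 100 that are divisible by 3."
messages = [
    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
]

text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Enable step-by-step reasoning
)

inputs = tokenizer(text, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```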
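
With thinking enabled, the decoded output typically contains the reasoning followed by the final answer. The helper below separates the two, assuming the model keeps the base Qwen3 convention of wrapping its reasoning in `<think>` and `</think>` tags (not confirmed by this card). `split_thinking` and the example string are illustrative only.

```python
from typing import Tuple


def split_thinking(decoded: str, close_tag: str = "</think>") -> Tuple[str, str]:
    """Split decoded output into (reasoning, answer) at the last closing think tag."""
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:
        # No closing tag found: treat the whole output as the answer.
        return "", decoded.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Made-up output string used only to exercise the helper.
example = "<think>3, 6, 9, 12, 15 are all divisible by 3.</think>\n\n3, 6, 9, 12, 15"
reasoning, answer = split_thinking(example)
print(answer)
```

Splitting on the last closing tag keeps the helper safe if the reasoning itself happens to mention the tag.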
|
|
---

## Use Cases

NeuronAI-Uzbek excels at:

- **Text Generation**: Creative writing, content creation in Uzbek
- **Question Answering**: Answering questions about Uzbek culture, history, and general knowledge
- **Reading Comprehension**: Understanding and analyzing Uzbek texts
- **Grammar & Spelling**: Uzbek language correctness tasks
- **Translation Assistance**: Uzbek-English language tasks
- **Conversational AI**: Building Uzbek chatbots and assistants
|
|
---

## ⚠️ Limitations

- **Knowledge Cutoff**: The model has no knowledge of events after its training data was collected
- **Hallucinations**: May generate plausible-sounding but incorrect information
- **Bias**: May reflect biases present in the training data
- **Not for Critical Applications**: Should not be used for medical, legal, or safety-critical applications without human oversight
|
|
---

## License

This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

---

## Acknowledgments

- **Qwen Team** at Alibaba for the excellent Qwen3-4B base model
- **UzLiB Benchmark** creators for the comprehensive evaluation framework
- **Uzbek NLP Community** for datasets and linguistic resources

---

## Citation
|
|
```bibtex
@misc{neuronai-uzbek-2025,
  title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
  author={NeuronAI Team},
  year={2025},
  publisher={Hugging Face},
  url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
}
```
|
|
---

<div align="center">

**Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)**

</div>
|
|