	---
	language:
	- uz
	- en
	license: apache-2.0
	base_model: Qwen/Qwen3-4B
	tags:
	- uzbek
	- qwen3
	- language-model
	- text-generation
	- nlp
	- central-asia
	- low-resource
	- tokenizer-optimization
	datasets:
	- behbudiy/alpaca-cleaned-uz
	- NeuronUz/uzbek-spelling-mcq
	pipeline_tag: text-generation
	model-index:
	- name: NeuronAI-Uzbek
	  results:
	  - task:
	      type: text-generation
	      name: Uzbek Language Understanding
	    dataset:
	      name: UzLiB Benchmark
	      type: uzlib
	    metrics:
	    - type: accuracy
	      value: 0.662
	      name: Overall Accuracy
	---

	<div align="center">

	# 🇺🇿 NeuronAI-Uzbek

	### The Most Advanced Open-Source Language Model for Uzbek

	[![Model](https://img.shields.io/badge/🤗_Model-NeuronAI--Uzbek-blue)](https://huggingface.co/NeuronUz/NeuronAI-Uzbek)
	[![License](https://img.shields.io/badge/License-Apache_2.0-green.svg)](https://opensource.org/licenses/Apache-2.0)
	[![Base Model](https://img.shields.io/badge/Base-Qwen3--4B-purple)](https://huggingface.co/Qwen/Qwen3-4B)

	🏆 4th Place Globally | 🥇 1st Place in Uzbekistan on the UzLiB Benchmark

	Outperforming GPT-4o, Claude 3.5 Sonnet, and Gemini 2.5 Flash on the UzLiB Benchmark

	</div>

	---

	## 📊 Key Results

	<div align="center">

	| Achievement | Value |
	|-------------|-------|
	| UzLiB Overall Score | 0.662 |
	| Global Ranking | #4 |
	| Regional Ranking | #1 in Uzbekistan |
	| Tokenizer Efficiency Improvement | +22.5% vs Qwen3-4B |

	</div>

	---

	## 🏆 UzLiB Benchmark Performance

	NeuronAI-Uzbek scores 0.662 overall on the [UzLiB Benchmark](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md), an evaluation suite for Uzbek language understanding, which places it 4th on the leaderboard.

	### Leaderboard Position

	[![image](https://cdn-uploads.huggingface.co/production/uploads/65fc70cbaeca3946b8753017/2xJ9BjS6rPNoRoBAzvW7w.png)](https://github.com/tahrirchi/uzlib/blob/main/LEADERBOARD.md)


	> Note: NeuronAI-Uzbek is the smallest model in the top 10, with only 4B parameters, while competing against models with 100B+ parameters.

	### Performance Comparison vs Original Qwen3-4B

	| Metric | Qwen3-4B (Original) | NeuronAI-Uzbek | Improvement |
	|--------|:-------------------:|:--------------:|:-----------:|
	| Overall (All) | 0.345 | 0.662 | +91.9% |
	| Correct Word | 0.351 | 0.718 | +104.6% |
	| Meaning | 0.309 | 0.466 | +50.8% |
	| Meaning in Context | 0.347 | 0.333 | -4.0% |
	| Fill-in | 0.327 | 0.385 | +17.7% |
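
The Improvement column is the relative change over the Qwen3-4B score. A minimal sketch of that arithmetic, using only the scores from the table (the function and variable names are illustrative, not part of the benchmark tooling):

```python
# Scores copied from the comparison table: (Qwen3-4B, NeuronAI-Uzbek).
scores = {
    "Overall (All)": (0.345, 0.662),
    "Correct Word": (0.351, 0.718),
    "Meaning": (0.309, 0.466),
    "Meaning in Context": (0.347, 0.333),
    "Fill-in": (0.327, 0.385),
}


def relative_improvement(baseline: float, new: float) -> float:
    """Return the relative change from baseline to new, in percent."""
    return (new - baseline) / baseline * 100.0


# Round to one decimal place, matching the precision used in the table.
improvements = {
    metric: round(relative_improvement(base, ours), 1)
    for metric, (base, ours) in scores.items()
}

for metric, pct in improvements.items():
    print(f"{metric}: {pct:+.1f}%")
```

Note that the change is relative, not absolute: the overall score rises by 0.317 points, which is 91.9% of the 0.345 baseline.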

	---

	## 🔤 Tokenizer Efficiency

	We optimized the tokenizer specifically for Uzbek, lowering the fertility rate from 3.44 tokens per word (original Qwen3-4B) to 2.67. A lower fertility rate means fewer tokens per word, which shortens sequences, speeds up inference, and lowers cost.

	### Fertility Rate Comparison

	| Model | Fertility Rate | Std Dev | Vocab Size | Improvement vs Qwen3 |
	|-------|:--------------:|:-------:|:----------:|:--------------------:|
	| NeuronAI-Uzbek (Ours) 🏆 | 2.67 | 0.15 | 180,000 | +22.5% |
	| Gemma 2-9B | 3.15 | 0.22 | 256,000 | +8.3% |
	| LLaMA 3.1-8B | 3.32 | 0.22 | 128,256 | +3.7% |
	| DeepSeek-V3 | 3.32 | 0.21 | 128,815 | +3.4% |
	| Qwen3-4B (Original) | 3.44 | 0.22 | 151,669 | - |

	> Fertility Rate: Average number of tokens per word. Lower is better for efficiency.
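
As a rough illustration of the definition above, fertility can be measured as total tokens divided by total words over a sample of text. The sketch below is not the script behind the table: splitting words on whitespace is an assumption, and `toy_tokenize` is a hypothetical stand-in so the example runs without downloading a model.

```python
from typing import Callable, Iterable, List


def fertility_rate(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-separated word.

    Splitting words on whitespace is an assumption; the measurement behind
    the table may count words differently.
    """
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("no words found in the input texts")
    return total_tokens / total_words


def toy_tokenize(text: str, piece_len: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cut each word into 3-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + piece_len] for i in range(0, len(word), piece_len))
    return pieces


sample = ["Salom dunyo", "Kitob o'qish foydali"]
print(fertility_rate(sample, toy_tokenize))  # 11 tokens / 5 words = 2.2
```

To measure a real tokenizer, pass the `tokenize` method of a tokenizer loaded with `AutoTokenizer.from_pretrained(...)` in place of `toy_tokenize`.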

	<div align="center">
	<img src="assets/fertility_comparison_chart.png" alt="Tokenizer Fertility Rate Comparison" width="700"/>
	</div>

	### What This Means

	- 22.5% fewer tokens needed to represent Uzbek text
	- Faster inference due to shorter sequences
	- Lower API costs when deployed
	- Better context utilization: more content fits in the same context window
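
To make the context-window point concrete, the sketch below estimates how many Uzbek words fit in one 32,768-token window at each fertility rate. It uses only the rounded rates from the comparison table, so the token reduction it prints (about 22.4%) differs slightly from the reported 22.5%, which was presumably computed from unrounded rates. All names are illustrative.

```python
CONTEXT_LENGTH = 32_768  # tokens, from the Model Details table

# Fertility rates (tokens per word) from the comparison table.
FERTILITY = {"Qwen3-4B": 3.44, "NeuronAI-Uzbek": 2.67}


def words_per_context(fertility: float, context_length: int = CONTEXT_LENGTH) -> int:
    """Approximate number of words that fit in one context window."""
    return int(context_length / fertility)


capacity = {name: words_per_context(rate) for name, rate in FERTILITY.items()}

# Fraction of tokens saved when the same text is encoded at the lower rate.
token_reduction = 1 - FERTILITY["NeuronAI-Uzbek"] / FERTILITY["Qwen3-4B"]

print(capacity)
print(f"token reduction from rounded rates: {token_reduction:.1%}")
```

Under these assumptions the window holds roughly 12,272 words with the optimized tokenizer versus roughly 9,525 with the original one.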

	---

	## 🛠️ Model Details

	### Architecture

	| Property | Value |
	|----------|-------|
	| Base Model | Qwen3-4B |
	| Parameters | 4 Billion |
	| Vocabulary Size | 180,000 tokens |
	| Context Length | 32,768 tokens |
	| Architecture | Transformer (Decoder-only) |
	| Precision | BFloat16 |

	### Training Methodology

	1. Tokenizer Surgery: Extended vocabulary with 40,000 Uzbek-optimized tokens
	2. Embedding Initialization: Semantic initialization using subword composition
	3. Continual Pretraining: Trained on a 2B-token corpus of Uzbek and English text
	4. Instruction Fine-tuning: Aligned using Uzbek and English instruction datasets
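
Step 2 above is described only as "semantic initialization using subword composition". One common way to do this, shown here as an assumption rather than the exact method used for this model, is to initialize each new token's embedding as the mean of the embeddings of the pieces the old tokenizer splits it into. The toy vocabulary, the `old_tokenize` helper, and the 2-dimensional vectors are all hypothetical.

```python
from typing import Callable, Dict, List

Vector = List[float]


def init_new_embedding(
    new_token: str,
    old_tokenize: Callable[[str], List[str]],
    old_embeddings: Dict[str, Vector],
) -> Vector:
    """Initialize a new token's embedding as the mean of its old subword embeddings."""
    pieces = old_tokenize(new_token)
    if not pieces:
        raise ValueError(f"old tokenizer produced no pieces for {new_token!r}")
    vectors = [old_embeddings[piece] for piece in pieces]
    dim = len(vectors[0])
    # Average the piece vectors dimension by dimension.
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dim)]


# Toy example with a hypothetical 2-dimensional embedding table.
old_embeddings = {"kitob": [1.0, 0.0], "lar": [0.0, 1.0]}


def old_tokenize(token: str) -> List[str]:
    """Hypothetical old tokenizer that splits 'kitoblar' into 'kitob' + 'lar'."""
    return ["kitob", "lar"] if token == "kitoblar" else [token]


print(init_new_embedding("kitoblar", old_tokenize, old_embeddings))  # [0.5, 0.5]
```

Starting new tokens near the meaning of their old pieces, instead of at random vectors, generally gives continual pretraining a better starting point.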

	### Training Data

	| Dataset | Type | Purpose |
	|---------|------|---------|
	| Uzbek Web Corpus | Pretraining | Language modeling |
	| behbudiy/alpaca-cleaned-uz | SFT | Uzbek instructions |
	| NeuronUz/uzbek-spelling-mcq | SFT | Benchmark-targeted training |
	| vicgalle/alpaca-gpt4 | SFT | English capability retention |

	---

	## 🚀 Quick Start

	### Installation

	```bash
	pip install "transformers>=4.51.0" torch accelerate
	```

	### Basic Usage

	```python
	from transformers import AutoModelForCausalLM, AutoTokenizer

	model_name = "NeuronUz/NeuronAI-Uzbek"

	# Load the tokenizer and model; device_map="auto" relies on the accelerate package.
	tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
	model = AutoModelForCausalLM.from_pretrained(
	    model_name,
	    torch_dtype="auto",
	    device_map="auto",
	    trust_remote_code=True,
	)

	# Uzbek prompt: "Give brief information about Uzbekistan."
	prompt = "O'zbekiston haqida qisqacha ma'lumot bering."

	messages = [
	    {"role": "user", "content": prompt}
	]

	# Render the conversation with the model's chat template.
	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	)

	inputs = tokenizer(text, return_tensors="pt").to(model.device)
	outputs = model.generate(
	    **inputs,
	    max_new_tokens=512,
	    temperature=0.7,
	    top_p=0.9,
	    do_sample=True,
	)

	# Decode only the newly generated tokens, skipping the prompt.
	response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
	print(response)
	```

	### With Thinking Mode (Chain-of-Thought)

	```python
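	# Reuses `tokenizer` and `model` from the Basic Usage example.
	# Uzbek prompt: "Find 5 natural numbers less than 100 that are divisible by 3."
	messages = [
	    {"role": "user", "content": "5 ta 3 ga bo'linuvchi 100 dan kichik natural sonlarni toping."}
	]

	text = tokenizer.apply_chat_template(
	    messages,
	    tokenize=False,
	    add_generation_prompt=True,
	    enable_thinking=True,  # Enable step-by-step reasoning
	)

	# Generate and decode only the new tokens; reasoning needs a larger token budget.
	inputs = tokenizer(text, return_tensors="pt").to(model.device)
	outputs = model.generate(**inputs, max_new_tokens=1024)
	print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))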
	```
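
When thinking mode is enabled, the decoded output typically contains the reasoning followed by the final answer. The helper below is a minimal sketch for separating the two. It assumes the model keeps the Qwen3 convention of wrapping reasoning in `<think>` and `</think>` tags and that those tags survive decoding; `split_thinking` is a hypothetical name, not part of any library.

```python
from typing import Tuple


def split_thinking(decoded: str, close_tag: str = "</think>") -> Tuple[str, str]:
    """Split decoded output into (reasoning, answer).

    Assumes the reasoning, if present, ends with `close_tag`. When the tag is
    missing, the whole string is treated as the answer.
    """
    head, sep, tail = decoded.rpartition(close_tag)
    if not sep:
        return "", decoded.strip()
    # Drop the opening tag, if any, and surrounding whitespace.
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


example = "<think>\n3, 6, 9, 12, 15 are all divisible by 3.\n</think>\n\n3, 6, 9, 12, 15"
reasoning, answer = split_thinking(example)
print(answer)  # 3, 6, 9, 12, 15
```

Splitting on the last closing tag keeps the helper safe when the reasoning itself mentions the tag, and the fallback covers responses generated with thinking disabled.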

	---

	## 📈 Use Cases

	NeuronAI-Uzbek is intended for:

	- 📝 Text Generation: Creative writing, content creation in Uzbek
	- ❓ Question Answering: Answering questions about Uzbek culture, history, and general knowledge
	- 📚 Reading Comprehension: Understanding and analyzing Uzbek texts
	- 🔤 Grammar & Spelling: Uzbek language correctness tasks
	- 🌐 Translation Assistance: Uzbek-English language tasks
	- 💬 Conversational AI: Building Uzbek chatbots and assistants

	---

	## ⚠️ Limitations

	- Knowledge Cutoff: The model has no knowledge of events after its training data was collected
	- Hallucinations: May generate plausible-sounding but incorrect information
	- Bias: May reflect biases present in training data
	- Not for Critical Applications: Should not be used for medical, legal, or safety-critical applications without human oversight

	---

	## 📜 License

	This model is released under the [Apache 2.0 License](https://www.apache.org/licenses/LICENSE-2.0).

	---

	## 🙏 Acknowledgments

	- Qwen Team at Alibaba for the excellent Qwen3-4B base model
	- UzLiB Benchmark creators for the comprehensive evaluation framework
	- Uzbek NLP Community for datasets and linguistic resources

	---

	## 📖 Citation

	```bibtex
	@misc{neuronai-uzbek-2025,
	title={NeuronAI-Uzbek: An Optimized Language Model for Uzbek},
	author={NeuronAI Team},
	year={2025},
	publisher={Hugging Face},
	url={https://huggingface.co/NeuronUz/NeuronAI-Uzbek}
	}
	```

	---

	<div align="center">

	Built with ❤️ in Uzbekistan by [NeuronUz](https://huggingface.co/NeuronUz)

	</div>