---
language: ar
language_name: Arabic
language_family: arabic
tags:
- wikilangs
- nlp
- tokenizer
- embeddings
- n-gram
- markov
- wikipedia
- feature-extraction
- sentence-similarity
- tokenization
- n-grams
- markov-chain
- text-mining
- fasttext
- babelvec
- vocabulous
- vocabulary
- monolingual
- family-arabic
license: mit
library_name: wikilangs
pipeline_tag: text-generation
datasets:
- omarkamali/wikipedia-monthly
dataset_info:
  name: wikipedia-monthly
  description: Monthly snapshots of Wikipedia articles across 300+ languages
metrics:
- name: best_compression_ratio
  type: compression
  value: 4.347
- name: best_isotropy
  type: isotropy
  value: 0.8111
- name: best_alignment_r10
  type: alignment
  value: 0.7660
- name: vocabulary_size
  type: vocab
  value: 986324
generated: 2026-03-04
---

# Arabic — Wikilangs Models

Open-source tokenizers, n-gram & Markov language models, vocabulary stats, and word embeddings trained on **Arabic** Wikipedia by [Wikilangs](https://wikilangs.org).

🌐 [Language Page](https://wikilangs.org/languages/ar/) · 🎮 [Playground](https://wikilangs.org/playground/?lang=ar) · 📊 [Full Research Report](RESEARCH_REPORT.md)

## Language Samples

Example sentences drawn from the Arabic Wikipedia corpus:

> تصغير K \ كي \ هو الحرف الحادي العشر في الأبجدية The Oxford English Dictionary, 2nd ed., online ويمثل هذا الحرف الصوت الطبقي الوقفي المهموس في الكيمياء، يرمز K لعنصر البوتاسيوم مراجع لاتينية

> : إحدى ولايات الولايات المتحدة الأمريكية. مدينة نيويورك: أكبر مدن الولايات المتحدة الأمريكية وإحدى أكبرها في العالم. مقاطعة نيويورك: إحدى مقاطعات ولاية نيويورك. توضيح أسماء أماكن

> أبو إبراهيم الفارابي أديب نحوي لغوي أبو نصر محمد الفارابي فيلسوف مشائي مسلم وطبيب

> إسحاق نيوتن عالم إنجليزي نيوتن وحدة قياس القوة. ذكور إنجليزية توضيح أسماء أماكن

> بوتان (مملكة) بوتان مملكة في جبال الهمالايا بين الهند والصين. بوتان (كيمياء) أحد الألكانات، يتكون من أربع ذرات كربون.

## Quick Start

### Load the Tokenizer

```python
import sentencepiece as spm

sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")

text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
tokens = sp.EncodeAsPieces(text)
ids = sp.EncodeAsIds(text)
print(tokens)  # subword pieces
print(ids)     # integer ids

# Decode back
print(sp.DecodeIds(ids))
```
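To relate the tokenizer to the compression figures in the Metrics Summary below, the sketch that follows measures characters per token on a single sentence. This is only a rough, single-sentence estimate; the reported corpus-level ratios are computed by the Wikilangs pipeline over far more text (see the research report for the exact definition).

```python
import sentencepiece as spm

# Rough single-sentence chars-per-token estimate; the corpus-level
# compression numbers below are averaged over much more text.
sp = spm.SentencePieceProcessor()
sp.Load("ar_tokenizer_32k.model")

text = "استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند"
pieces = sp.EncodeAsPieces(text)
print(f"{len(text) / len(pieces):.2f} chars/token ({len(pieces)} tokens)")
```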
<details>
<summary>Tokenization examples (click to expand)</summary>

**Sample 1:** `استوديوهات أفلام والت ديزني أفلام والت ديزني منتجع والت ديزني العالمي ديزني لاند…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁است ودي وه ات ▁أفلام ▁والت ▁دي ز ني ▁أفلام … (+22 more)` | 32 |
| 16k | `▁است ودي وهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منت … (+10 more)` | 20 |
| 32k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |
| 64k | `▁استوديوهات ▁أفلام ▁والت ▁ديزني ▁أفلام ▁والت ▁ديزني ▁منتجع ▁والت ▁ديزني … (+7 more)` | 17 |

**Sample 2:** `باسكال قد تعني: الباسكال، وحدة قياس الضغط لغة باسكال، لغة برمجة الفيلسوف باسكال،…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁با سك ال ▁قد ▁تعني : ▁البا سك ال ، … (+29 more)` | 39 |
| 16k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+18 more)` | 28 |
| 32k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |
| 64k | `▁باسكال ▁قد ▁تعني : ▁الباسك ال ، ▁وحدة ▁قياس ▁الضغط … (+15 more)` | 25 |

**Sample 3:** `جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا. جمهورية الكونغو، عاص…`

| Vocab | Tokens | Count |
|-------|--------|-------|
| 8k | `▁جمهورية ▁الكون غو ▁الديمقراطية ، ▁ز ائ ير ▁سابق ًا … (+21 more)` | 31 |
| 16k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁ز ائ ير ▁سابقًا ، ▁عاصمتها … (+16 more)` | 26 |
| 32k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائ ير ▁سابقًا ، ▁عاصمتها ▁كينشاسا … (+12 more)` | 22 |
| 64k | `▁جمهورية ▁الكونغو ▁الديمقراطية ، ▁زائير ▁سابقًا ، ▁عاصمتها ▁كينشاسا . … (+10 more)` | 20 |

</details>
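The per-vocab token counts in the tables above can be reproduced with a short loop. A minimal sketch, assuming the 8k/16k/64k models follow the same `ar_tokenizer_{size}.model` naming pattern as the 32k file shipped in this repo (adjust to the actual filenames):

```python
import sentencepiece as spm

# Assumed filename pattern for the non-32k vocab sizes.
text = "جمهورية الكونغو الديمقراطية، زائير سابقًا، عاصمتها كينشاسا."
for size in ("8k", "16k", "32k", "64k"):
    sp = spm.SentencePieceProcessor()
    sp.Load(f"ar_tokenizer_{size}.model")
    pieces = sp.EncodeAsPieces(text)
    # Larger vocabularies segment the same text into fewer pieces
    print(f"{size:>3}: {len(pieces):2d} tokens  {pieces[:6]} ...")
```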
### Load Word Embeddings

```python
from gensim.models import KeyedVectors

# Aligned embeddings (cross-lingual, mapped to English vector space)
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")

# Query with any in-vocabulary Arabic word; "مدينة" ("city") is one example
similar = wv.most_similar("مدينة", topn=5)
for word, score in similar:
    print(f"  {word}: {score:.3f}")
```

### Load N-gram Model

```python
import pyarrow.parquet as pq

# Word-level 3-gram counts stored as a Parquet table
df = pq.read_table("ar_3gram_word.parquet").to_pandas()
print(df.head())
```

## Models Overview

![Performance Dashboard](visualizations/performance_dashboard.png)

| Category | Assets |
|----------|--------|
| Tokenizers | BPE at 8k, 16k, 32k, 64k vocab sizes |
| N-gram models | 2 / 3 / 4 / 5-gram (word & subword) |
| Markov chains | Context 1–5 (word & subword) |
| Embeddings | 32d, 64d, 128d — mono & aligned |
| Vocabulary | Full frequency list + Zipf analysis |
| Statistics | Corpus & model statistics JSON |

## Metrics Summary

| Component | Model | Key Metric | Value |
|-----------|-------|------------|-------|
| Tokenizer | 8k BPE | Compression | 3.25x |
| Tokenizer | 16k BPE | Compression | 3.65x |
| Tokenizer | 32k BPE | Compression | 4.03x |
| Tokenizer | 64k BPE | Compression | 4.35x 🏆 |
| N-gram | 2-gram (subword) | Perplexity | 426 🏆 |
| N-gram | 2-gram (word) | Perplexity | 359,826 |
| N-gram | 3-gram (subword) | Perplexity | 4,163 |
| N-gram | 3-gram (word) | Perplexity | 775,988 |
| N-gram | 4-gram (subword) | Perplexity | 27,277 |
| N-gram | 4-gram (word) | Perplexity | 1,494,234 |
| N-gram | 5-gram (subword) | Perplexity | 133,736 |
| N-gram | 5-gram (word) | Perplexity | 1,059,510 |
| Markov | ctx-1 (subword) | Predictability | 0.0% |
| Markov | ctx-1 (word) | Predictability | 0.0% |
| Markov | ctx-2 (subword) | Predictability | 17.3% |
| Markov | ctx-2 (word) | Predictability | 67.4% |
| Markov | ctx-3 (subword) | Predictability | 29.5% |
| Markov | ctx-3 (word) | Predictability | 89.5% |
| Markov | ctx-4 (subword) | Predictability | 35.2% |
| Markov | ctx-4 (word) | Predictability | 96.5% 🏆 |
| Vocabulary | full | Size | 986,324 |
| Vocabulary | full | Zipf R² | 0.9920 |
| Embeddings | mono_32d | Isotropy | 0.8111 |
| Embeddings | mono_64d | Isotropy | 0.7841 |
| Embeddings | mono_128d | Isotropy | 0.7556 |
| Embeddings | aligned_32d | Isotropy | 0.8111 🏆 |
| Embeddings | aligned_64d | Isotropy | 0.7841 |
| Embeddings | aligned_128d | Isotropy | 0.7556 |
| Alignment | aligned_32d | R@1 / R@5 / R@10 | 13.4% / 35.0% / 48.6% |
| Alignment | aligned_64d | R@1 / R@5 / R@10 | 28.6% / 54.0% / 65.6% |
| Alignment | aligned_128d | R@1 / R@5 / R@10 | 37.2% / 65.0% / 76.6% 🏆 |

📊 **[Full ablation study, per-model breakdowns, and interpretation guide →](RESEARCH_REPORT.md)**

---

## About

Trained on [wikipedia-monthly](https://huggingface.co/datasets/omarkamali/wikipedia-monthly) — monthly snapshots of 300+ Wikipedia languages.
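As a final worked example, the isotropy scores in the Metrics Summary can be approximated directly from the shipped vectors. A minimal sketch using a common partition-function proxy (Mu et al., 2018); the report's exact definition may differ, so treat this as an illustration rather than the pipeline's metric:

```python
import numpy as np
from gensim.models import KeyedVectors

# Sketch only: a common isotropy proxy, not necessarily the
# definition used by the Wikilangs pipeline.
wv = KeyedVectors.load("ar_embeddings_128d_aligned.kv")
W = wv.vectors[:50000].astype(np.float64)  # subsample for speed
W -= W.mean(axis=0)                        # center the space

# Z(c) = sum_w exp(c . w), evaluated at the right singular vectors;
# a min/max ratio of 1.0 would mean a perfectly isotropic space.
_, _, Vt = np.linalg.svd(W, full_matrices=False)
Z = np.exp(W @ Vt.T).sum(axis=0)
print(f"isotropy ~ {Z.min() / Z.max():.4f}")
```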
A project by **[Wikilangs](https://wikilangs.org)** · Maintainer: [Omar Kamali](https://omarkamali.com) · [Omneity Labs](https://omneitylabs.com)

### Citation

```bibtex
@misc{wikilangs2025,
  author = {Kamali, Omar},
  title = {Wikilangs: Open NLP Models for Wikipedia Languages},
  year = {2025},
  doi = {10.5281/zenodo.18073153},
  publisher = {Zenodo},
  url = {https://huggingface.co/wikilangs},
  institution = {Omneity Labs}
}
```

### Links

- 🌐 [wikilangs.org](https://wikilangs.org)
- 🌐 [Language page](https://wikilangs.org/languages/ar/)
- 🎮 [Playground](https://wikilangs.org/playground/?lang=ar)
- 🤗 [HuggingFace models](https://huggingface.co/wikilangs)
- 📊 [wikipedia-monthly dataset](https://huggingface.co/datasets/omarkamali/wikipedia-monthly)
- 👤 [Omar Kamali](https://huggingface.co/omarkamali)
- 🤝 Sponsor: [Featherless AI](https://featherless.ai)

**License:** MIT — free for academic and commercial use.

---

*Generated by Wikilangs Pipeline · 2026-03-04 13:56:39*