AI & ML interests

Tajik language, Persian NLP, low-resource languages, machine translation, language models, linguistic resources


Organization Card

TajikNLPWorld – Persian NLP & Low‑Resource Languages Research Hub


TajikNLPWorld is a collaborative research initiative dedicated to advancing natural language processing for Tajik, Persian/Dari, and other low‑resource languages in the Iranian language family. We develop language models, machine translation systems, linguistic resources, and educational tools to support under‑represented languages in the digital age.


🎯 Our Mission

  • Build open‑source language models for Tajik and other Persian varieties (Dari, Hazaragi).
  • Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
  • Advance machine translation between Tajik, Persian, Dari, and major world languages.
  • Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
  • Foster a community of researchers, developers, and native speakers working together on language technology.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

  • 🔤 Language Models
  • 🌐 Machine Translation
  • 📚 Linguistic Tools
  • 📊 Data & Benchmarks

Click on any demo to start experimenting – no installation required!


🧠 Research Focus Areas

🏔️ Tajik Language Technologies

  • Creation of the first large‑scale pretrained models for Tajik (both Cyrillic and Perso-Arabic scripts).
  • Morphological disambiguation and syntactic parsing for Tajik.
  • Speech recognition and synthesis for Tajik (coming soon).
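
Because Tajik text appears in both Cyrillic and Perso-Arabic scripts, a corpus pipeline usually needs to know which script a given line uses before it can route that line to the right tokenizer or model. The snippet below is a minimal sketch of such a check, using only the Python standard library. The function name `detect_script` and its four labels are illustrative assumptions, not part of any TajikNLPWorld release.

```python
import unicodedata
from collections import Counter


def detect_script(text: str) -> str:
    """Classify text as 'cyrillic', 'perso-arabic', 'mixed', or 'unknown'.

    Hypothetical helper: it looks only at the Unicode names of letters.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace, combining marks
        name = unicodedata.name(ch, "")
        if name.startswith("CYRILLIC"):
            counts["cyrillic"] += 1
        elif name.startswith("ARABIC"):
            counts["perso-arabic"] += 1
        # letters from any other script (e.g. Latin) are not counted
    if not counts:
        return "unknown"
    if len(counts) == 2:
        return "mixed"
    return next(iter(counts))  # the single script that was seen
```

Calling `detect_script` on a Cyrillic Tajik sentence returns `"cyrillic"`, and on the same sentence written in Perso-Arabic script it returns `"perso-arabic"`. A production pipeline would probably want a majority threshold instead of the strict "mixed" rule used here, since quoted foreign words are common in news text.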

📜 Persian & Iranian NLP

  • Cross‑dialectal transfer learning among Tajik, Dari, and Iranian Persian.
  • Unified tokenization and subword models for the Persian language continuum.
  • Machine translation between Persian varieties and major languages.
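
One practical obstacle to unified tokenization is that Perso-Arabic text from different sources mixes Arabic and Persian code points for what readers perceive as the same letter, which fragments a subword vocabulary. The sketch below shows one common normalization step under that assumption: it maps Arabic yeh and kaf to their Persian counterparts and removes the tatweel (elongation) character. The name `normalize_persian` and the choice of exactly these three rules are illustrative, not a description of the project's actual preprocessing.

```python
import unicodedata

# Code points that commonly stand in for Persian letters in web text.
_ARABIC_TO_PERSIAN = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> ARABIC LETTER KEHEH
    "\u0640": None,      # ARABIC TATWEEL (purely typographic), removed
})


def normalize_persian(text: str) -> str:
    """Apply NFC composition, then unify yeh/kaf variants and drop tatweel."""
    return unicodedata.normalize("NFC", text).translate(_ARABIC_TO_PERSIAN)
```

The function is idempotent, so it can be applied safely to text that has already been normalized. Running it before training a subword model means that spellings differing only in these code points share the same vocabulary entries.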

📉 Low‑Resource NLP

  • Data augmentation and semi‑supervised learning techniques.
  • Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
  • Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.

🤖 Language Models

  • Pretraining from scratch and continued pretraining on Persian/Tajik corpora.
  • Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
  • Evaluation and bias analysis of Persian language models.

📖 Linguistic Resources

  • Corpora: News, literature, web‑crawled texts, social media.
  • Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
  • Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.

📦 Models & Datasets

We release all our models and datasets on the Hugging Face Hub under open licenses.

| Model / Dataset | Description | Link |
| --- | --- | --- |
| TajikBERT | BERT‑base model pretrained on Tajik Cyrillic and Perso-Arabic texts | 🤗 Hub |
| Persian‑mT5 | Multilingual T5 fine‑tuned on Tajik, Persian, and Dari | 🤗 Hub |
| Tajik‑MT‑TgFa | Transformer‑based translation model (Tajik ↔ Persian/Farsi) | 🤗 Hub |
| Tajik‑NER | Named entity recognition model for Tajik | 🤗 Hub |
| TajikCorpus v1.0 | 150M token corpus from news, books, and websites (bilingual scripts) | 🤗 Dataset |
| Persian‑Dialect‑Bench | Parallel sentences for Tajik, Dari, and Iranian Persian | 🤗 Dataset |

More models and datasets are added regularly. Follow our organization page for updates.


📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

  • Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
  • Video Lectures – Recorded talks on Persian/Tajik NLP, data collection, and model training.
  • Course Materials – Slides, readings, and assignments from our university courses.
  • Blog Posts – Deep dives into challenges and solutions for Tajik and Persian languages.

📝 Selected Publications

  1. "TajikBERT: A Pretrained Language Model for Tajik in Cyrillic and Perso-Arabic Scripts" – LREC 2024
  2. "Bridging Dialects: Machine Translation for Tajik, Dari, and Persian" – WMT 2023
  3. "Building a Named Entity Recognition Dataset for Tajik" – IranNLP 2023
  4. "Multilingual Representations for Iranian Languages: A Comparative Study" – EMNLP 2022
  5. "Tajik Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2022

The full list, with links to PDFs, is available on our Publications Page.


🤝 Get Involved

We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

For Researchers

  • Use our models and datasets in your work (and cite us!).
  • Collaborate on joint papers and grant proposals.
  • Contribute new benchmarks or evaluation tasks.

For Developers

  • Integrate our models into your applications.
  • Report bugs or suggest improvements via GitHub Issues.
  • Submit pull requests to our open‑source repositories.

For Native Speakers & Linguists

  • Help us validate translations and annotations.
  • Share texts or corpora (with permission) to enrich our data.
  • Provide feedback on model outputs to reduce errors.

For Students

  • Use our demos and tutorials for learning.
  • Participate in our mentorship program or summer schools.
  • Start your own research project with our support.

🌐 Connect With Us


🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:

  • Models on the Hub with easy‑to‑use pipelines.
  • Datasets with streaming and evaluation scripts.
  • Spaces for interactive demos and educational tools.
  • Gradio apps for user‑friendly interfaces.
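
As a rough illustration of what pipeline and streaming access look like in practice, the sketch below wraps two standard Hugging Face calls: `transformers.pipeline` for a masked language model and `datasets.load_dataset(..., streaming=True)` for a dataset. The helper names `load_fill_mask` and `stream_examples` are hypothetical, and no repository ID is assumed here: pass whatever ID is listed on the organization page. Both helpers need the `transformers` and `datasets` packages and network access when called.

```python
from itertools import islice


def load_fill_mask(model_id: str):
    """Build a fill-mask pipeline for a masked language model on the Hub.

    `model_id` is supplied by the caller; this sketch does not assume
    any particular repository exists.
    """
    from transformers import pipeline  # lazy import: only needed when called
    return pipeline("fill-mask", model=model_id)


def stream_examples(dataset_id: str, limit: int = 3, split: str = "train"):
    """Yield the first `limit` records of a Hub dataset without a full download."""
    from datasets import load_dataset  # lazy import: only needed when iterated
    stream = load_dataset(dataset_id, split=split, streaming=True)
    yield from islice(stream, limit)
```

Streaming is the relevant design choice for large corpora: records are fetched as they are iterated, so a quick inspection of a few examples does not require downloading the whole dataset.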

Empowering Tajik and Persian languages through open science and community collaboration.

Hugging Face GitHub Twitter

© 2026 TajikNLPWorld – Open source for low‑resource languages.
