AI & ML interests

Ossetian language, Iranian languages NLP, low-resource languages, computational linguistics, machine translation, speech technologies, language models, corpus creation, digital language preservation

OssetianNLP – Iranian & Low‑Resource Languages Research Hub

OssetianNLP is a collaborative research initiative dedicated to advancing natural language processing for Ossetian (Ирон æвзаг), Iranian languages, and low‑resource languages of the Caucasus and beyond. We develop state‑of‑the‑art language models, machine translation systems, linguistic resources, and educational tools to empower under‑represented languages in the digital age.


🎯 Our Mission

  • Build open‑source language models for Ossetian and other Iranian languages (Persian, Tajik, Pashto, Kurdish).
  • Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks) for Ossetian in both Cyrillic and Latin scripts.
  • Advance machine translation between Ossetian, Russian, English, and related Iranian languages.
  • Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
  • Preserve and digitize the Ossetian language and cultural heritage through language technology.
  • Foster a community of researchers, developers, and native speakers working together on language technology.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

🌐 Machine Translation

📚 Linguistic Tools

🗣️ Speech Technologies

📊 Data & Benchmarks

Click on any demo to start experimenting – no installation required!


🧠 Research Focus Areas

⛰️ Ossetian Language Technologies

  • Creation of the first large‑scale pretrained models for Ossetian (Cyrillic script).
  • Morphological disambiguation and syntactic parsing for Ossetian (complex agglutinative morphology with Iranian roots).
  • Handling dialectal variation between Iron (standard) and Digor.
  • Speech recognition and synthesis for both dialects.

📜 Iranian NLP

  • Cross‑lingual transfer learning among Ossetian, Persian, Tajik, and Pashto.
  • Unified tokenization and subword models for the Eastern and Western Iranian language branches.
  • Machine translation between Iranian languages and major world languages.
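
The subword-model item above does not name an algorithm. As one common option, byte-pair encoding (BPE) can be sketched in a few lines of plain Python. The function names and the toy corpus (two words taken from this card) are illustrative assumptions, not the project's actual tokenizer.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    out = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` merge rules from a list of words."""
    # Every word starts as a tuple of single characters; repeated words add up.
    vocab = Counter(tuple(word) for word in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for left, right in zip(symbols, symbols[1:]):
                pairs[(left, right)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Split a word into subwords by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy corpus (assumption): the frequent word ends up as a single subword.
toy_corpus = ["ирон", "ирон", "æвзаг"]
toy_merges = learn_bpe(toy_corpus, num_merges=3)
print(segment("ирон", toy_merges), segment("æвзаг", toy_merges))
```

A shared subword vocabulary across related languages would, under this sketch, simply mean learning the merges from a combined corpus. How to balance the languages within that corpus is a separate design choice that the sketch does not address.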

🏔️ Caucasian Language Context

  • Leveraging Ossetian as a bridge between Iranian and Caucasian language families.
  • Multilingual models for the Caucasus region (Ossetian, Georgian, Chechen, Abkhaz, etc.).
  • Areal linguistics and language contact phenomena.

📉 Low‑Resource NLP

  • Data augmentation and semi‑supervised learning techniques.
  • Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
  • Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.

🤖 Language Models

  • Pretraining from scratch and continued pretraining on Ossetian/Iranian corpora.
  • Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
  • Evaluation and bias analysis of language models for Iranian languages.

📖 Linguistic Resources

  • Corpora: News (Iron.fm, Ossetia.ru), literature (Kosta Khetagurov, Nart sagas), web‑crawled texts, social media (VK, Telegram).
  • Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
  • Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.
  • Parallel Data: Ossetian‑Russian, Ossetian‑English, Ossetian‑Persian aligned texts.
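
Web-crawled and social-media text in Ossetian often mixes two code points for the same letter: Latin æ (U+00E6) and Cyrillic ӕ (U+04D5). Unless they are unified, identical words are counted as different tokens. The sketch below shows one possible normalization step for corpus building. Mapping to the Cyrillic code point is an assumption made here for illustration (the opposite direction works equally well if applied consistently), and the function names are hypothetical.

```python
import unicodedata
from collections import Counter

# Assumption: unify on the Cyrillic ligature; the reverse mapping is equally valid.
AE_MAP = str.maketrans({"\u00e6": "\u04d5", "\u00c6": "\u04d4"})


def normalize_ossetian(text):
    """Apply NFC, unify the two 'ae' code points, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(AE_MAP)
    return " ".join(text.split())


def token_counts(lines):
    """Count lowercased whitespace-separated tokens after normalization."""
    counts = Counter()
    for line in lines:
        counts.update(normalize_ossetian(line).lower().split())
    return counts


# The same phrase typed with the Latin and the Cyrillic letter collapses to one entry per word.
sample = ["Ирон \u00e6взаг", "ирон \u04d5взаг"]
print(token_counts(sample))
```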

📦 Models & Datasets

We release all our models and datasets on the Hugging Face Hub under open licenses.

Model / Dataset | Description | Link
OssetianBERT | BERT‑base model pretrained on Ossetian Cyrillic texts (news, literature, web) | 🤗 Hub
Iranian‑mT5 | Multilingual T5 fine‑tuned on Ossetian, Persian, and Tajik | 🤗 Hub
Ossetian‑MT‑OsRu | Transformer‑based translation model (Ossetian ↔ Russian) | 🤗 Hub
Ossetian‑NER | Named entity recognition model for Ossetian | 🤗 Hub
Ossetian‑TTS | Text‑to‑speech model for Iron dialect | 🤗 Hub
OssetianCorpus v1.0 | 50M token corpus from news, books, Nart sagas, and websites | 🤗 Dataset
Ossetian‑Russian‑Parallel | 500k sentence pairs for machine translation | 🤗 Dataset
NartEpic‑Dataset | Annotated Nart sagas with character names and locations | 🤗 Dataset

More models and datasets are added regularly. Follow our organization page for updates.


📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

  • Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries) with Ossetian examples.
  • Video Lectures – Recorded talks on Ossetian/Iranian NLP, data collection, and model training (in Ossetian, Russian, and English).
  • Course Materials – Slides, readings, and assignments from our university courses.
  • Blog Posts – Deep dives into challenges and solutions for Ossetian and Iranian languages.
  • Ossetian NLP Handbook – Comprehensive guide to processing the Ossetian language (coming soon).

📝 Selected Publications

  1. "OssetianBERT: A Pretrained Language Model for Ossetian in Cyrillic Script" – LREC 2025 (planned)
  2. "Low-Resource Machine Translation for Ossetian: Challenges and Solutions" – WMT 2024
  3. "Building the First Named Entity Recognition Dataset for Ossetian" – IranNLP 2024
  4. "Handling Dialectal Variation in Ossetian ASR: Iron vs. Digor" – Interspeech 2024
  5. "Ossetian Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2023
  6. "Nart Epic as a Linguistic Resource: Named Entities in Ossetian Folklore" – Digital Humanities 2023

A full list with links to PDFs is available on our Publications Page.


🤝 Get Involved

We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

For Researchers

  • Use our models and datasets in your work (and cite us!).
  • Collaborate on joint papers and grant proposals.
  • Contribute new benchmarks or evaluation tasks.
  • Study Ossetian from computational and linguistic perspectives.

For Developers

  • Integrate our models into your applications.
  • Report bugs or suggest improvements via GitHub Issues.
  • Submit pull requests to our open‑source repositories.
  • Help build mobile apps and tools for Ossetian speakers.

For Native Speakers & Linguists

  • Help us validate translations and annotations.
  • Share texts or corpora (with permission) to enrich our data.
  • Provide feedback on model outputs to reduce errors.
  • Contribute knowledge about Iron/Digor dialect differences.
  • Record speech samples for TTS and ASR datasets.

For Students

  • Use our demos and tutorials for learning.
  • Participate in our mentorship program or summer schools.
  • Start your own research project with our support.
  • Join our student ambassador program.

For Cultural Organizations

  • Partner with us on digital preservation projects.
  • Provide access to digitized texts and archives.
  • Help us engage with the Ossetian-speaking community.

🌐 Connect With Us


🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face and scientific Python ecosystems:

  • Models on the Hub with easy‑to‑use pipelines.
  • Datasets with streaming and evaluation scripts.
  • Spaces for interactive demos and educational tools.
  • Gradio apps for user‑friendly interfaces.
  • Integration with libraries like transformers, datasets, speechbrain, espnet.
  • Compatibility with Iranian NLP tools and resources.
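
As a rough illustration of what pipeline-based usage looks like, the sketch below masks one word of a sentence and asks a fill-mask pipeline for candidates. It assumes the public `xlm-roberta-base` checkpoint as a stand-in, since this card does not give repository IDs for the organization's own models. Whether that checkpoint's pretraining data covers Ossetian is not stated here, so treat its output as a baseline at best. The helper names are hypothetical.

```python
def mask_word(sentence, index, mask_token="<mask>"):
    """Replace the word at position `index` with the model's mask token."""
    words = sentence.split()
    if not 0 <= index < len(words):
        raise IndexError("word index out of range")
    words[index] = mask_token
    return " ".join(words)


def top_predictions(masked_text, model_id="xlm-roberta-base", k=5):
    """Return (token, score) pairs for the masked position."""
    # Imported lazily so mask_word can be used without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=model_id)
    return [(r["token_str"], r["score"]) for r in fill(masked_text, top_k=k)]
```

Calling `top_predictions(mask_word("Ирон æвзаг", 1))` downloads the checkpoint on first use. Swapping `model_id` for another masked language model also requires passing that model's own mask token to `mask_word`.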

🌟 Success Stories

  • Ossetian Wikipedia now uses our models for article quality assessment.
  • Local news outlets in North and South Ossetia employ our transcription tools.
  • Schools in Vladikavkaz and Tskhinvali use our educational demos.
  • Researchers from 12 countries have cited our datasets in their work.
  • The Nart epic is now searchable and analyzable with our NLP tools.

Æвзаг — адæмы уды ныфс. Бахайр кæнæм ирон æвзаг цифрон дунейы!
(Language is the soul of a people. Let us preserve the Ossetian language in the digital world!)

Hugging Face GitHub Twitter Telegram

© 2026 OssetianNLP – Open source for low‑resource languages.


📋 Roadmap

Quarter | Milestone
Q1 2026 | Release OssetianBERT‑base and first corpus v1.0
Q2 2026 | Launch MT system Ossetian‑Russian with Gradio demo
Q3 2026 | Add Digor dialect support and dialect identification
Q4 2026 | Release TTS and ASR prototypes
Q1 2027 | Organize first Ossetian NLP workshop

Want to help us achieve these goals? Get involved!


💝 Support Us

OssetianNLP is a community-driven, non-profit initiative. You can support us by:

  • Contributing code, data, or documentation
  • Sharing our resources with your network
  • Translating our materials into Ossetian, Russian, or English
  • Donating computing resources (contact us for details)
  • Becoming a sponsor for workshops and events

Фæрæзтæй дæр, зондæй дæр – алы хуызы æххуыс дæр нын арф кæнæ!
(With strength and with knowledge alike, any help is dear to us!)

models 0

None public yet

datasets 0

None public yet