Ossetian NLP Community / Natural Language Processing for Ossetian Language

AI & ML interests

Ossetian language, Iranian languages NLP, low-resource languages, computational linguistics, machine translation, speech technologies, language models, corpus creation, digital language preservation


OssetianNLP – Iranian & Low‑Resource Languages Research Hub

OssetianNLP is a collaborative research initiative dedicated to advancing natural language processing for Ossetian (Ирон æвзаг), other Iranian languages, and low‑resource languages of the Caucasus and beyond. We develop state‑of‑the‑art language models, machine translation systems, linguistic resources, and educational tools to empower under‑represented languages in the digital age.

🎯 Our Mission

Build open‑source language models for Ossetian and other Iranian languages (Persian, Tajik, Pashto, Kurdish).
Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks) for Ossetian in both Cyrillic and Latin scripts.
Advance machine translation between Ossetian, Russian, English, and related Iranian languages.
Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
Preserve and digitize the Ossetian language and cultural heritage through language technology.
Foster a community of researchers, developers, and native speakers working together on language technology.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

OssetianGPT Playground – Generate and analyze Ossetian text with our latest causal LM.
IranianBERT Explorer – Masked language modelling for Ossetian, Persian, and Tajik.
Multilingual Embeddings – Compare word/sentence vectors across Iranian languages.

🌐 Machine Translation

Ossetian ↔ Russian Translator – Neural translation between Ossetian and Russian (most practical pair).
Ossetian ↔ English Translator – Experimental translation for academic and international use.
Iranian Languages Translation – Translate between Ossetian, Persian, and Tajik.

📚 Linguistic Tools

Ossetian Morphological Analyzer – Interactive segmentation and POS tagging (agglutinative morphology with Iranian roots).
Named Entity Recognition for Ossetian – Identify persons, locations, organizations (including Nart epic characters!).
Script Converter – Convert between Cyrillic Ossetian and historical Latin/Caucasian scripts.

🗣️ Speech Technologies

Ossetian Text‑to‑Speech – Listen to Ossetian text synthesized with our latest TTS models.
Ossetian Speech Recognition – Transcribe spoken Ossetian (Iron and Digor dialects).
Dialect Identifier – Distinguish between Iron and Digor dialects.

📊 Data & Benchmarks

Ossetian Corpus Explorer – Browse and query our curated text collections (literature, news, folklore).
Iranian NLP Leaderboard – Compare model performance on Ossetian, Persian, and other Iranian tasks.
Annotation Tools – Help us improve datasets with your feedback.
Nart Epic Explorer – Search and analyze the Nart sagas in Ossetian with NLP tools.

Click on any demo to start experimenting – no installation required!

🧠 Research Focus Areas

⛰️ Ossetian Language Technologies

Creation of the first large‑scale pretrained models for Ossetian (Cyrillic script).
Morphological disambiguation and syntactic parsing for Ossetian (complex agglutinative morphology with Iranian roots).
Handling dialectal variation between Iron (standard) and Digor.
Speech recognition and synthesis for both dialects.
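
Pretraining on Cyrillic-script Ossetian benefits from consistent character encoding. Ossetian text can encode the letter ӕ either with the Cyrillic code points (U+04D4, U+04D5) or with the visually identical Latin Æ/æ (U+00C6, U+00E6), so the same word may appear as two different strings. The sketch below is an illustrative preprocessing step, not a description of this project's actual pipeline; the function name `normalize_ossetian` and the choice to map Latin to Cyrillic (rather than the reverse) are assumptions.

```python
import unicodedata

# Map the Latin letters to their Cyrillic look-alikes.
# Direction is an assumption: a project could equally standardize on Latin.
LATIN_TO_CYRILLIC = str.maketrans({
    "\u00e6": "\u04d5",  # LATIN SMALL LETTER AE   -> CYRILLIC SMALL LIGATURE A IE
    "\u00c6": "\u04d4",  # LATIN CAPITAL LETTER AE -> CYRILLIC CAPITAL LIGATURE A IE
})


def normalize_ossetian(text: str) -> str:
    """Return text in NFC form with Latin Æ/æ replaced by Cyrillic Ӕ/ӕ."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(LATIN_TO_CYRILLIC)


if __name__ == "__main__":
    latin_variant = "\u00e6взаг"      # word typed with Latin æ
    cyrillic_variant = "\u04d5взаг"   # same word typed with Cyrillic ӕ
    print(latin_variant == cyrillic_variant)
    print(normalize_ossetian(latin_variant) == normalize_ossetian(cyrillic_variant))
```

Applying one such mapping before tokenizer training keeps both spellings from being split into separate vocabulary entries.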

📜 Iranian NLP

Cross‑lingual transfer learning among Ossetian, Persian, Tajik, and Pashto.
Unified tokenization and subword models for the Eastern and Western Iranian language branches.
Machine translation between Iranian languages and major world languages.

🏔️ Caucasian Language Context

Leveraging Ossetian as a bridge between Iranian and Caucasian language families.
Multilingual models for the Caucasus region (Ossetian, Georgian, Chechen, Abkhaz, etc.).
Areal linguistics and language contact phenomena.

📉 Low‑Resource NLP

Data augmentation and semi‑supervised learning techniques.
Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.

🤖 Language Models

Pretraining from scratch and continued pretraining on Ossetian/Iranian corpora.
Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
Evaluation and bias analysis of language models for Iranian languages.

📖 Linguistic Resources

Corpora: News (Iron.fm, Ossetia.ru), literature (Kosta Khetagurov, Nart sagas), web‑crawled texts, social media (VK, Telegram).
Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.
Parallel Data: Ossetian‑Russian, Ossetian‑English, Ossetian‑Persian aligned texts.
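
For the named entity recognition benchmarks listed above, a common way to score a model is entity-level precision, recall, and F1 over BIO-tagged sequences. The following is a minimal sketch under assumptions: the card does not say which tagging scheme or scoring script the benchmarks use, and the helper names `bio_to_spans` and `entity_f1` and the labels PER and LOC are illustrative.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start index, end index exclusive, entity type)


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Convert a BIO tag sequence into a set of typed entity spans."""
    spans: Set[Span] = set()
    start, label = None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel "O" closes a trailing entity
        continues = tag.startswith("I-") and label == tag[2:]
        if continues:
            continue
        if label is not None:
            spans.add((start, i, label))
        start, label = None, None
        if tag.startswith(("B-", "I-")):
            # A stray I- tag with no matching B- is leniently treated as a new entity.
            start, label = i, tag[2:]
    return spans


def entity_f1(gold: List[str], pred: List[str]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1), counting only exact span-and-type matches."""
    gold_spans, pred_spans = bio_to_spans(gold), bio_to_spans(pred)
    true_positives = len(gold_spans & pred_spans)
    precision = true_positives / len(pred_spans) if pred_spans else 0.0
    recall = true_positives / len(gold_spans) if gold_spans else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["B-PER", "I-PER", "O", "B-LOC"]
    pred = ["B-PER", "I-PER", "O", "O"]  # the location is missed
    print(entity_f1(gold, pred))
```

Exact-match scoring is strict: a prediction that gets an entity's boundaries partly right earns no credit, which matters for multi-word names.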

📦 Models & Datasets

We release all our models and datasets on the Hugging Face Hub under open licenses.

Model / Dataset	Description	Link
OssetianBERT	BERT‑base model pretrained on Ossetian Cyrillic texts (news, literature, web)	🤗 Hub
Iranian‑mT5	Multilingual T5 fine‑tuned on Ossetian, Persian, and Tajik	🤗 Hub
Ossetian‑MT‑OsRu	Transformer‑based translation model (Ossetian ↔ Russian)	🤗 Hub
Ossetian‑NER	Named entity recognition model for Ossetian	🤗 Hub
Ossetian‑TTS	Text‑to‑speech model for Iron dialect	🤗 Hub
OssetianCorpus v1.0	50M token corpus from news, books, Nart sagas, and websites	🤗 Dataset
Ossetian‑Russian‑Parallel	500k sentence pairs for machine translation	🤗 Dataset
NartEpic‑Dataset	Annotated Nart sagas with character names and locations	🤗 Dataset

More models and datasets are added regularly. Follow our organization page for updates.

📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries) with Ossetian examples.
Video Lectures – Recorded talks on Ossetian/Iranian NLP, data collection, and model training (in Ossetian, Russian, and English).
Course Materials – Slides, readings, and assignments from our university courses.
Blog Posts – Deep dives into challenges and solutions for Ossetian and Iranian languages.
Ossetian NLP Handbook – Comprehensive guide to processing the Ossetian language (coming soon).

📝 Selected Publications

"OssetianBERT: A Pretrained Language Model for Ossetian in Cyrillic Script" – LREC 2025 (planned)
"Low-Resource Machine Translation for Ossetian: Challenges and Solutions" – WMT 2024
"Building the First Named Entity Recognition Dataset for Ossetian" – IranNLP 2024
"Handling Dialectal Variation in Ossetian ASR: Iron vs. Digor" – Interspeech 2024
"Ossetian Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2023
"Nart Epic as a Linguistic Resource: Named Entities in Ossetian Folklore" – Digital Humanities 2023

Full list with links to PDFs available on our Publications Page.

🤝 Get Involved

We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

For Researchers

Use our models and datasets in your work (and cite us!).
Collaborate on joint papers and grant proposals.
Contribute new benchmarks or evaluation tasks.
Study Ossetian from computational and linguistic perspectives.

For Developers

Integrate our models into your applications.
Report bugs or suggest improvements via GitHub Issues.
Submit pull requests to our open‑source repositories.
Help build mobile apps and tools for Ossetian speakers.

For Native Speakers & Linguists

Help us validate translations and annotations.
Share texts or corpora (with permission) to enrich our data.
Provide feedback on model outputs to reduce errors.
Contribute knowledge about Iron/Digor dialect differences.
Record speech samples for TTS and ASR datasets.

For Students

Use our demos and tutorials for learning.
Participate in our mentorship program or summer schools.
Start your own research project with our support.
Join our student ambassador program.

For Cultural Organizations

Partner with us on digital preservation projects.
Provide access to digitized texts and archives.
Help us engage with the Ossetian-speaking community.

🌐 Connect With Us

🤗 Hugging Face: OssetianNLP – Models, datasets, and spaces.
💻 GitHub: OssetianNLP – Source code, development, and issue tracking.
📧 Email: contact@ossetiannlp.org – General inquiries and collaboration.
🐦 Twitter/X: @OssetianNLP – News and updates (English/Russian).
📘 Facebook: OssetianNLP – Community group for Ossetian speakers.
📱 Telegram: @OssetianNLP – Channel for updates in Russian and Ossetian.
📝 Blog: Medium/OssetianNLP – In‑depth articles.
🎥 YouTube: OssetianNLP Channel – Tutorials and lectures.

🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face and scientific Python ecosystems:

Models on the Hub with easy‑to‑use pipelines.
Datasets with streaming and evaluation scripts.
Spaces for interactive demos and educational tools.
Gradio apps for user‑friendly interfaces.
Integration with libraries such as transformers, datasets, speechbrain, and espnet.
Compatibility with Iranian NLP tools and resources.
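
As a rough illustration of the first two points, the sketch below shows how a Hub model could be queried through a `transformers` fill-mask pipeline and how a Hub dataset could be read in streaming mode with `datasets`. The repository IDs, the `text` field name, and the helper names are placeholders and assumptions: this card names the resources but does not give their Hub paths or schemas, so substitute the IDs that are actually published.

```python
from typing import Iterator, List

# Placeholder repository IDs (assumptions, not confirmed paths).
MODEL_ID = "OssetianNLP/OssetianBERT"
DATASET_ID = "OssetianNLP/OssetianCorpus"


def fill_mask(template: str, model_id: str = MODEL_ID, top_k: int = 5) -> List[str]:
    """Return the top_k predicted tokens for the '{mask}' slot in template."""
    from transformers import pipeline  # lazy import; requires `pip install transformers`

    unmasker = pipeline("fill-mask", model=model_id)
    # Use the tokenizer's own mask token, which differs between model families.
    text = template.format(mask=unmasker.tokenizer.mask_token)
    return [prediction["token_str"] for prediction in unmasker(text, top_k=top_k)]


def stream_texts(
    dataset_id: str = DATASET_ID, text_field: str = "text", limit: int = 5
) -> Iterator[str]:
    """Yield up to `limit` texts without downloading the whole dataset."""
    from datasets import load_dataset  # lazy import; requires `pip install datasets`

    stream = load_dataset(dataset_id, split="train", streaming=True)
    for index, row in enumerate(stream):
        if index >= limit:
            break
        yield row[text_field]  # `text_field` is an assumed column name
```

Both helpers import their library only when called, so the module loads even where `transformers` or `datasets` is not installed; calling them requires network access and a repository that exists.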

🌟 Success Stories

Ossetian Wikipedia now uses our models for article quality assessment.
Local news outlets in North and South Ossetia employ our transcription tools.
Schools in Vladikavkaz and Tskhinvali use our educational demos.
Researchers from 12 countries have cited our datasets in their work.
The Nart epic is now searchable and analyzable with our NLP tools.

Æвзаг — адæмы уды ныфс. Бахайр кæнæм ирон æвзаг цифрон дунейы!
(Language is the soul of a people. Let us preserve the Ossetian language in the digital world!)

📋 Roadmap

Quarter	Milestone
Q1 2026	Release OssetianBERT‑base and first corpus v1.0
Q2 2026	Launch MT system Ossetian‑Russian with Gradio demo
Q3 2026	Add Digor dialect support and dialect identification
Q4 2026	Release TTS and ASR prototypes
Q1 2027	Organize first Ossetian NLP workshop

Want to help us achieve these goals? Get involved!

💝 Support Us

OssetianNLP is a community-driven, non-profit initiative. You can support us by:

Contributing code, data, or documentation
Sharing our resources with your network
Translating our materials into Ossetian, Russian, or English
Donating computing resources (contact us for details)
Becoming a sponsor for workshops and events

Фæрæзтæй дæр, зондæй дæр – алы хуызы æххуыс дæр нын арф кæнæ!
(By strength and by knowledge alike, any help is dear to us!)
