AI & ML interests

Bashkir language, Turkic NLP, low-resource languages, computational linguistics, machine translation, speech technologies, language models, linguistic resources

Organization Card

BashkirNLP – Turkic & Low‑Resource Languages Research Hub

BashkirNLP is a collaborative research initiative dedicated to advancing natural language processing for Bashkir, other Turkic languages, and low‑resource languages in the Ural‑Volga region and beyond. We develop language models, machine translation systems, linguistic resources, and educational tools to support under‑represented languages in the digital age.


🎯 Our Mission

  • Build open‑source language models for Bashkir and other Turkic varieties (Tatar, Kazakh, Chuvash, etc.).
  • Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
  • Advance machine translation between Bashkir, Russian, English, and major Turkic languages.
  • Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
  • Foster a community of researchers, developers, and native speakers working together on language technology.

🚀 Interactive Demos

Explore our live Hugging Face Spaces and try out our models directly in your browser:

🔤 Language Models

🌐 Machine Translation

📚 Linguistic Tools

📊 Data & Benchmarks

Click on any demo to start experimenting – no installation required!


🧠 Research Focus Areas

🏞️ Bashkir Language Technologies

  • Creation of the first large‑scale pretrained models for Bashkir (Cyrillic script, with Latin adaptation).
  • Morphological disambiguation and syntactic parsing for Bashkir (agglutinative morphology).
  • Speech recognition and synthesis for Bashkir (coming soon).

📜 Turkic NLP

  • Cross‑lingual transfer learning among Bashkir, Tatar, Kazakh, and other Kipchak languages.
  • Unified tokenization and subword models for the Turkic language family.
  • Machine translation between Turkic languages and major world languages.
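
One common way to get a unified subword vocabulary is to learn byte‑pair‑encoding (BPE) merges on text pooled from several related languages, so that shared stems and suffixes end up as shared tokens. The sketch below is a minimal pure‑Python illustration of that idea, not the project's actual tokenizer; the function names and the toy word list are assumptions made for the example.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from words pooled across languages."""
    # Each word starts as a tuple of characters plus an end-of-word marker.
    vocab = Counter(tuple(w) + ("</w>",) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[apply_merge(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word) + ("</w>",)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


# Toy pooled word list (illustrative only): related forms of "book".
pooled_words = ["китап", "китаптар", "китаплар", "китапта"]
merges = learn_bpe(pooled_words, num_merges=10)
print(segment("китаплар", merges))
```

Because the merges are learned on the pooled list, a stem that recurs across languages tends to be merged early and reused, which is the property a shared Turkic vocabulary aims for. A production setup would more likely rely on an established tokenizer library and far larger corpora.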

📉 Low‑Resource NLP

  • Data augmentation and semi‑supervised learning techniques.
  • Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
  • Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.

🤖 Language Models

  • Pretraining from scratch and continued pretraining on Bashkir/Turkic corpora.
  • Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
  • Evaluation and bias analysis of Turkic language models.

📖 Linguistic Resources

  • Corpora: News, literature, web‑crawled texts, social media (e.g., VK, Telegram).
  • Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
  • Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.

📦 Models & Datasets

We release all our models and datasets on Hugging Face Hub under open licenses.

| Model / Dataset | Description | Link |
|---|---|---|
| BashkirBERT | BERT‑base model pretrained on Bashkir Cyrillic texts | 🤗 Hub |
| Turkic‑mT5 | Multilingual T5 fine‑tuned on Bashkir, Tatar, and Kazakh | 🤗 Hub |
| Bashkir‑MT‑BaRu | Transformer‑based translation model (Bashkir ↔ Russian) | 🤗 Hub |
| Bashkir‑NER | Named entity recognition model for Bashkir | 🤗 Hub |
| BashkirCorpus v1.0 | 100M token corpus from news, books, and websites | 🤗 Dataset |
| Turkic‑Parallel‑Bench | Parallel sentences for Bashkir, Tatar, and Turkish | 🤗 Dataset |

More models and datasets are added regularly. Follow our organization page for updates.


📚 Educational Resources

We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.

  • Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
  • Video Lectures – Recorded talks on Bashkir/Turkic NLP, data collection, and model training.
  • Course Materials – Slides, readings, and assignments from our university courses.
  • Blog Posts – Deep dives into challenges and solutions for Bashkir and Turkic languages.

📝 Selected Publications

  1. "BashkirBERT: A Pretrained Language Model for Bashkir" – LREC 2025 (planned)
  2. "Machine Translation for Low-Resource Turkic Languages: Bashkir–Russian Case Study" – WMT 2024
  3. "Building a Named Entity Recognition Dataset for Bashkir" – TurkicLang 2024
  4. "Multilingual Representations for Kipchak Languages: A Comparative Study" – EMNLP 2023
  5. "Bashkir Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2023

A full list with links to PDFs is available on our Publications Page.


🤝 Get Involved

We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.

For Researchers

  • Use our models and datasets in your work (and cite us!).
  • Collaborate on joint papers and grant proposals.
  • Contribute new benchmarks or evaluation tasks.

For Developers

  • Integrate our models into your applications.
  • Report bugs or suggest improvements via GitHub Issues.
  • Submit pull requests to our open‑source repositories.

For Native Speakers & Linguists

  • Help us validate translations and annotations.
  • Share texts or corpora (with permission) to enrich our data.
  • Provide feedback on model outputs to reduce errors.

For Students

  • Use our demos and tutorials for learning.
  • Participate in our mentorship program or summer schools.
  • Start your own research project with our support.

🌐 Connect With Us


🔄 Ecosystem Integration

Our work is integrated with the broader Hugging Face ecosystem:

  • Models on the Hub with easy‑to‑use pipelines.
  • Datasets with streaming and evaluation scripts.
  • Spaces for interactive demos and educational tools.
  • Gradio apps for user‑friendly interfaces.
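
As a rough sketch of how Hub‑hosted models and datasets are typically consumed, the snippet below wraps the `transformers` `pipeline` function and the `datasets` `load_dataset` function with `streaming=True`. The organization id `BashkirNLPWorld`, the repository names, the `fill-mask` task, and the `train` split are all assumptions: this card names models and datasets but gives no Hub paths, so substitute the actual repositories before running it.

```python
# Assumed organization id; replace with the real one if it differs.
ORG = "BashkirNLPWorld"


def repo_id(name: str, org: str = ORG) -> str:
    """Build an '<org>/<name>' Hub path. The naming scheme is an assumption."""
    return f"{org}/{name}"


def load_fill_mask(model_name: str = "BashkirBERT"):
    """Return a fill-mask pipeline for a (hypothetical) BERT-style checkpoint."""
    # Imported lazily so the helpers above work without transformers installed.
    from transformers import pipeline

    return pipeline("fill-mask", model=repo_id(model_name))


def stream_corpus(dataset_name: str = "BashkirCorpus", split: str = "train"):
    """Return an iterable over a (hypothetical) dataset without downloading it fully."""
    # Imported lazily for the same reason as above.
    from datasets import load_dataset

    return load_dataset(repo_id(dataset_name), split=split, streaming=True)
```

With real repository names in place, `load_fill_mask()` returns a callable that takes a sentence containing the tokenizer's mask token, and `stream_corpus()` returns an iterable whose first records can be inspected with a short `for` loop and an early `break`. Streaming is the natural choice for a corpus on the order of 100M tokens, since it avoids a full download.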

Empowering Bashkir and Turkic languages through open science and community collaboration.

Hugging Face GitHub Twitter

© 2026 BashkirNLP – Open source for low‑resource languages.

models 0

None public yet

datasets 0

None public yet