AI & ML interests
Bashkir language, Turkic NLP, low-resource languages, computational linguistics, machine translation, speech technologies, language models, linguistic resources
BashkirNLP – Turkic & Low‑Resource Languages Research Hub
BashkirNLP is a collaborative research initiative dedicated to advancing natural language processing for Bashkir, other Turkic languages, and low‑resource languages of the Volga‑Ural region and beyond. We develop state‑of‑the‑art language models, machine translation systems, linguistic resources, and educational tools to support under‑represented languages in the digital age.
🎯 Our Mission
- Build open‑source language models for Bashkir and other Turkic languages (such as Tatar, Kazakh, and Chuvash).
- Create high‑quality linguistic resources (corpora, lexicons, evaluation benchmarks).
- Advance machine translation between Bashkir, Russian, English, and major Turkic languages.
- Develop educational materials and interactive demos to lower the entry barrier for low‑resource NLP.
- Foster a community of researchers, developers, and native speakers working together on language technology.
🚀 Interactive Demos
Explore our live Hugging Face Spaces and try out our models directly in your browser:
🔤 Language Models
- BashkirGPT Playground – Generate and analyze Bashkir text with our latest causal LM.
- TurkicBERT Explorer – Masked language modeling for Bashkir, Tatar, and other Turkic languages.
- Multilingual Embeddings – Compare word/sentence vectors across Turkic languages.
🌐 Machine Translation
- Bashkir ↔ Russian Translator – Neural translation between Bashkir and Russian.
- Bashkir ↔ Tatar Translator – Translation demo for closely related Turkic languages.
- Bashkir ↔ English Translator – Experimental translation for a low‑resource language pair.
📚 Linguistic Tools
- Bashkir Morphological Analyzer – Interactive segmentation and POS tagging (Cyrillic & Latin scripts).
- Named Entity Recognition for Bashkir – Identify persons, locations, organizations.
- Script Converter – Convert between Cyrillic Bashkir and Latin-based orthographies.
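As a rough illustration of what script conversion involves, here is a minimal sketch of a character-level Cyrillic-to-Latin transliterator. The mapping table is an assumption made for this example only: several Latin orthographies have been proposed for Bashkir, and this table is not necessarily the scheme used by the Script Converter demo. The names `CYR_TO_LAT` and `to_latin` are hypothetical.

```python
# Illustrative (non-official) Cyrillic-to-Latin mapping for Bashkir.
# Assumption: one plausible Turkic-style Latin scheme, chosen for the example.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "ғ": "ğ", "д": "d", "ҙ": "ź",
    "е": "e", "ё": "yo", "ж": "j", "з": "z", "и": "i", "й": "y", "к": "k",
    "ҡ": "q", "л": "l", "м": "m", "н": "n", "ң": "ñ", "о": "o", "ө": "ö",
    "п": "p", "р": "r", "с": "s", "ҫ": "ś", "т": "t", "у": "u", "ү": "ü",
    "ф": "f", "х": "x", "һ": "h", "ц": "ts", "ч": "ç", "ш": "ş", "щ": "şç",
    "ъ": "", "ы": "ı", "ь": "", "э": "e", "ә": "ä", "ю": "yu", "я": "ya",
}

# Build a translation table that also covers uppercase letters.
_TABLE = {}
for cyr, lat in CYR_TO_LAT.items():
    _TABLE[ord(cyr)] = lat
    _TABLE[ord(cyr.upper())] = lat.capitalize()


def to_latin(text: str) -> str:
    """Transliterate Cyrillic Bashkir text; other characters pass through."""
    return text.translate(_TABLE)


print(to_latin("Башҡортостан"))
```

A plain character table like this ignores context-dependent rules (for example, word-initial "е"), so a production converter would need additional logic and a reverse mapping.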
📊 Data & Benchmarks
- Bashkir Corpus Explorer – Browse and query our curated text collections.
- Turkic NLP Leaderboard – Compare model performance on Bashkir, Tatar, and other Turkic tasks.
- Annotation Tools – Help us improve datasets with your feedback.
Click on any demo to start experimenting – no installation required!
🧠 Research Focus Areas
🏞️ Bashkir Language Technologies
- Creation of the first large‑scale pretrained models for Bashkir (Cyrillic script, with Latin adaptation).
- Morphological disambiguation and syntactic parsing for Bashkir (agglutinative morphology).
- Speech recognition and synthesis for Bashkir (coming soon).
📜 Turkic NLP
- Cross‑lingual transfer learning among Bashkir, Tatar, Kazakh, and other Kipchak languages.
- Unified tokenization and subword models for the Turkic language family.
- Machine translation between Turkic languages and major world languages.
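To make the idea of a shared subword vocabulary concrete, the following is a toy byte-pair-encoding (BPE) learner in pure Python. It is a sketch under stated assumptions, not our training pipeline: the five-word list mixing Bashkir, Tatar, and Kazakh forms of "book(s)" is illustrative data, and the helper names (`learn_bpe`, `merge_pair`, `segment`) are hypothetical. Real systems would typically use a dedicated tokenizer library trained on large corpora.

```python
from collections import Counter


def merge_pair(symbols, pair):
    """Merge every non-overlapping occurrence of `pair` in a symbol sequence."""
    a, b = pair
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and symbols[i] == a and symbols[i + 1] == b:
            out.append(a + b)
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def learn_bpe(words, num_merges):
    """Learn up to `num_merges` BPE merges from a list of words."""
    # Each word starts as a tuple of single characters.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in vocab.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair wins; ties are broken deterministically.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        new_vocab = Counter()
        for symbols, freq in vocab.items():
            new_vocab[merge_pair(symbols, best)] += freq
        vocab = new_vocab
    return merges


def segment(word, merges):
    """Segment a word by replaying the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)


# Toy data (assumption): Bashkir, Tatar, and Kazakh forms of "book(s)".
toy_words = ["китап", "китаптар", "китаплар", "кітап", "кітаптар"]
merges = learn_bpe(toy_words, num_merges=4)
print(segment("китаптар", merges))  # ['китап', 'та', 'р']
print(segment("кітаптар", merges))
```

Even on this tiny sample, the shared piece "тап" is learned from all three languages at once, which is the intuition behind training one subword model for related languages.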
📉 Low‑Resource NLP
- Data augmentation and semi‑supervised learning techniques.
- Leveraging multilingual models (e.g., mT5, XLM‑R) for under‑represented languages.
- Few‑shot and zero‑shot learning for tasks like NER and sentiment analysis.
🤖 Language Models
- Pretraining from scratch and continued pretraining on Bashkir/Turkic corpora.
- Efficient architectures (ALBERT, DistilBERT) for low‑resource settings.
- Evaluation and bias analysis of Turkic language models.
📖 Linguistic Resources
- Corpora: News, literature, web‑crawled texts, social media (e.g., VK, Telegram).
- Lexicons: Morphological dictionaries, wordnets, sentiment lexicons.
- Benchmarks: Named entity recognition, part‑of‑speech tagging, machine translation test sets.
📦 Models & Datasets
We release all our models and datasets on the Hugging Face Hub under open licenses.
| Model / Dataset | Description | Link |
|---|---|---|
| BashkirBERT | BERT‑base model pretrained on Bashkir Cyrillic texts | 🤗 Hub |
| Turkic‑mT5 | Multilingual T5 fine‑tuned on Bashkir, Tatar, and Kazakh | 🤗 Hub |
| Bashkir‑MT‑BaRu | Transformer‑based translation model (Bashkir ↔ Russian) | 🤗 Hub |
| Bashkir‑NER | Named entity recognition model for Bashkir | 🤗 Hub |
| BashkirCorpus v1.0 | 100M‑token corpus from news, books, and websites | 🤗 Dataset |
| Turkic‑Parallel‑Bench | Parallel sentences for Bashkir, Tatar, and Turkish | 🤗 Dataset |
More models and datasets are added regularly. Follow our organization page for updates.
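For readers who want to try a masked language model such as BashkirBERT from Python, the sketch below shows one way to query a fill-mask model with the `transformers` `pipeline` API. The function name `top_fill_mask` is hypothetical, and the repository ID is deliberately left as a placeholder: substitute the actual ID from the Hub link in the table above. The example sentence is an illustration only.

```python
def top_fill_mask(repo_id, template, top_k=5):
    """Return (token, score) pairs for the masked slot in `template`.

    `template` must contain the literal placeholder "{mask}".
    """
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    fill = pipeline("fill-mask", model=repo_id)
    # Use the tokenizer's own mask token instead of hard-coding "[MASK]".
    text = template.format(mask=fill.tokenizer.mask_token)
    predictions = fill(text, top_k=top_k)
    return [(p["token_str"], p["score"]) for p in predictions]


# Example call (not executed here because it downloads model weights).
# "<org>/<model-name>" is a placeholder, not a real repository ID:
#   top_fill_mask("<org>/<model-name>", "Мин {mask} телен өйрәнәм.")
```

Reading the mask token from the tokenizer keeps the helper usable across model families, since BERT-style and XLM‑R-style checkpoints use different mask strings.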
📚 Educational Resources
We believe in open education and reproducible research. All our tutorials and teaching materials are freely available.
- Interactive Notebooks – Hands‑on tutorials for building low‑resource NLP systems (in Python, using Hugging Face libraries).
- Video Lectures – Recorded talks on Bashkir/Turkic NLP, data collection, and model training.
- Course Materials – Slides, readings, and assignments from our university courses.
- Blog Posts – Deep dives into challenges and solutions for Bashkir and Turkic languages.
📝 Selected Publications
- "BashkirBERT: A Pretrained Language Model for Bashkir" – LREC 2025 (planned)
- "Machine Translation for Low-Resource Turkic Languages: Bashkir–Russian Case Study" – WMT 2024
- "Building a Named Entity Recognition Dataset for Bashkir" – TurkicLang 2024
- "Multilingual Representations for Kipchak Languages: A Comparative Study" – EMNLP 2023
- "Bashkir Corpus: Collection, Annotation, and Baseline Experiments" – Dialogue 2023
The full list, with links to PDFs, is available on our Publications Page.
🤝 Get Involved
We welcome contributions from the community – whether you are a researcher, developer, student, or native speaker.
For Researchers
- Use our models and datasets in your work (and cite us!).
- Collaborate on joint papers and grant proposals.
- Contribute new benchmarks or evaluation tasks.
For Developers
- Integrate our models into your applications.
- Report bugs or suggest improvements via GitHub Issues.
- Submit pull requests to our open‑source repositories.
For Native Speakers & Linguists
- Help us validate translations and annotations.
- Share texts or corpora (with permission) to enrich our data.
- Provide feedback on model outputs to reduce errors.
For Students
- Use our demos and tutorials for learning.
- Participate in our mentorship program or summer schools.
- Start your own research project with our support.
🌐 Connect With Us
- 🤗 Hugging Face: BashkirNLP – Models, datasets, and Spaces.
- 💻 GitHub: BashkirNLP – Source code, development, and issue tracking.
- 📧 Email: contact@bashkirnlp.org – General inquiries and collaboration.
- 📝 Blog: Medium/BashkirNLP – In‑depth articles.
🔄 Ecosystem Integration
Our work is integrated with the broader Hugging Face ecosystem:
- Models on the Hub with easy‑to‑use pipelines.
- Datasets with streaming and evaluation scripts.
- Spaces for interactive demos and educational tools.
- Gradio apps for user‑friendly interfaces.
Empowering Bashkir and Turkic languages through open science and community collaboration.