---
title: TR-MTEB
emoji: 🇹🇷📊
colorFrom: purple
colorTo: blue
sdk: static
pinned: true
---

# TR-MTEB: Turkish Massive Text Embedding Benchmark

Welcome to the official Hugging Face organization for **TR-MTEB**, the first large-scale and task-diverse benchmark for evaluating **Turkish sentence embedding models**.

---

## 📌 Paper

**TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations**

Mehmet Selman Baysan, Tunga Gungor

*Findings of EMNLP 2025*

- 📄 ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/
- DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471

> We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.

---

## 🔍 Benchmark Overview

TR-MTEB provides evaluation across **6 major embedding task categories**:

- **Classification**
- **Clustering**
- **Pair Classification**
- **Retrieval**
- **Bitext Mining**
- **Semantic Textual Similarity (STS)**

📊 Total datasets included: **26**

🌍 Combination of native Turkish + high-quality translated datasets

---

## 🧠 Turkish Embedding Models

To complement the benchmark, we also release Turkish-specific embedding models trained on:

- **34.2 million weakly supervised Turkish sentence pairs**
- Contrastive pretraining + supervised fine-tuning

These models achieve strong performance and significantly outperform monolingual baselines.

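
To make the contrastive pretraining step more concrete, here is a minimal sketch of one common objective for sentence-pair training: an InfoNCE-style loss with in-batch negatives. This is an assumption for illustration only; this card does not state which loss the released models use. The function names `cosine` and `info_nce_loss`, the temperature of `0.05`, and the toy 2-dimensional vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    positives[i] is the paired sentence for queries[i]; every other
    positives[j] in the batch is treated as a negative for queries[i].
    The temperature value is an illustrative assumption.
    """
    losses = []
    for i, query in enumerate(queries):
        logits = [cosine(query, p) / temperature for p in positives]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the true pair (index i) as the target class.
        losses.append(log_sum_exp - logits[i])
    return sum(losses) / len(losses)


# Toy "embeddings": correctly paired vs. deliberately mismatched pairs.
batch = [[1.0, 0.0], [0.0, 1.0]]
aligned = info_nce_loss(batch, [[1.0, 0.0], [0.0, 1.0]])
shuffled = info_nce_loss(batch, [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

The loss is low when each sentence is closest to its own pair and high when pairs are mismatched, which is the signal that pulls paired sentences together during pretraining. In practice this would be computed on model outputs with an autodiff framework rather than on hand-written vectors.
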

---

## 📂 Released Resources

This organization hosts:

- ✅ Benchmark datasets
- ✅ Evaluation pipeline
- ✅ Turkish embedding model suite
- ✅ Training corpus and scripts (where applicable)

All resources are released publicly to support research in:

- Turkish NLP
- Low-resource language embeddings
- Multilingual benchmark development

---

## 🌟 Citation

If you use TR-MTEB in your work, please cite:

```bibtex
@inproceedings{baysan-gungor-2025-tr,
    title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
    author = "Baysan, Mehmet Selman and Gungor, Tunga",
    booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.findings-emnlp.471/",
    doi = "10.18653/v1/2025.findings-emnlp.471",
    pages = "8867--8887"
}
```

---

## 🤝 Contact & Contributions

We welcome contributions, new datasets, and collaborations.

Author: Mehmet Selman Baysan (mselmanbaysan@gmail.com)

Organization: TR-MTEB Project

Feel free to open issues or discussions on Hugging Face.

🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.