---
title: TR-MTEB
emoji: 🇹🇷📊
colorFrom: purple
colorTo: blue
sdk: static
pinned: true
---

# TR-MTEB: Turkish Massive Text Embedding Benchmark

Welcome to the official Hugging Face organization for **TR-MTEB**,  
the first large-scale and task-diverse benchmark for evaluating **Turkish sentence embedding models**.

---

## 📌 Paper

**TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations**  
Mehmet Selman Baysan, Tunga Gungor  
*Findings of EMNLP 2025*

- 📄 ACL Anthology: https://aclanthology.org/2025.findings-emnlp.471/  
- DOI: https://doi.org/10.18653/v1/2025.findings-emnlp.471

> We introduce TR-MTEB, the first comprehensive benchmark for Turkish sentence representations, covering six core embedding evaluation tasks and 26 datasets.

---

## 🔍 Benchmark Overview

TR-MTEB provides evaluation across **6 major embedding task categories**:

- **Classification**
- **Clustering**
- **Pair Classification**
- **Retrieval**
- **Bitext Mining**
- **Semantic Textual Similarity (STS)**

📊 Total datasets included: **26**  
🌍 A combination of native Turkish and high-quality translated datasets
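
To make the STS category concrete, here is a minimal sketch of how such a task is commonly scored in MTEB-style benchmarks: cosine similarity between the two sentence embeddings, then Spearman correlation against human similarity scores. This is an assumption based on common practice, not a description of the TR-MTEB evaluation pipeline itself. The vectors, the gold scores, and all function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def ranks(values):
    """1-based ranks; tied values share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Toy embedding pairs (hypothetical) with made-up gold similarity scores.
pairs = [
    ([1.0, 0.0], [1.0, 0.1]),  # nearly identical direction
    ([1.0, 0.0], [0.5, 0.5]),  # partially related
    ([1.0, 0.0], [0.0, 1.0]),  # orthogonal
]
gold = [4.8, 2.5, 0.3]

predicted = [cosine(u, v) for u, v in pairs]
score = spearman(predicted, gold)
print(f"Spearman correlation: {score:.3f}")
```

In a real run, the toy vectors would be replaced by the outputs of the embedding model under evaluation.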

---

## 🧠 Turkish Embedding Models

To complement the benchmark, we also release Turkish-specific embedding models trained with:

- **34.2 million weakly supervised Turkish sentence pairs**
- Contrastive pretraining + supervised fine-tuning
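
For readers unfamiliar with contrastive pretraining on sentence pairs, the sketch below shows one widely used objective: an InfoNCE loss with in-batch negatives, where each anchor should score its own paired sentence higher than every other sentence in the batch. Whether the released models use exactly this loss is an assumption; the function names, the temperature of 0.05, and the toy vectors are all hypothetical.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def in_batch_contrastive_loss(anchors, positives, temperature=0.05):
    """Mean InfoNCE loss over a batch.

    anchors[i] is paired with positives[i]; every other positive in the
    batch serves as a negative for anchor i.
    """
    a = [normalize(v) for v in anchors]
    p = [normalize(v) for v in positives]
    total = 0.0
    for i, anchor in enumerate(a):
        logits = [dot(anchor, pos) / temperature for pos in p]
        # Numerically stable log-sum-exp over all candidates in the batch.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        # Cross-entropy with the true pair (index i) as the target.
        total += log_denominator - logits[i]
    return total / len(a)

# Toy batch (hypothetical): two anchors and their paired sentences.
anchors = [[1.0, 0.0], [0.0, 1.0]]
aligned_positives = [[0.9, 0.1], [0.1, 0.9]]   # correct pairing
shuffled_positives = [[0.1, 0.9], [0.9, 0.1]]  # pairing deliberately swapped

loss_aligned = in_batch_contrastive_loss(anchors, aligned_positives)
loss_shuffled = in_batch_contrastive_loss(anchors, shuffled_positives)
print(f"aligned: {loss_aligned:.4f}  shuffled: {loss_shuffled:.4f}")
```

The loss is low when each anchor is closest to its own pair and high when the pairs are mismatched, which is the signal that pulls paired sentences together during training.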

As reported in the paper, these models significantly outperform monolingual baselines.

---

## 📂 Released Resources

This organization hosts:

✅ Benchmark datasets  
✅ Evaluation pipeline  
✅ Turkish embedding model suite  
✅ Training corpus and scripts (where applicable)

All resources are released publicly to support research in:

- Turkish NLP  
- Low-resource language embeddings  
- Multilingual benchmark development  

---

## 🌟 Citation

If you use TR-MTEB in your work, please cite:

```bibtex
@inproceedings{baysan-gungor-2025-tr,
  title = "{TR}-{MTEB}: A Comprehensive Benchmark and Embedding Model Suite for {T}urkish Sentence Representations",
  author = "Baysan, Mehmet Selman and Gungor, Tunga",
  booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
  month = nov,
  year = "2025",
  address = "Suzhou, China",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.findings-emnlp.471/",
  doi = "10.18653/v1/2025.findings-emnlp.471",
  pages = "8867--8887"
}
```

---

## 🤝 Contact & Contributions

We welcome contributions, new datasets, and collaborations.

Author: Mehmet Selman Baysan (mselmanbaysan@gmail.com)

Organization: TR-MTEB Project

Feel free to open issues or discussions on Hugging Face.

🇹🇷 Building better embedding benchmarks for Turkish and low-resource languages.