TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations

Mehmet Selman Baysan; Tunga Güngör

doi:10.18653/v1/2025.findings-emnlp.471

Abstract

We introduce TR-MTEB, the first large-scale, task-diverse benchmark designed to evaluate sentence embedding models for Turkish. Covering six core tasks (classification, clustering, pair classification, retrieval, bitext mining, and semantic textual similarity), TR-MTEB incorporates 26 high-quality datasets, including native and translated resources. To complement this benchmark, we construct a corpus of 34.2 million weakly supervised Turkish sentence pairs and train two Turkish-specific embedding models using contrastive pretraining and supervised fine-tuning. Evaluation results show that our models, despite being trained on limited resources, achieve competitive performance across most tasks and significantly improve upon baseline monolingual models. All datasets, models, and evaluation pipelines are publicly released to facilitate further research in Turkish natural language processing and low-resource benchmarking.
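As a rough illustration of what contrastive pretraining on sentence pairs involves, the sketch below computes an InfoNCE-style loss with in-batch negatives. This is an assumption about the general technique, not the authors' confirmed objective: the abstract does not name the loss, and the function names, the temperature of 0.05, and the toy vectors are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives (illustrative only).

    queries[i] and positives[i] are embeddings of a matched sentence pair;
    every positives[j] with j != i acts as a negative for queries[i].
    """
    total = 0.0
    for i, q in enumerate(queries):
        logits = [cosine(q, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - logits[i]
    return total / len(queries)

# Toy 2-D "embeddings" (made-up values, not taken from the paper).
queries = [[1.0, 0.0], [0.0, 1.0]]
aligned = [[0.9, 0.1], [0.1, 0.9]]   # each positive is close to its query
shuffled = [[0.1, 0.9], [0.9, 0.1]]  # positives swapped between queries

loss_aligned = info_nce_loss(queries, aligned)
loss_shuffled = info_nce_loss(queries, shuffled)
```

The loss is small when each sentence is closest to its own pair and large when the pairs are mismatched, which is the signal that pulls paired sentences together in embedding space during training.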

Anthology ID: 2025.findings-emnlp.471
Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
Month: November
Year: 2025
Address: Suzhou, China
Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 8867–8887
URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.471/
DOI: 10.18653/v1/2025.findings-emnlp.471
Cite (ACL): Mehmet Selman Baysan and Tunga Güngör. 2025. TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 8867–8887, Suzhou, China. Association for Computational Linguistics.
Cite (Informal): TR-MTEB: A Comprehensive Benchmark and Embedding Model Suite for Turkish Sentence Representations (Baysan & Güngör, Findings 2025)
PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.471.pdf
Checklist: 2025.findings-emnlp.471.checklist.pdf
