FaMTEB: Massive Text Embedding Benchmark in Persian Language
Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, Arash Amini
Abstract
In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are a combination of existing, translated, and newly generated (synthetic) data, offering a diverse and robust evaluation framework for Persian language models. All newly translated and synthetic datasets were rigorously evaluated by both humans and automated systems to ensure high quality and reliability. Given the growing adoption of text embedding models in chatbots, evaluation datasets are becoming an essential component of chatbot development and Retrieval-Augmented Generation (RAG) systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. Additionally, we introduce the novel task of summary retrieval, which is not included in the standard MTEB tasks. Another key contribution of this work is the introduction of a substantial number of new Persian-language NLP datasets for both training and evaluation, many of which have no existing counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models across a wide range of tasks. This work presents an open-source benchmark with datasets, accompanying code, and a public leaderboard.- Anthology ID:
- 2025.findings-emnlp.614
- Volume: Findings of the Association for Computational Linguistics: EMNLP 2025
- Month: November
- Year: 2025
- Address: Suzhou, China
- Editors: Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 11441–11468
- URL: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.614/
- DOI: 10.18653/v1/2025.findings-emnlp.614
- Cite (ACL): Erfan Zinvandi, Morteza Alikhani, Mehran Sarmadi, Zahra Pourbahman, Sepehr Arvin, Reza Kazemi, and Arash Amini. 2025. FaMTEB: Massive Text Embedding Benchmark in Persian Language. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 11441–11468, Suzhou, China. Association for Computational Linguistics.
- Cite (Informal): FaMTEB: Massive Text Embedding Benchmark in Persian Language (Zinvandi et al., Findings 2025)
- PDF: https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.614.pdf
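Since FaMTEB builds on the MTEB framework, evaluating a model on its tasks typically follows the same workflow as other MTEB benchmarks. The sketch below shows how a Persian embedding model might be scored with the open-source mteb Python toolkit; the language filter and the model name are illustrative assumptions, not the authors' exact experimental setup.

```python
# Minimal sketch: scoring a multilingual embedding model on Persian MTEB-style
# tasks with the open-source `mteb` toolkit. The task selection (tasks tagged
# with Persian, ISO 639-3 code "fas") and the model choice are illustrative
# assumptions rather than the paper's exact setup.
import mteb
from sentence_transformers import SentenceTransformer

# Any SentenceTransformers-compatible embedding model can be plugged in here.
model = SentenceTransformer("intfloat/multilingual-e5-base")

# Select benchmark tasks by language; FaMTEB tasks would be picked up here
# once registered in the installed mteb package.
tasks = mteb.get_tasks(languages=["fas"])

evaluation = mteb.MTEB(tasks=tasks)
results = evaluation.run(model, output_folder="results/multilingual-e5-base")

# Each result object carries the task name and its per-split scores.
for res in results:
    print(res.task_name, res.scores)
```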