Mehran Sarmadi
2025
FaMTEB: Massive Text Embedding Benchmark in Persian Language
Erfan Zinvandi | Morteza Alikhani | Mehran Sarmadi | Zahra Pourbahman | Sepehr Arvin | Reza Kazemi | Arash Amini
Findings of the Association for Computational Linguistics: EMNLP 2025
In this paper, we introduce a comprehensive benchmark for Persian (Farsi) text embeddings, built upon the Massive Text Embedding Benchmark (MTEB). Our benchmark includes 63 datasets spanning seven different tasks: classification, clustering, pair classification, reranking, retrieval, summary retrieval, and semantic textual similarity. The datasets are a combination of existing, translated, and newly generated (synthetic) data, offering a diverse and robust evaluation framework for Persian language models. All newly translated and synthetic datasets were rigorously evaluated by both humans and automated systems to ensure high quality and reliability. Given the growing adoption of text embedding models in chatbots, evaluation datasets are becoming an essential component of chatbot development and Retrieval-Augmented Generation (RAG) systems. As a contribution, we include chatbot evaluation datasets in the MTEB benchmark for the first time. Additionally, we introduce the novel task of summary retrieval, which is not included in the standard MTEB tasks. Another key contribution of this work is the introduction of a substantial number of new Persian-language NLP datasets for both training and evaluation, many of which have no existing counterparts in Persian. We evaluate the performance of several Persian and multilingual embedding models across a wide range of tasks. This work presents an open-source benchmark with datasets, accompanying code, and a public leaderboard.
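FaMTEB follows MTEB's task protocols. For the semantic textual similarity (STS) task, MTEB-style benchmarks score a model by the Spearman correlation between the cosine similarities of its embedding pairs and human similarity judgments. A minimal, self-contained sketch of that scoring step (toy vectors, tie-free rank correlation; not the actual FaMTEB code):

```python
import numpy as np

def cosine_sim(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors.
    Simplified: assumes no ties in x or y."""
    rx = np.argsort(np.argsort(x)).astype(float)
    ry = np.argsort(np.argsort(y)).astype(float)
    return float(np.corrcoef(rx, ry)[0, 1])

# Toy sentence pairs: model embeddings and gold human similarity scores.
pairs = [
    (np.array([1.0, 0.0]), np.array([1.0, 0.0])),  # near-identical meaning
    (np.array([1.0, 0.0]), np.array([1.0, 1.0])),  # related
    (np.array([1.0, 0.0]), np.array([0.0, 1.0])),  # unrelated
]
gold = np.array([5.0, 3.0, 1.0])

sims = np.array([cosine_sim(a, b) for a, b in pairs])
score = spearman(sims, gold)  # 1.0 here: model ranking matches gold ranking
```

Real benchmark runs use the model's actual sentence embeddings over the full test split; only the rank-correlation scoring is shown here.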
PerCoR: Evaluating Commonsense Reasoning in Persian via Multiple-Choice Sentence Completion
Morteza Alikhani | Mohammadtaha Bagherifard | Erfan Zinvandi | Mehran Sarmadi
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
We introduce PerCoR (Persian Commonsense Reasoning), the first large-scale Persian benchmark for commonsense reasoning. PerCoR contains 106K multiple-choice sentence-completion problems drawn from more than forty news, cultural, and other web sources. We adopt a linguistically grounded, conjunction-based segmentation strategy to generate coherent prefix–continuation pairs. To create challenging distractors, we propose DRESS-AF (Distractor Ranking via Embedding Similarity Scoring and Adversarial Filtering), a generation-free adversarial filtering method that selects distractors from the pool of gold continuations while maximising model confusion. Human annotators score 89% on PerCoR, while OpenAI-o3 achieves the highest performance at 92.18%, followed closely by Claude-Sonnet-3.7 (91.17%). The strongest open-source model, DeepSeek-R1, reaches 82.51%, underscoring both the dataset's difficulty and the remaining performance gap in Persian commonsense reasoning. We further show that DRESS-AF transfers to the English HellaSwag benchmark, increasing its difficulty without hurting human solvability. The dataset is available at https://huggingface.co/datasets/MCINext/PerCoR.
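The core of DRESS-AF is ranking candidate distractors drawn from other items' gold continuations by their embedding similarity to the target item. A minimal sketch of that ranking step, assuming cosine similarity over precomputed continuation embeddings (the paper's adversarial-filtering stage against solver models is omitted, and all names here are illustrative, not from the released code):

```python
import numpy as np

def select_distractors(gold_idx, embeddings, k=3):
    """Pick k distractors for item `gold_idx` from the pool of all gold
    continuations, ranked by cosine similarity to the gold continuation
    (more similar = more confusable). Illustrative simplification of
    DRESS-AF's embedding-similarity ranking step."""
    # L2-normalise so that dot products are cosine similarities.
    emb = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = emb @ emb[gold_idx]
    sims[gold_idx] = -np.inf  # never select the gold continuation itself
    # Indices of the k most similar pool items, most similar first.
    return np.argsort(sims)[::-1][:k]

# Toy pool of 4 continuation embeddings; item 0 is the target.
pool = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]])
distractors = select_distractors(0, pool, k=2)  # -> indices [1, 2]
```

Because distractors are drawn from real gold continuations rather than generated, they stay fluent by construction; the similarity ranking is what makes them hard.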