PL-MTEB: Polish Massive Text Embedding Benchmark

Rafał Poświata, Sławomir Dadas, Michał Wiktor Perełkiewicz


Abstract
In this paper, we introduce the Polish Massive Text Embedding Benchmark (PL-MTEB), a comprehensive benchmark for text embeddings in the Polish language. PL-MTEB comprises 30 diverse NLP tasks across five categories: classification, clustering, pair classification, information retrieval, and semantic textual similarity. As part of this work, we added 12 new Polish-language tasks to MTEB based on existing datasets and prepared two new datasets, which we used to create four clustering tasks. We evaluated 30 publicly available text embedding models, both Polish and multilingual, and analyzed the results in detail by task type and model size. The prepared datasets, the evaluation source code, and the obtained results are publicly available at https://github.com/rafalposwiata/pl-mteb.
Anthology ID:
2026.findings-acl.1773
Volume:
Findings of the Association for Computational Linguistics: ACL 2026
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
35601–35619
URL:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1773/
Cite (ACL):
Rafał Poświata, Sławomir Dadas, and Michał Wiktor Perełkiewicz. 2026. PL-MTEB: Polish Massive Text Embedding Benchmark. In Findings of the Association for Computational Linguistics: ACL 2026, pages 35601–35619, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
PL-MTEB: Polish Massive Text Embedding Benchmark (Poświata et al., Findings 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1773.pdf
Checklist:
 2026.findings-acl.1773.checklist.pdf