BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language
Konrad Wojtasik, Kacper Wołowiec, Vadim Shishkin, Arkadiusz Janz, Maciej Piasecki
Abstract
The BEIR dataset is a large, heterogeneous benchmark for Information Retrieval (IR) that has garnered considerable attention within the research community. However, BEIR and analogous datasets are predominantly restricted to the English language. Our objective is to establish extensive large-scale resources for IR in the Polish language, thereby advancing research in this area of NLP. In this work, inspired by the mMARCO and Mr. TyDi datasets, we translated all accessible open IR datasets into Polish and introduced BEIR-PL, a new benchmark comprising 13 datasets that facilitates further development, training and evaluation of modern Polish language models for IR tasks. We evaluated and compared numerous IR models on the newly introduced BEIR-PL benchmark. Furthermore, we publish pre-trained open IR models for the Polish language, marking a pioneering development in this field. BEIR-PL is included in the MTEB benchmark and is also available, together with the trained models, at https://huggingface.co/clarin-knext.
- Anthology ID:
- 2024.lrec-main.194
- Volume:
- Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
- Month:
- May
- Year:
- 2024
- Address:
- Torino, Italia
- Editors:
- Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
- Venues:
- LREC | COLING
- Publisher:
- ELRA and ICCL
- Pages:
- 2149–2160
- URL:
- https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.lrec-main.194/
- Cite (ACL):
- Konrad Wojtasik, Kacper Wołowiec, Vadim Shishkin, Arkadiusz Janz, and Maciej Piasecki. 2024. BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 2149–2160, Torino, Italia. ELRA and ICCL.
- Cite (Informal):
- BEIR-PL: Zero Shot Information Retrieval Benchmark for the Polish Language (Wojtasik et al., LREC-COLING 2024)
- PDF:
- https://preview.aclanthology.org/sigedu-bea-out-of-sync-correction/2024.lrec-main.194.pdf