Evaluation of Sentence Representations in Polish

Slawomir Dadas, Michał Perełkiewicz, Rafał Poświata


Abstract
Methods for learning sentence representations have been actively developed in recent years. However, the lack of pre-trained models and datasets annotated at the sentence level has been a problem for low-resource languages such as Polish, which has led to less interest in applying these methods to language-specific tasks. In this study, we introduce two new Polish datasets for evaluating sentence embeddings and provide a comprehensive evaluation of eight sentence representation methods, including Polish and multilingual models. We consider classic word embedding models, recently developed contextual embeddings, and multilingual sentence encoders, showing the strengths and weaknesses of specific approaches. We also examine different methods of aggregating word vectors into a single sentence vector.
Anthology ID:
2020.lrec-1.207
Volume:
Proceedings of the 12th Language Resources and Evaluation Conference
Month:
May
Year:
2020
Address:
Marseille, France
Venue:
LREC
Publisher:
European Language Resources Association
Pages:
1674–1680
Language:
English
URL:
https://aclanthology.org/2020.lrec-1.207
Cite (ACL):
Slawomir Dadas, Michał Perełkiewicz, and Rafał Poświata. 2020. Evaluation of Sentence Representations in Polish. In Proceedings of the 12th Language Resources and Evaluation Conference, pages 1674–1680, Marseille, France. European Language Resources Association.
Cite (Informal):
Evaluation of Sentence Representations in Polish (Dadas et al., LREC 2020)
PDF:
https://preview.aclanthology.org/update-css-js/2020.lrec-1.207.pdf