Abstract
We train and evaluate Norwegian sentence embedding models using the contrastive learning methodology SimCSE. We start from pre-trained Norwegian encoder models and train both unsupervised and supervised models. The models are evaluated on a machine-translated version of semantic textual similarity datasets, as well as on binary classification tasks. We show that we can train good Norwegian sentence embedding models that clearly outperform the pre-trained encoder models, as well as multilingual mBERT, on the task of sentence similarity.
- Anthology ID:
- 2023.nodalida-1.23
- Volume:
- Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa)
- Month:
- May
- Year:
- 2023
- Address:
- Tórshavn, Faroe Islands
- Editors:
- Tanel Alumäe, Mark Fishel
- Venue:
- NoDaLiDa
- Publisher:
- University of Tartu Library
- Pages:
- 228–237
- URL:
- https://aclanthology.org/2023.nodalida-1.23
- Cite (ACL):
- Bernt Ivar Utstøl Nødland. 2023. Training and Evaluating Norwegian Sentence Embedding Models. In Proceedings of the 24th Nordic Conference on Computational Linguistics (NoDaLiDa), pages 228–237, Tórshavn, Faroe Islands. University of Tartu Library.
- Cite (Informal):
- Training and Evaluating Norwegian Sentence Embedding Models (Nødland, NoDaLiDa 2023)
- PDF:
- https://preview.aclanthology.org/nschneid-patch-5/2023.nodalida-1.23.pdf
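The unsupervised SimCSE objective mentioned in the abstract can be sketched briefly: each sentence in a batch is encoded twice, and because dropout is active during encoding, the two passes yield slightly different embeddings that form a positive pair; all other sentences in the batch serve as in-batch negatives under an InfoNCE (contrastive cross-entropy) loss. The following is a minimal NumPy sketch under those assumptions, not the paper's actual code; the function names and toy data are illustrative.

```python
import numpy as np

def cosine_sim(a, b):
    """Pairwise cosine similarity matrix between rows of a and rows of b."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return a @ b.T

def simcse_loss(h1, h2, temperature=0.05):
    """Unsupervised SimCSE (InfoNCE) loss.

    h1 and h2 are (batch, dim) embeddings of the SAME sentences encoded
    twice; different dropout masks make h2[i] the positive for h1[i],
    while every other h2[j] in the batch acts as a negative.
    """
    sim = cosine_sim(h1, h2) / temperature          # (batch, batch) logits
    sim = sim - sim.max(axis=1, keepdims=True)      # numerical stability
    log_prob = sim - np.log(np.exp(sim).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))              # cross-entropy on the diagonal

# Toy data: simulate two dropout-perturbed encodings of an 8-sentence batch.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 16))
h1 = base + 0.01 * rng.normal(size=base.shape)
h2 = base + 0.01 * rng.normal(size=base.shape)
loss = simcse_loss(h1, h2)
print(f"SimCSE loss on near-identical views: {loss:.4f}")
```

In an actual training run the embeddings would come from a Norwegian encoder (e.g. a Norwegian BERT variant) rather than random vectors, and the loss would be minimized with gradient descent; the supervised variant instead takes positives (and hard negatives) from labeled sentence pairs.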