Abstract
Measuring the similarity score between a pair of sentences in different languages is an essential requirement for multilingual sentence embedding methods. Predicting the similarity score consists of two sub-tasks: monolingual similarity evaluation and multilingual sentence retrieval. However, conventional methods have mainly tackled only one of the sub-tasks and therefore show biased performance. In this paper, we suggest a novel and strong method for multilingual sentence embedding that improves performance on both sub-tasks, consequently resulting in robust predictions of multilingual similarity scores. The suggested method consists of two parts: learning the semantic similarity of sentences in the pivot language, and then extending the learned semantic structure to different languages. To align semantic structures across different languages, we introduce a teacher-student network. The teacher network distills the knowledge of the pivot language to the different languages of the student network. During the distillation, the parameters of the teacher network are updated with a slow-moving average. With the distillation and the parameter updating together, the semantic structure of the student network can be directly aligned across different languages while preserving the ability to measure semantic similarity. Thus, the multilingual training method drives performance improvement on multilingual similarity evaluation. The suggested model achieves state-of-the-art performance on extended STS 2017 multilingual similarity evaluation as well as on the two sub-tasks, extended STS 2017 monolingual similarity evaluation and Tatoeba multilingual retrieval in 14 languages.

- Anthology ID:
- 2021.findings-emnlp.153
- Volume:
- Findings of the Association for Computational Linguistics: EMNLP 2021
- Month:
- November
- Year:
- 2021
- Address:
- Punta Cana, Dominican Republic
- Editors:
- Marie-Francine Moens, Xuanjing Huang, Lucia Specia, Scott Wen-tau Yih
- Venue:
- Findings
- SIG:
- SIGDAT
- Publisher:
- Association for Computational Linguistics
- Pages:
- 1781–1791
- URL:
- https://aclanthology.org/2021.findings-emnlp.153
- DOI:
- 10.18653/v1/2021.findings-emnlp.153
- Cite (ACL):
- Jiyeon Ham and Eun-Sol Kim. 2021. Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 1781–1791, Punta Cana, Dominican Republic. Association for Computational Linguistics.
- Cite (Informal):
- Semantic Alignment with Calibrated Similarity for Multilingual Sentence Embedding (Ham & Kim, Findings 2021)
- PDF:
- https://preview.aclanthology.org/naacl24-info/2021.findings-emnlp.153.pdf
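
The abstract states that the teacher network's parameters are updated with a slow-moving average during distillation. The following is a minimal sketch of one such update, assuming the average is an exponential moving average of the student's parameters. That form is an assumption, not something the abstract confirms, and the function name `ema_update`, the parameter layout, and the `decay` value are all hypothetical.

```python
from typing import Dict, List

Params = Dict[str, List[float]]


def ema_update(teacher: Params, student: Params, decay: float = 0.99) -> Params:
    """Return new teacher parameters as decay * teacher + (1 - decay) * student.

    Assumption: the "slow-moving average" is an exponential moving average.
    A decay close to 1 makes the teacher change slowly.
    """
    if not 0.0 <= decay <= 1.0:
        raise ValueError("decay must be in [0, 1]")
    return {
        name: [
            decay * t + (1.0 - decay) * s
            for t, s in zip(teacher[name], student[name])
        ]
        for name in teacher
    }


# Toy parameters, for illustration only.
teacher = {"w": [1.0, 0.0]}
student = {"w": [0.0, 1.0]}
updated = ema_update(teacher, student, decay=0.5)
```

With `decay=1.0` the teacher stays fixed, and with `decay=0.0` it copies the student outright. Intermediate values trade teacher stability against how quickly it follows the student.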