Abstract
In this paper we continue experiments where neural machine translation training is used to produce joint cross-lingual fixed-dimensional sentence embeddings. In this framework we introduce a simple method of adding a loss to the learning objective which penalizes distance between representations of bilingually aligned sentences. We evaluate cross-lingual transfer using two approaches, cross-lingual similarity search on an aligned corpus (Europarl) and cross-lingual document classification on a recently published benchmark Reuters corpus, and we find the similarity loss significantly improves performance on both. Furthermore, we notice that while our Reuters results are very competitive, our English results are not as competitive, showing room for improvement in the current cross-lingual state-of-the-art. Our results are based on a set of 6 European languages.

- Anthology ID:
- W18-3023
- Volume:
- Proceedings of the Third Workshop on Representation Learning for NLP
- Month:
- July
- Year:
- 2018
- Address:
- Melbourne, Australia
- Editors:
- Isabelle Augenstein, Kris Cao, He He, Felix Hill, Spandana Gella, Jamie Kiros, Hongyuan Mei, Dipendra Misra
- Venue:
- RepL4NLP
- SIG:
- SIGREP
- Publisher:
- Association for Computational Linguistics
- Pages:
- 175–179
- URL:
- https://aclanthology.org/W18-3023
- DOI:
- 10.18653/v1/W18-3023
- Cite (ACL):
- Katherine Yu, Haoran Li, and Barlas Oguz. 2018. Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 175–179, Melbourne, Australia. Association for Computational Linguistics.
- Cite (Informal):
- Multilingual Seq2seq Training with Similarity Loss for Cross-Lingual Document Classification (Yu et al., RepL4NLP 2018)
- PDF:
- https://preview.aclanthology.org/ml4al-ingestion/W18-3023.pdf
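
The abstract describes adding a loss term that penalizes the distance between representations of bilingually aligned sentences. The following is a minimal sketch of that idea, not the authors' implementation: it assumes squared Euclidean distance as the distance measure and a scalar `weight` that balances the penalty against the translation loss. The function names `similarity_penalty` and `joint_loss` are hypothetical, and the abstract does not specify the distance function or the weighting.

```python
from typing import Sequence


def similarity_penalty(src_emb: Sequence[float], tgt_emb: Sequence[float]) -> float:
    """Squared Euclidean distance between two fixed-dimensional sentence embeddings.

    Assumption: the abstract only says the loss "penalizes distance";
    squared L2 is one plausible choice, used here for illustration.
    """
    if len(src_emb) != len(tgt_emb):
        raise ValueError("embeddings must have the same dimension")
    return sum((s - t) ** 2 for s, t in zip(src_emb, tgt_emb))


def joint_loss(
    translation_loss: float,
    src_emb: Sequence[float],
    tgt_emb: Sequence[float],
    weight: float = 1.0,
) -> float:
    """Translation loss plus a weighted similarity penalty for one aligned pair.

    `weight` is a hypothetical hyperparameter, not a value from the paper.
    """
    return translation_loss + weight * similarity_penalty(src_emb, tgt_emb)


if __name__ == "__main__":
    # Toy embeddings for an aligned sentence pair (made-up numbers).
    english = [0.0, 0.0]
    french = [3.0, 4.0]
    print(joint_loss(2.0, english, french, weight=0.5))
```

In this sketch, identical embeddings for an aligned pair contribute no penalty, so minimizing the joint objective pushes the encoder to map translations of the same sentence to nearby points while still training on the translation task.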