Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings via Matrix Factorization

Jilei Wang, Shiying Luo, Weiyan Shi, Tao Dai, Shu-Tao Xia


Abstract
Learning vector space representations of words (i.e., word embeddings) has recently attracted wide research interest and has been extended to the cross-lingual scenario. Currently, most cross-lingual word embedding models are based on sentence alignment, which inevitably introduces considerable noise. In this paper, we show that in Chinese and Japanese, the acquisition of semantic relations among words can benefit from the large number of common characters shared by both languages. Inspired by this unique feature, we design a method named CJC that generates cross-lingual contexts for words. We combine CJC with GloVe, which is based on matrix factorization, and propose an integrated model named CJ-Glo. Taking two sentence-aligned models and CJ-BOC (which also exploits common characters but is based on CBOW) as baselines, we compare them with CJ-Glo on a series of NLP tasks including cross-lingual synonym, word analogy, and sentence alignment. The results indicate that CJ-Glo achieves the best performance among these methods and is more stable in cross-lingual tasks; moreover, compared with CJ-BOC, CJ-Glo is less sensitive to changes in parameters.
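For context, GloVe, the model that CJ-Glo builds on, fits word vectors by a weighted least-squares factorization of the log co-occurrence matrix. The standard monolingual GloVe objective (Pennington et al., 2014) is sketched below as a reference point only. The abstract does not say how the cross-lingual contexts produced by CJC enter the co-occurrence counts, so this should not be read as the CJ-Glo objective itself. The symbols (X, w, b, f, x_max, alpha) follow the GloVe paper and do not appear in this record.

```latex
% Standard GloVe objective (Pennington et al., 2014); not the CJ-Glo objective.
% X_{ij}: number of times word j occurs in the context of word i
% w_i, \tilde{w}_j: word and context vectors; b_i, \tilde{b}_j: bias terms
% V: vocabulary size
J = \sum_{i,j=1}^{V} f(X_{ij})
    \left( w_i^{\top} \tilde{w}_j + b_i + \tilde{b}_j - \log X_{ij} \right)^{2}

% Weighting function that damps very frequent co-occurrences
% (the GloVe paper uses x_{\max} = 100 and \alpha = 3/4)
f(x) =
\begin{cases}
  (x / x_{\max})^{\alpha} & \text{if } x < x_{\max}, \\
  1 & \text{otherwise.}
\end{cases}
```

Under this formulation, any method that changes which word pairs count as co-occurring, as a cross-lingual context generator presumably would, acts on the counts X_{ij} rather than on the form of the loss.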
Anthology ID:
W18-3015
Volume:
Proceedings of the Third Workshop on Representation Learning for NLP
Month:
July
Year:
2018
Address:
Melbourne, Australia
Editors:
Isabelle Augenstein, Kris Cao, He He, Felix Hill, Spandana Gella, Jamie Kiros, Hongyuan Mei, Dipendra Misra
Venue:
RepL4NLP
SIG:
SIGREP
Publisher:
Association for Computational Linguistics
Pages:
113–121
URL:
https://aclanthology.org/W18-3015
DOI:
10.18653/v1/W18-3015
Cite (ACL):
Jilei Wang, Shiying Luo, Weiyan Shi, Tao Dai, and Shu-Tao Xia. 2018. Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings via Matrix Factorization. In Proceedings of the Third Workshop on Representation Learning for NLP, pages 113–121, Melbourne, Australia. Association for Computational Linguistics.
Cite (Informal):
Exploiting Common Characters in Chinese and Japanese to Learn Cross-Lingual Word Embeddings via Matrix Factorization (Wang et al., RepL4NLP 2018)
PDF:
https://preview.aclanthology.org/naacl24-info/W18-3015.pdf