Eigencharacter: An Embedding of Chinese Character Orthography

Yu-Hsiang Tseng, Shu-Kai Hsieh


Abstract
Chinese characters are unique in their logographic nature, which inherently encodes world knowledge accumulated through thousands of years of evolution. This paper proposes an embedding approach, namely eigencharacter (EC) space, which helps NLP applications easily access the knowledge encoded in Chinese orthography. These EC representations are automatically extracted, encode both structural and radical information, and integrate easily with other computational models. We built EC representations of 5,000 Chinese characters, investigated the orthographic knowledge encoded in ECs, and demonstrated how these ECs identify visually similar characters using both structural and radical information.
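The abstract does not say how the EC space is extracted. As a minimal sketch, assuming an eigenface-style construction (a singular value decomposition over flattened character bitmaps), the idea of embedding characters and retrieving visually similar ones might look like the following. The function names `fit_eigencharacters` and `most_similar`, the 8x8 synthetic glyphs, and the choice of Euclidean distance are all illustrative assumptions, not the authors' implementation or data.

```python
import numpy as np


def fit_eigencharacters(bitmaps, n_components):
    """Assumed eigenface-style fit: SVD over mean-centered, flattened bitmaps.

    bitmaps: array of shape (n_chars, height, width).
    Returns (mean, components, embeddings).
    """
    X = bitmaps.reshape(len(bitmaps), -1).astype(float)
    mean = X.mean(axis=0)
    # Rows of vt are orthonormal directions in pixel space,
    # ordered by decreasing singular value.
    _, _, vt = np.linalg.svd(X - mean, full_matrices=False)
    components = vt[:n_components]
    embeddings = (X - mean) @ components.T
    return mean, components, embeddings


def most_similar(embeddings, index, k=1):
    """Indices of the k nearest rows to embeddings[index] (Euclidean)."""
    dists = np.linalg.norm(embeddings - embeddings[index], axis=1)
    dists[index] = np.inf  # exclude the query itself
    return np.argsort(dists)[:k].tolist()


# Synthetic stand-ins for character bitmaps (not real glyphs).
g0 = np.zeros((8, 8))
g0[:, 1] = 1          # a vertical stroke on the left
g0[2, 4:8] = 1        # a horizontal stroke on the right
g1 = g0.copy()
g1[5, 6] = 1          # same shape plus one extra dot
g2 = np.eye(8)        # a diagonal stroke
g3 = np.zeros((8, 8))
g3[0, :] = g3[7, :] = g3[:, 0] = g3[:, 7] = 1  # an enclosing box

glyphs = np.stack([g0, g1, g2, g3])
mean, components, embeddings = fit_eigencharacters(glyphs, n_components=3)
print(most_similar(embeddings, 0))  # -> [1]
```

With only four toy glyphs, three components span the whole centered data, so distances in the embedding equal distances in pixel space. With thousands of characters, one would presumably keep far fewer components than pixels, trading exactness for a compact representation.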
Anthology ID:
D19-6404
Volume:
Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN)
Month:
November
Year:
2019
Address:
Hong Kong, China
Editors:
Aditya Mogadala, Dietrich Klakow, Sandro Pezzelle, Marie-Francine Moens
Venue:
WS
Publisher:
Association for Computational Linguistics
Pages:
24–28
URL:
https://aclanthology.org/D19-6404
DOI:
10.18653/v1/D19-6404
Cite (ACL):
Yu-Hsiang Tseng and Shu-Kai Hsieh. 2019. Eigencharacter: An Embedding of Chinese Character Orthography. In Proceedings of the Beyond Vision and LANguage: inTEgrating Real-world kNowledge (LANTERN), pages 24–28, Hong Kong, China. Association for Computational Linguistics.
Cite (Informal):
Eigencharacter: An Embedding of Chinese Character Orthography (Tseng & Hsieh, 2019)
PDF:
https://preview.aclanthology.org/dois-2013-emnlp/D19-6404.pdf