UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database

Canwen Xu, Tao Ge, Chenliang Li, Furu Wei


Abstract
Chinese and Japanese share many characters with similar surface morphology. To better utilize the shared knowledge across the languages, we propose UnihanLM, a self-supervised Chinese-Japanese pretrained masked language model (MLM) with a novel two-stage coarse-to-fine training approach. We exploit Unihan, a ready-made database constructed by linguistic experts, to first merge morphologically similar characters into clusters. The resulting clusters are used to replace the original characters in sentences for the coarse-grained pretraining of the MLM. Then, we restore the clusters to the original characters in sentences for the fine-grained pretraining, which learns the representations of the specific characters. We conduct extensive experiments on a variety of Chinese and Japanese NLP benchmarks, showing that our proposed UnihanLM is effective on both mono- and cross-lingual Chinese and Japanese tasks, shedding light on a new path to exploit the homology of languages.
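The two-stage input preparation described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the variant pairs below are a hand-picked toy list rather than data read from the Unihan database, and the names `VARIANT_PAIRS`, `build_clusters`, and `to_coarse` are hypothetical. The sketch merges variant characters into clusters with a union-find structure, then produces the cluster-level text that a coarse-grained stage would see, while the fine-grained stage would see the original characters.

```python
# Toy illustration only: these pairs are hand-picked, not loaded from Unihan.
VARIANT_PAIRS = [
    ("氣", "気"),  # traditional Chinese / Japanese
    ("氣", "气"),  # traditional Chinese / simplified Chinese
    ("學", "学"),  # traditional Chinese / simplified Chinese and Japanese
]


def build_clusters(pairs):
    """Merge variant pairs with union-find; return char -> cluster representative."""
    parent = {}

    def find(ch):
        parent.setdefault(ch, ch)
        while parent[ch] != ch:
            parent[ch] = parent[parent[ch]]  # path halving
            ch = parent[ch]
        return ch

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Arbitrary convention: the lowest code point represents the cluster.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    return {ch: find(ch) for ch in list(parent)}


def to_coarse(sentence, clusters):
    """Replace each character with its cluster representative (coarse-stage input)."""
    return "".join(clusters.get(ch, ch) for ch in sentence)


clusters = build_clusters(VARIANT_PAIRS)

for original in ["天氣很好", "天気がいい", "天气很好"]:
    coarse = to_coarse(original, clusters)  # what the coarse stage would train on
    fine = original                         # the fine stage uses the original text
    print(coarse, "<-", fine)
```

In this toy setup the three spellings of the character for "weather/air" collapse to one symbol in the coarse view, so a model trained on it would share a single representation across the Chinese and Japanese sentences before the fine stage separates them again.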
Anthology ID:
2020.aacl-main.24
Volume:
Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing
Month:
December
Year:
2020
Address:
Suzhou, China
Editors:
Kam-Fai Wong, Kevin Knight, Hua Wu
Venue:
AACL
Publisher:
Association for Computational Linguistics
Pages:
201–211
URL:
https://aclanthology.org/2020.aacl-main.24
Cite (ACL):
Canwen Xu, Tao Ge, Chenliang Li, and Furu Wei. 2020. UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database. In Proceedings of the 1st Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics and the 10th International Joint Conference on Natural Language Processing, pages 201–211, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database (Xu et al., AACL 2020)
PDF:
https://preview.aclanthology.org/emnlp-22-attachments/2020.aacl-main.24.pdf
Code:
jetrunner/unihan-lm
Data:
PAWS-X, Word Sense Disambiguation: a Unified Evaluation Framework and Empirical Comparison