An Unsupervised Framework for Adaptive Context-aware Simplified-Traditional Chinese Character Conversion

Wei Li, Shutan Huang, Yanqiu Shao


Abstract
Traditional Chinese characters are an important carrier of Chinese culture and are still actively used in many regions. Automatic conversion between traditional and simplified Chinese characters can help modern readers understand traditional culture and facilitate communication across regions. Previous conversion methods rely on rule-based mappings or shallow feature-based machine learning models, which struggle to convert simplified characters that merge different traditional origins, and constructing labeled training data is costly. In this study, we propose an unsupervised adaptive context-aware conversion model that learns to convert between simplified and traditional Chinese characters under a denoising auto-encoder framework, requiring no labeled data. Our model includes a Latent Generative Adversarial Encoder that transforms input vectors into a latent space with a generative adversarial network, which introduces noise as an inevitable side effect; building on this, a Context-aware Semantic Reconstruction Decoder restores the original input while considering a broader range of context with a pretrained language model. Additionally, we apply an early exit mechanism during inference to reduce computational complexity and improve generalization. To test the effectiveness of our model, we construct a high-quality test dataset of simplified-traditional Chinese character text pairs. Experimental results and extensive analysis demonstrate that our model outperforms strong unsupervised baselines and yields better conversion results for one-to-many cases.
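The sketch below illustrates the denoising auto-encoder idea described in the abstract: an adversarially trained latent encoder perturbs character representations, and a context-aware decoder reconstructs the original characters by attending over the whole sequence, which is what allows one-to-many simplified-to-traditional mappings to be disambiguated by context. It is a minimal illustration only; all module names, dimensions, the Transformer stand-in for the pretrained language model, and the omission of the adversarial and early exit components are assumptions, not the authors' released implementation.

```python
# Minimal sketch (PyTorch) of an unsupervised denoising auto-encoder for character
# conversion. Hypothetical module names and sizes; the adversarial training loop and
# the early exit mechanism mentioned in the abstract are not shown.
import torch
import torch.nn as nn

class LatentAdversarialEncoder(nn.Module):
    """Maps character embeddings into a latent space; a discriminator (unused in this
    forward pass) would push latent codes toward a prior, effectively injecting noise."""
    def __init__(self, d_model=256, d_latent=256):
        super().__init__()
        self.generator = nn.Sequential(nn.Linear(d_model, d_latent), nn.Tanh())
        self.discriminator = nn.Linear(d_latent, 1)  # prior sample vs. generated code

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        return self.generator(x)          # noisy latent codes z

class ContextAwareDecoder(nn.Module):
    """Restores the original characters from noisy latent codes, attending over the
    full sequence so that ambiguous characters are resolved by context."""
    def __init__(self, d_latent=256, vocab_size=8000, n_heads=4, n_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_latent, n_heads, batch_first=True)
        self.contextualizer = nn.TransformerEncoder(layer, n_layers)
        self.out = nn.Linear(d_latent, vocab_size)

    def forward(self, z):                 # z: (batch, seq_len, d_latent)
        return self.out(self.contextualizer(z))  # logits over the character vocabulary

# Denoising reconstruction objective: encode, then recover the input characters.
vocab_size, d_model = 8000, 256
embed = nn.Embedding(vocab_size, d_model)
encoder = LatentAdversarialEncoder(d_model)
decoder = ContextAwareDecoder(d_model, vocab_size)

chars = torch.randint(0, vocab_size, (2, 16))          # toy batch of character ids
logits = decoder(encoder(embed(chars)))
recon_loss = nn.CrossEntropyLoss()(logits.view(-1, vocab_size), chars.view(-1))
recon_loss.backward()                                   # adversarial loss for the encoder omitted
```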
Anthology ID:
2024.lrec-main.118
Volume:
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Month:
May
Year:
2024
Address:
Torino, Italia
Editors:
Nicoletta Calzolari, Min-Yen Kan, Veronique Hoste, Alessandro Lenci, Sakriani Sakti, Nianwen Xue
Venues:
LREC | COLING
Publisher:
ELRA and ICCL
Pages:
1318–1326
URL:
https://aclanthology.org/2024.lrec-main.118
Cite (ACL):
Wei Li, Shutan Huang, and Yanqiu Shao. 2024. An Unsupervised Framework for Adaptive Context-aware Simplified-Traditional Chinese Character Conversion. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 1318–1326, Torino, Italia. ELRA and ICCL.
Cite (Informal):
An Unsupervised Framework for Adaptive Context-aware Simplified-Traditional Chinese Character Conversion (Li et al., LREC-COLING 2024)
PDF:
https://preview.aclanthology.org/nschneid-patch-2/2024.lrec-main.118.pdf