Chun Lin



2023

Improved Unsupervised Chinese Word Segmentation Using Pre-trained Knowledge and Pseudo-labeling Transfer
Hsiu-Wen Li | Ying-Jia Lin | Yi-Ting Li | Chun Lin | Hung-Yu Kao
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Unsupervised Chinese word segmentation (UCWS) has made progress by incorporating linguistic knowledge from pre-trained language models using parameter-free probing techniques. However, such approaches suffer from increased training time due to the need for multiple inferences using a pre-trained language model to perform word segmentation. This work introduces a novel way to enhance UCWS performance while maintaining training efficiency. Our proposed method integrates the segmentation signal from the unsupervised segmental language model into the pre-trained BERT classifier under a pseudo-labeling framework. Experimental results demonstrate that our approach achieves state-of-the-art performance on the eight UCWS tasks while considerably reducing the training time compared to previous approaches.

Improving Multi-Criteria Chinese Word Segmentation through Learning Sentence Representation
Chun Lin | Ying-Jia Lin | Chia-Jen Yeh | Yi-Ting Li | Ching-Wen Yang | Hung-Yu Kao
Findings of the Association for Computational Linguistics: EMNLP 2023

Recent Chinese word segmentation (CWS) models have shown competitive performance with pre-trained language models’ knowledge. However, these models tend to learn the segmentation knowledge through in-vocabulary words rather than understanding the meaning of the entire context. To address this issue, we introduce a context-aware approach that incorporates unsupervised sentence representation learning over different dropout masks into the multi-criteria training framework. We demonstrate that our approach achieves state-of-the-art (SoTA) F1 scores on six of the nine CWS benchmark datasets and SoTA out-of-vocabulary (OOV) recall on eight of the nine. Further experiments show that various sentence representation objectives can bring substantial improvements.