2022
pdf | abs
Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation
Shulin Huang | Shirong Ma | Yinghui Li | Li Yangning | Shiyang Lin | Haitao Zheng | Ying Shen
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Controllable Text Generation (CTG) has achieved great success thanks to its fine-grained generation ability, obtained by focusing on multiple attributes. However, most existing CTG research overlooks how to utilize attribute entanglement to enhance the diversity of controlled generated texts. Facing this dilemma, we focus on a novel CTG scenario, i.e., blessing generation, which is challenging because high-quality blessing texts require CTG models to comprehensively consider the entanglement between multiple attributes (e.g., objects and occasions). To promote research on blessing generation, we present EBleT, a large-scale Entangled Blessing Text dataset containing 293K English sentences annotated with multiple attributes. Furthermore, we propose novel evaluation metrics to measure the quality of the blessing texts generated by our designed baseline models. Our study opens a new research direction for controllable text generation and enables the development of attribute-entangled CTG models.
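A minimal sketch of how an EBleT-style record with entangled attributes might be represented and serialized into a conditioning prefix for a generator. The field names follow the attribute examples in the abstract ("objects" and "occasions"); the record class, prompt template, and sample sentence are hypothetical illustrations, not the paper's actual data format.

```python
from dataclasses import dataclass

@dataclass
class BlessingRecord:
    text: str       # the blessing sentence itself
    obj: str        # who the blessing is addressed to, e.g. "friend"
    occasion: str   # the event it celebrates, e.g. "birthday"

def to_control_prompt(record: BlessingRecord) -> str:
    """Serialize the attribute pair into a single conditioning prefix,
    so the model sees both attributes jointly (entangled) rather than
    as independent control codes."""
    return f"[object: {record.obj}] [occasion: {record.occasion}] Blessing:"

sample = BlessingRecord(
    text="Wishing you a year filled with joy and laughter!",
    obj="friend",
    occasion="birthday",
)
print(to_control_prompt(sample))
# [object: friend] [occasion: birthday] Blessing:
```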
pdf | abs
Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking
Yinghui Li | Shirong Ma | Qingyu Zhou | Zhongli Li | Li Yangning | Shulin Huang | Ruiyang Liu | Chao Li | Yunbo Cao | Haitao Zheng
Findings of the Association for Computational Linguistics: EMNLP 2022
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors. Recent studies start from the pretrained knowledge of language models and incorporate multimodal information into CSC models to improve performance. However, they overlook the rich knowledge in the dictionary, the reference book from which one can learn how a character should be pronounced, written, and used. In this paper, we propose the LEAD framework, which enables the CSC model to learn heterogeneous knowledge from the dictionary in terms of phonetics, vision, and meaning. LEAD first constructs positive and negative samples according to the knowledge of character phonetics, glyphs, and definitions in the dictionary. Then a unified contrastive learning-based training scheme is employed to refine the representations of the CSC models. Extensive experiments and detailed analyses on the SIGHAN benchmark datasets demonstrate the effectiveness of our proposed methods.
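A minimal sketch of a contrastive objective in the spirit of the abstract's training scheme: dictionary-derived positives (e.g., characters sharing phonetics, glyphs, or definitions) are pulled toward an anchor representation while negatives are pushed away. This is a generic InfoNCE-style loss written for illustration; LEAD's exact sampling strategy and loss formulation may differ.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(anchor, positive, negatives, temperature=0.07):
    """anchor: (d,), positive: (d,), negatives: (n, d) embeddings."""
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negatives = F.normalize(negatives, dim=-1)
    # Cosine similarities between the anchor and each candidate.
    pos_sim = anchor @ positive            # scalar
    neg_sim = negatives @ anchor           # (n,)
    logits = torch.cat([pos_sim.unsqueeze(0), neg_sim]) / temperature
    # The positive sits at index 0, so the loss is cross-entropy
    # against label 0.
    return F.cross_entropy(logits.unsqueeze(0),
                           torch.zeros(1, dtype=torch.long))

# Toy usage with random character representations:
anchor, positive = torch.randn(128), torch.randn(128)
negatives = torch.randn(8, 128)
print(info_nce_loss(anchor, positive, negatives).item())
```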
pdf | abs
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction
Shirong Ma | Yinghui Li | Rongyi Sun | Qingyu Zhou | Shulin Huang | Ding Zhang | Li Yangning | Ruiyang Liu | Zhongli Li | Yunbo Cao | Haitao Zheng | Ying Shen
Findings of the Association for Computational Linguistics: EMNLP 2022
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, the field faces two major limitations: First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real-world applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also show that our benchmark is an excellent resource for further development of the CGEC field.
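A minimal sketch of rule-based corpus construction as described in the abstract: clean sentences are corrupted with synthetic grammatical errors to yield (erroneous source, correct target) training pairs. The two toy rules below (token duplication and adjacent-token swap) are illustrative stand-ins; the paper derives its actual rules from Chinese linguistics.

```python
import random

def duplicate_token(tokens):
    """Redundancy-style error: repeat a random token."""
    i = random.randrange(len(tokens))
    return tokens[:i + 1] + [tokens[i]] + tokens[i + 1:]

def swap_adjacent(tokens):
    """Word-order-style error: swap two neighboring tokens."""
    if len(tokens) < 2:
        return tokens
    i = random.randrange(len(tokens) - 1)
    tokens = tokens[:]
    tokens[i], tokens[i + 1] = tokens[i + 1], tokens[i]
    return tokens

RULES = [duplicate_token, swap_adjacent]

def make_pair(sentence):
    """Apply one randomly chosen corruption rule to a clean sentence."""
    tokens = list(sentence)                # character-level, as in Chinese
    corrupted = random.choice(RULES)(tokens)
    return "".join(corrupted), sentence    # (source with error, target)

random.seed(0)
print(make_pair("他昨天去了图书馆"))
```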