2024
Evaluating Robustness of Generative Search Engine on Adversarial Factoid Questions
Xuming Hu
|
Xiaochuan Li
|
Junzhe Chen
|
Yinghui Li
|
Yangning Li
|
Xiaoguang Li
|
Yasheng Wang
|
Qun Liu
|
Lijie Wen
|
Philip Yu
|
Zhijiang Guo
Findings of the Association for Computational Linguistics ACL 2024
Generative search engines have the potential to transform how people seek information online, but generated responses from existing large language models (LLMs)-backed generative search engines may not always be accurate. Nonetheless, retrieval-augmented generation exacerbates safety concerns, since adversaries may successfully evade the entire system by subtly manipulating the most vulnerable part of a claim. To this end, we propose evaluating the robustness of generative search engines in the realistic and high-risk setting, where adversaries have only black-box system access and seek to deceive the model into returning incorrect responses. Through a comprehensive human evaluation of various generative search engines, such as Bing Chat, PerplexityAI, and YouChat across diverse queries, we demonstrate the effectiveness of adversarial factual questions in inducing incorrect responses. Moreover, retrieval-augmented generation exhibits a higher susceptibility to factual errors compared to LLMs without retrieval. These findings highlight the potential security risks of these systems and emphasize the need for rigorous evaluation before deployment. The dataset and code will be publicly available.
GCNet: Global-and-Context Collaborative Learning for Aspect-Based Sentiment Analysis
Ting Zhou
|
Ying Shen
|
Yinghui Li
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Aspect-Based Sentiment Analysis (ABSA) aims to determine the sentiment polarities of specified aspect terms in a sentence. Most previous approaches use an attention mechanism or graph neural networks based on dependency trees to explicitly model the connections between aspect terms and opinion words. However, these methods may not effectively address cases where the sentiment of an aspect term is implicitly described, as the corresponding opinion words may not directly appear in the sentence. To alleviate this issue, in this paper, we propose GCNet, which explicitly leverages global semantic information to guide context encoding. In particular, we design a semantics encoding module that incorporates global semantic features into the sequential modeling process to enable the consideration of the overall sentiment tendency of a sentence, while the global semantic features are also refined by adaptively focusing on different parts of the sentence. Moreover, for a comprehensive sentence analysis, we also include a syntactic feature encoding module along with a pre-fusion module to integrate the refined global features with the syntactic representations. Extensive experiments on three public datasets demonstrate that our model outperforms state-of-the-art methods, indicating the robustness and effectiveness of our approach.
LatEval: An Interactive LLMs Evaluation Benchmark with Incomplete Information from Lateral Thinking Puzzles
Shulin Huang
|
Shirong Ma
|
Yinghui Li
|
Mengzuo Huang
|
Wuhe Zou
|
Weidong Zhang
|
Haitao Zheng
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
With the evolution of LLMs, they are endowed with impressive logical reasoning, or vertical thinking, capabilities. But can they think out of the box? Do they possess proficient lateral thinking abilities? Following the setup of Lateral Thinking Puzzles, we propose a novel evaluation benchmark, LatEval, which assesses the model’s lateral thinking within an interactive framework. In our benchmark, we challenge LLMs with two aspects: (1) posing high-quality questions that break out of conventional norms but are beneficial for puzzle-solving, and (2) integrating existing information to gradually deduce the truth through reasoning. We observe that it is hard for most LLMs to accomplish lateral thinking during interactions. Even the most powerful LLM, GPT-4, faces challenges in achieving satisfactory performance, and for most open-source models, simply completing this task is quite difficult. This evaluation benchmark provides LLMs with a highly challenging and differentiating task that is crucial to an effective AI assistant. Our dataset and source code are available at https://github.com/THUKElab/LatEval.
Source-free Domain Adaptation for Aspect-based Sentiment Analysis
Zishuo Zhao
|
Ziyang Ma
|
Zhenzhou Lin
|
Jingyou Xie
|
Yinghui Li
|
Ying Shen
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Unsupervised Domain Adaptation (UDA) of the Aspect-based Sentiment Analysis (ABSA) task aims to transfer knowledge learned from labeled source domain datasets to unlabeled target domains on the assumption that samples from the source domain are freely accessible during the training period. However, this assumption can easily lead to privacy invasion issues in real-world applications, especially when the source data involves privacy-preserving domains such as healthcare and finance. In this paper, we introduce the Source-Free Domain Adaptation Framework for ABSA (SF-ABSA), which only allows model parameter transfer, not data transfer, between different domains. Specifically, the proposed SF-ABSA framework consists of two parts, i.e., feature-based adaptation and pseudo-label-based adaptation. Experiment results on four benchmarks show that the proposed framework performs competitively with traditional unsupervised domain adaptation methods under the premise of insufficient information, which demonstrates the superiority of our method under privacy conditions.
Towards Real-World Writing Assistance: A Chinese Character Checking Benchmark with Faked and Misspelled Characters
Yinghui Li
|
Zishan Xu
|
Shaoshen Chen
|
Haojing Huang
|
Yangning Li
|
Shirong Ma
|
Yong Jiang
|
Zhongli Li
|
Qingyu Zhou
|
Hai-Tao Zheng
|
Ying Shen
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Writing assistance aims to improve the correctness and quality of input texts, with character checking being crucial in detecting and correcting wrong characters. In the real world where handwriting occupies the vast majority, characters that humans get wrong include faked characters (i.e., untrue characters created due to writing errors) and misspelled characters (i.e., true characters used incorrectly due to spelling errors). However, existing datasets and related studies only focus on misspelled characters that can be represented by computer text encoding systems, thereby ignoring faked characters which are more common and difficult. To break through this dilemma, we present Visual-C3, a human-annotated Visual Chinese Character Checking dataset with faked and misspelled Chinese characters. To the best of our knowledge, Visual-C3 is the first real-world visual and the largest human-crafted dataset for the Chinese character checking scenario. Additionally, we also propose and evaluate novel baseline methods on Visual-C3. Extensive empirical results and analyses show that Visual-C3 is high-quality yet challenging. As the first study focusing on Chinese faked characters, the dataset and the baseline methods are publicly available at https://github.com/THUKElab/Visual-C3.
2023
System Report for CCL23-Eval Task 7: THU KELab (sz) - Exploring Data Augmentation and Denoising for Chinese Grammatical Error Correction
Jingheng Ye
|
Yinghui Li
|
Haitao Zheng
Proceedings of the 22nd Chinese National Conference on Computational Linguistics (Volume 3: Evaluations)
This paper describes the GEC system submitted by THU KELab (sz) to CCL2023-Eval Task 7, CLTC (Chinese Learner Text Correction) Track 1: Multidimensional Chinese Learner Text Correction. Recent studies have demonstrated that GEC performance can be improved by increasing the amount of training data. However, high-quality public GEC data is much less abundant. To address this issue, we propose two data-driven techniques, data augmentation and data denoising, to improve GEC performance. Data augmentation creates pseudo data to enhance generalization, while data denoising removes noise from the realistic training data. The results on the official evaluation dataset YACLC demonstrate the effectiveness of our approach. Finally, our GEC system ranked second in both close and open tasks. All of our datasets and code are available at https://github.com/THUKElab/CCL2023-CLTC-THU_KELab.
MixEdit: Revisiting Data Augmentation and Beyond for Grammatical Error Correction
Jingheng Ye
|
Yinghui Li
|
Yangning Li
|
Hai-Tao Zheng
Findings of the Association for Computational Linguistics: EMNLP 2023
Data Augmentation through generating pseudo data has been proven effective in mitigating the challenge of data scarcity in the field of Grammatical Error Correction (GEC). Various augmentation strategies have been widely explored, most of which are motivated by two heuristics, i.e., increasing the distribution similarity and diversity of pseudo data. However, the underlying mechanism responsible for the effectiveness of these strategies remains poorly understood. In this paper, we aim to clarify how data augmentation improves GEC models. To this end, we introduce two interpretable and computationally efficient measures: Affinity and Diversity. Our findings indicate that an excellent GEC data augmentation strategy characterized by high Affinity and appropriate Diversity can better improve the performance of GEC models. Based on this observation, we propose MixEdit, a data augmentation approach that strategically and dynamically augments realistic data, without requiring extra monolingual corpora. To verify the correctness of our findings and the effectiveness of the proposed MixEdit, we conduct experiments on mainstream English and Chinese GEC datasets. The results show that MixEdit substantially improves GEC models and is complementary to traditional data augmentation methods. All the source codes of MixEdit are released at https://github.com/THUKElab/MixEdit.
A Frustratingly Easy Plug-and-Play Detection-and-Reasoning Module for Chinese Spelling Check
Haojing Huang
|
Jingheng Ye
|
Qingyu Zhou
|
Yinghui Li
|
Yangning Li
|
Feng Zhou
|
Hai-Tao Zheng
Findings of the Association for Computational Linguistics: EMNLP 2023
In recent years, Chinese Spelling Check (CSC) has been greatly improved by designing task-specific pre-training methods or introducing auxiliary tasks, which mostly solve this task in an end-to-end fashion. In this paper, we propose to decompose the CSC workflow into detection, reasoning, and searching subtasks so that the rich external knowledge about the Chinese language can be leveraged more directly and efficiently. Specifically, we design a plug-and-play detection-and-reasoning module that is compatible with existing SOTA non-autoregressive CSC models to further boost their performance. We find that the detection-and-reasoning module trained for one model can also benefit other models. We also study the primary interpretability provided by the task decomposition. Extensive experiments and detailed analyses demonstrate the effectiveness and competitiveness of the proposed module.
DAMO-NLP at SemEval-2023 Task 2: A Unified Retrieval-augmented System for Multilingual Named Entity Recognition
Zeqi Tan
|
Shen Huang
|
Zixia Jia
|
Jiong Cai
|
Yinghui Li
|
Weiming Lu
|
Yueting Zhuang
|
Kewei Tu
|
Pengjun Xie
|
Fei Huang
|
Yong Jiang
Proceedings of the 17th International Workshop on Semantic Evaluation (SemEval-2023)
The MultiCoNER II shared task aims to tackle multilingual named entity recognition (NER) in fine-grained and noisy scenarios, and it inherits the semantic ambiguity and low-context setting of the MultiCoNER I task. To cope with these problems, the previous top systems in MultiCoNER I incorporate either knowledge bases or gazetteers. However, they still suffer from insufficient knowledge, limited context length, and a single retrieval strategy. In this paper, our team DAMO-NLP proposes a unified retrieval-augmented system (U-RaNER) for fine-grained multilingual NER. We perform error analysis on the previous top systems and reveal that their performance bottleneck lies in insufficient knowledge. We also discover that the limited context length causes the retrieved knowledge to be invisible to the model. To enhance the retrieval context, we incorporate the entity-centric Wikidata knowledge base, while utilizing the infusion approach to broaden the contextual scope of the model. We also explore various search strategies and refine the quality of the retrieved knowledge. Our system wins 9 out of 13 tracks in the MultiCoNER II shared task. Additionally, we compare our system with ChatGPT, one of the large language models that have unlocked strong capabilities on many tasks. The results show that there is still much room for improvement for ChatGPT on the extraction task.
CLEME: Debiasing Multi-reference Evaluation for Grammatical Error Correction
Jingheng Ye
|
Yinghui Li
|
Qingyu Zhou
|
Yangning Li
|
Shirong Ma
|
Hai-Tao Zheng
|
Ying Shen
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Evaluating the performance of Grammatical Error Correction (GEC) systems is a challenging task due to its subjectivity. Designing an evaluation metric that is as objective as possible is crucial to the development of the GEC task. However, mainstream evaluation metrics, i.e., reference-based metrics, introduce bias into the multi-reference evaluation by extracting edits without considering the presence of multiple references. To overcome this issue, we propose Chunk-LE Multi-reference Evaluation (CLEME), designed to evaluate GEC systems in the multi-reference evaluation setting. CLEME builds chunk sequences with consistent boundaries for the source, the hypothesis, and the references, thus eliminating the bias caused by inconsistent edit boundaries. Furthermore, we observe that the consistent boundary could also act as the boundary of grammatical errors, based on which the F0.5 score is then computed following the correction independence assumption. We conduct experiments on six English reference sets based on the CoNLL-2014 shared task. Extensive experiments and detailed analyses demonstrate the correctness of our discovery and the effectiveness of CLEME. Further analysis reveals that CLEME is robust in evaluating GEC systems across reference sets with varying numbers of references and annotation styles. All the source code of CLEME is released at https://github.com/THUKElab/CLEME.
2022
Towards Attribute-Entangled Controllable Text Generation: A Pilot Study of Blessing Generation
Shulin Huang
|
Shirong Ma
|
Yinghui Li
|
Li Yangning
|
Shiyang Lin
|
Haitao Zheng
|
Ying Shen
Proceedings of the 2nd Workshop on Natural Language Generation, Evaluation, and Metrics (GEM)
Controllable Text Generation (CTG) has achieved great success due to its fine-grained generation ability obtained by focusing on multiple attributes. However, most existing CTG research overlooks how to utilize attribute entanglement to enhance the diversity of the controlled generated texts. Facing this dilemma, we focus on a novel CTG scenario, i.e., blessing generation, which is challenging because high-quality blessing texts require CTG models to comprehensively consider the entanglement between multiple attributes (e.g., objects and occasions). To promote research on blessing generation, we present EBleT, a large-scale Entangled Blessing Text dataset containing 293K English sentences annotated with multiple attributes. Furthermore, we propose novel evaluation metrics to measure the quality of the blessing texts generated by the baseline models we designed. Our study opens a new research direction for controllable text generation and enables the development of attribute-entangled CTG models.
The Past Mistake is the Future Wisdom: Error-driven Contrastive Probability Optimization for Chinese Spell Checking
Yinghui Li
|
Qingyu Zhou
|
Yangning Li
|
Zhongli Li
|
Ruiyang Liu
|
Rongyi Sun
|
Zizhen Wang
|
Chao Li
|
Yunbo Cao
|
Hai-Tao Zheng
Findings of the Association for Computational Linguistics: ACL 2022
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors, which are mainly caused by phonological or visual similarity. Recently, pre-trained language models (PLMs) have promoted progress on the CSC task. However, there exists a gap between the learned knowledge of PLMs and the goal of the CSC task. PLMs focus on the semantics of text and tend to correct erroneous characters to semantically proper or commonly used ones, but these aren’t the ground-truth corrections. To address this issue, we propose an Error-driven COntrastive Probability Optimization (ECOPO) framework for the CSC task. ECOPO refines the knowledge representations of PLMs and guides the model to avoid predicting these common characters in an error-driven way. In particular, ECOPO is model-agnostic and can be combined with existing CSC methods to achieve better performance. Extensive experiments and detailed analyses on SIGHAN datasets demonstrate that ECOPO is simple yet effective.
Learning from the Dictionary: Heterogeneous Knowledge Guided Fine-tuning for Chinese Spell Checking
Yinghui Li
|
Shirong Ma
|
Qingyu Zhou
|
Zhongli Li
|
Li Yangning
|
Shulin Huang
|
Ruiyang Liu
|
Chao Li
|
Yunbo Cao
|
Haitao Zheng
Findings of the Association for Computational Linguistics: EMNLP 2022
Chinese Spell Checking (CSC) aims to detect and correct Chinese spelling errors. Recent research starts from the pretrained knowledge of language models and incorporates multimodal information into CSC models to improve performance. However, it overlooks the rich knowledge in the dictionary, the reference book where one can learn how a character should be pronounced, written, and used. In this paper, we propose the LEAD framework, which enables the CSC model to learn heterogeneous knowledge from the dictionary in terms of phonetics, vision, and meaning. LEAD first constructs positive and negative samples according to the knowledge of character phonetics, glyphs, and definitions in the dictionary. Then a unified contrastive learning-based training scheme is employed to refine the representations of the CSC models. Extensive experiments and detailed analyses on the SIGHAN benchmark datasets demonstrate the effectiveness of our proposed methods.
Linguistic Rules-Based Corpus Generation for Native Chinese Grammatical Error Correction
Shirong Ma
|
Yinghui Li
|
Rongyi Sun
|
Qingyu Zhou
|
Shulin Huang
|
Ding Zhang
|
Li Yangning
|
Ruiyang Liu
|
Zhongli Li
|
Yunbo Cao
|
Haitao Zheng
|
Ying Shen
Findings of the Association for Computational Linguistics: EMNLP 2022
Chinese Grammatical Error Correction (CGEC) is both a challenging NLP task and a common application in daily life. Recently, many data-driven approaches have been proposed to advance CGEC research. However, there are two major limitations in the CGEC field. First, the lack of high-quality annotated training corpora prevents the performance of existing CGEC models from being significantly improved. Second, the grammatical errors in widely used test sets are not made by native Chinese speakers, resulting in a significant gap between CGEC models and real applications. In this paper, we propose a linguistic rules-based approach to construct large-scale CGEC training corpora with automatically generated grammatical errors. Additionally, we present a challenging CGEC benchmark derived entirely from errors made by native Chinese speakers in real-world scenarios. Extensive experiments and detailed analyses not only demonstrate that the training data constructed by our method effectively improves the performance of CGEC models, but also show that our benchmark is an excellent resource for further development of the CGEC field.