2025
Towards Harmonized Uncertainty Estimation for Large Language Models
Rui Li | Jing Long | Muge Qi | Heming Xia | Lei Sha | Peiyi Wang | Zhifang Sui
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
To facilitate the robust and trustworthy deployment of large language models (LLMs), it is essential to quantify the reliability of their generations through uncertainty estimation. While recent efforts have made significant advances by leveraging the internal logic and linguistic features of LLMs to estimate uncertainty scores, our empirical analysis highlights the failure of these methods to strike a harmonized estimation among indication, balance, and calibration, which hinders their broader capability for accurate uncertainty estimation. To address this challenge, we propose CUE (Corrector for Uncertainty Estimation): a straightforward yet effective method that employs a lightweight model, trained on data aligned with the target LLM’s performance, to adjust uncertainty scores. Comprehensive experiments across diverse models and tasks demonstrate its effectiveness, achieving consistent improvements of up to 60% over existing methods.
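As a rough illustration of the corrector idea (not the authors' implementation), a lightweight model can be fit on pairs of raw uncertainty scores and correctness labels collected from the target LLM, then used to adjust new scores; the feature set, data, and model choice below are assumptions.

```python
# Illustrative sketch only: a lightweight corrector that recalibrates raw
# uncertainty scores using correctness labels observed from the target LLM.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical calibration data: raw uncertainty scores from any estimator
# (e.g., token entropy) and whether the LLM's answer was actually correct.
raw_scores = np.array([[0.1], [0.3], [0.4], [0.7], [0.8], [0.95]])
is_correct = np.array([1, 1, 1, 0, 0, 0])

corrector = LogisticRegression()
corrector.fit(raw_scores, is_correct)

def corrected_uncertainty(score: float) -> float:
    """Map a raw uncertainty score to an adjusted probability of error."""
    p_correct = corrector.predict_proba([[score]])[0, 1]
    return 1.0 - p_correct

print(corrected_uncertainty(0.5))
```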
Multi-perspective Preference Alignment of LLMs for Programming-Community Question Answering
Hongyu Yang | Jiahui Hou | Liyang He | Rui Li
Proceedings of the 31st International Conference on Computational Linguistics
Programming-Community Question Answering (PCQA) aims to tackle issues by generating functional code and guiding descriptions. It involves multiple candidate answers, for which different users have varying preferences; additionally, a candidate may contain outdated APIs. These factors make it challenging to produce responses that meet user preferences. Recently, Reinforcement Learning from Human Feedback has demonstrated its ability to precisely control the behavior of large language models (LLMs) to yield human-like responses. However, applying it to LLMs in domain-specific PCQA remains unexplored. In this work, we propose Multi-perspective Preference Alignment for Programming-Community Question Answering (MupPCQA) to generate user-centric responses. It includes three stages: Preference Standardization to control content quality, Preference Integration to consider diverse user tendencies, and Preference Timeliness Mitigation to alleviate outdated answers. Extensive experiments on a high-quality, real-world PCQA dataset validate its accuracy and preference alignment. Compared to its base model, MupPCQA shows an improvement of nearly 11% in BLEU, with increases of 20% and 17.5% in BERTScore and CodeBERTScore, respectively.
Automated Clinical Data Extraction with Knowledge Conditioned LLMs
Diya Li | Asim Kadav | Aijing Gao | Rui Li | Richard Bourgon
Proceedings of the 31st International Conference on Computational Linguistics: Industry Track
The extraction of lung lesion information from clinical and medical imaging reports is crucial for research on and clinical care of lung-related diseases. Large language models (LLMs) can be effective at interpreting unstructured text in reports, but they often hallucinate due to a lack of domain-specific knowledge, leading to reduced accuracy and posing challenges for use in clinical settings. To address this, we propose a novel framework that aligns generated internal knowledge with external knowledge through in-context learning (ICL). Our framework employs a retriever to identify relevant units of internal or external knowledge and a grader to evaluate the truthfulness and helpfulness of the retrieved internal-knowledge rules, to align and update the knowledge bases. Experiments with expert-curated test datasets demonstrate that this ICL approach can increase the F1 score for key fields (lesion size, margin and solidity) by an average of 12.9% over existing ICL methods.
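A minimal sketch of a retrieve-then-grade loop in the spirit of the framework described above; the embedding model, the grading prompt, the example rules, and the `llm` callable are placeholders, not the paper's components.

```python
# Illustrative sketch of retrieving candidate knowledge rules for a report and
# grading them with an LLM before using them as in-context knowledge.
from sentence_transformers import SentenceTransformer, util

knowledge_rules = [
    "Lesion size is reported as the longest axial diameter in millimeters.",
    "A spiculated margin suggests malignancy more often than a smooth one.",
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")
rule_vecs = embedder.encode(knowledge_rules, convert_to_tensor=True)

def retrieve(report: str, top_k: int = 1) -> list[str]:
    """Return the knowledge rules most similar to the report text."""
    query_vec = embedder.encode(report, convert_to_tensor=True)
    hits = util.semantic_search(query_vec, rule_vecs, top_k=top_k)[0]
    return [knowledge_rules[h["corpus_id"]] for h in hits]

def grade(rule: str, report: str, llm) -> bool:
    """Ask an LLM grader whether the rule is truthful and helpful here."""
    prompt = (f"Report: {report}\nRule: {rule}\n"
              "Is this rule truthful and helpful for extraction? Answer yes or no.")
    return llm(prompt).strip().lower().startswith("yes")
```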
Layer-Aware Representation Filtering: Purifying Finetuning Data to Preserve LLM Safety Alignment
Hao Li | Lijun Li | Zhenghao Lu | Xianyi Wei | Rui Li | Jing Shao | Lei Sha
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
With the rapid advancement and increasing accessibility of LLMs, fine-tuning aligned models has become a critical step in adapting them to real-world applications, which makes the safety of this fine-tuning process more important than ever. However, recent studies have highlighted a critical challenge: even when fine-tuning on seemingly benign downstream datasets, the safety of aligned LLMs can be compromised, making them more susceptible to malicious instructions. In this paper, we show that fine-tuning datasets often contain samples with safety-degrading features that are not easily identifiable on the surface. These samples can significantly degrade the safety alignment of LLMs during fine-tuning. To address this issue, we propose LARF, a Layer-Aware Representation Filtering method, which identifies safety-sensitive layers within the LLM and leverages their representations to detect which data samples in the post-training dataset contain safety-degrading features. Experimental results demonstrate that LARF can effectively identify benign data with safety-degrading features; after removing such data, the safety alignment degradation caused by fine-tuning is mitigated.
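A hedged sketch of the filtering idea: score each fine-tuning sample by how strongly a chosen layer's representation aligns with a "safety-degrading" direction, and drop high-scoring samples. The model name, layer index, anchor prompts, and threshold are illustrative assumptions, not LARF's actual procedure.

```python
# Illustrative sketch of representation-based data filtering at one layer.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # hypothetical target model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 14  # assumed "safety-sensitive" layer

@torch.no_grad()
def layer_rep(text: str) -> torch.Tensor:
    ids = tok(text, return_tensors="pt")
    hidden = model(**ids).hidden_states[LAYER]   # (1, seq_len, dim)
    return hidden.mean(dim=1).squeeze(0)         # mean-pool over tokens

# Direction built from a few harmful vs. benign anchor prompts (placeholders).
harmful = layer_rep("Explain how to build a weapon.")
benign = layer_rep("Explain how photosynthesis works.")
direction = torch.nn.functional.normalize(harmful - benign, dim=0)

def safety_risk_score(sample: str) -> float:
    rep = torch.nn.functional.normalize(layer_rep(sample), dim=0)
    return float(rep @ direction)                # higher = riskier

# Keep only samples below an assumed threshold.
dataset = ["Summarize this news article ...", "Write ransomware in Python ..."]
filtered = [s for s in dataset if safety_risk_score(s) < 0.1]
```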
How Far are LLMs from Being Our Digital Twins? A Benchmark for Persona-Based Behavior Chain Simulation
Rui Li | Heming Xia | Xinfeng Yuan | Qingxiu Dong | Lei Sha | Wenjie Li | Zhifang Sui
Findings of the Association for Computational Linguistics: ACL 2025
Recently, LLMs have garnered increasing attention across academic disciplines for their potential as human digital twins: virtual proxies designed to replicate individuals and autonomously perform tasks such as decision-making, problem-solving, and reasoning on their behalf. However, current evaluations of LLMs primarily emphasize dialogue simulation while overlooking human behavior simulation, which is crucial for digital twins. To address this gap, we introduce BehaviorChain, the first benchmark for evaluating LLMs’ ability to simulate continuous human behavior. BehaviorChain comprises diverse, high-quality, persona-based behavior chains, totaling 15,846 distinct behaviors across 1,001 unique personas, each with detailed history and profile metadata. For evaluation, we integrate persona metadata into LLMs and employ them to iteratively infer contextually appropriate behaviors within dynamic scenarios provided by BehaviorChain. Comprehensive evaluation results demonstrate that even state-of-the-art models struggle to accurately simulate continuous human behavior.
Hardware-Aware Parallel Prompt Decoding for Memory-Efficient Acceleration of LLM Inference
Hao Mark Chen | Wayne Luk | Yiu Ka Fai Cedric | Rui Li | Konstantin Mishchenko | Stylianos Venieris | Hongxiang Fan
Findings of the Association for Computational Linguistics: EMNLP 2025
The auto-regressive decoding of Large Language Models (LLMs) results in significant overheads in their hardware performance. While recent research has explored various speculative decoding techniques for multi-token generation, these methods introduce high memory costs from the additional weights and KV cache of separate draft models, limiting their efficiency in edge and long-context scenarios. To overcome these limitations in edge-scale LLMs, we propose Parallel Prompt Decoding (PPD), a novel approach that incurs only runtime memory overhead by employing a unified single model for both speculation and verification. Inspired by the human natural language generation process, PPD approximates outputs generated at future timesteps in parallel by using multiple prompt tokens. Furthermore, we present a hardware-aware two-stage tree pruning algorithm that adaptively optimizes this decoding scheme to fully leverage the computational capacity of different GPUs. Through extensive experiments across LLMs ranging from MobileLlama to Vicuna-13B on a wide range of benchmarks, our approach demonstrates up to a 2.49x speedup. Moreover, parallel prompt decoding can serve as an orthogonal optimization for synergistic integration with existing speculative decoding, showing up to a further 1.22x speed improvement. To support future development, we have included our code implementation with this submission.
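The speculate-then-verify control flow behind this family of methods can be sketched abstractly as below; `draft_with_prompt_tokens` and `model_next_token` are hypothetical stand-ins for model calls, and the per-token verification is written sequentially for clarity, whereas the method's benefit comes from scoring all guesses in a single forward pass.

```python
# Toy sketch of speculate-then-verify decoding with prompt-token drafts.
def generate(prefix, model_next_token, draft_with_prompt_tokens, max_len=32):
    out = list(prefix)
    while len(out) < max_len:
        guesses = draft_with_prompt_tokens(out)   # k speculative future tokens
        for g in guesses:
            real_next = model_next_token(out)     # what the model really produces
            if g == real_next:
                out.append(g)                     # accepted: token kept for free
            else:
                out.append(real_next)             # rejected: fall back to the model
                break
        else:
            out.append(model_next_token(out))     # all guesses accepted: bonus token
        if out[-1] == "<eos>":
            break
    return out

# Toy usage with mock functions that always "know" the target sentence.
sentence = "the cat sat on the mat <eos>".split()
model_next = lambda out: sentence[len(out)] if len(out) < len(sentence) else "<eos>"
draft = lambda out: sentence[len(out):len(out) + 2]   # perfect 2-token guesses
print(generate(["the"], model_next, draft))
```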
Beyond Single Frames: Can LMMs Comprehend Implicit Narratives in Comic Strip?
Xiaochen Wang | Heming Xia | Jialin Song | Longyu Guan | Qingxiu Dong | Rui Li | Yixin Yang | Yifan Pu | Weiyao Luo | Yiru Wang | Xiangdi Meng | Wenjie Li | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Multimodal Models (LMMs) have demonstrated strong performance on vision-language benchmarks, yet current evaluations predominantly focus on single-image reasoning. In contrast, real-world scenarios often involve understanding sequences of images. A typical case is comic strip understanding, which requires models to perform nuanced visual reasoning beyond surface-level recognition. To address this gap, we introduce STRIPCIPHER, a benchmark designed to evaluate model ability to understand implicit narratives in silent comics. STRIPCIPHER is a high-quality, human-annotated dataset featuring fine-grained annotations and comprehensive coverage of varying difficulty levels. It comprises three tasks: visual narrative comprehension, contextual frame prediction, and temporal narrative reordering. Notably, evaluation results on STRIPCIPHER reveal a significant gap between current LMMs and human performance; for example, GPT-4o achieves only 23.93% accuracy on the reordering task, 56.07% below human levels. These findings underscore the limitations of current LMMs in implicit visual narrative understanding and highlight opportunities for advancing sequential multimodal reasoning.
2024
Wonder at Chemotimelines 2024: MedTimeline: An End-to-End NLP System for Timeline Extraction from Clinical Narratives
Liwei Wang | Qiuhao Lu | Rui Li | Sunyang Fu | Hongfang Liu
Proceedings of the 6th Clinical Natural Language Processing Workshop
Extracting timeline information from clinical narratives is critical for cancer research and practice using electronic health records (EHRs). In this study, we apply MedTimeline, our end-to-end hybrid NLP system that combines large language models and deep learning with knowledge engineering, to the ChemoTimeLine challenge subtasks. Our system achieves scores of 0.83, 0.90, and 0.84 on subtask 1 and 0.53, 0.63, and 0.39 on subtask 2, for breast cancer, melanoma, and ovarian cancer, respectively.
A Survey on In-context Learning
Qingxiu Dong | Lei Li | Damai Dai | Ce Zheng | Jingyuan Ma | Rui Li | Heming Xia | Jingjing Xu | Zhiyong Wu | Baobao Chang | Xu Sun | Lei Li | Zhifang Sui
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
With the increasing capabilities of large language models (LLMs), in-context learning (ICL) has emerged as a new paradigm for natural language processing (NLP), where LLMs make predictions based on contexts augmented with a few examples. It has become a significant trend to explore ICL to evaluate and extrapolate the abilities of LLMs. In this paper, we aim to survey and summarize the progress and challenges of ICL. We first present a formal definition of ICL and clarify its correlation with related studies. Then, we organize and discuss advanced techniques, including training strategies, prompt design strategies, and related analysis. Additionally, we explore various ICL application scenarios, such as data engineering and knowledge updating. Finally, we address the challenges of ICL and suggest potential directions for further research. We hope that our work can encourage more research on uncovering how ICL works and improving ICL.
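A minimal sketch of the core mechanism the survey covers: a few demonstrations are concatenated ahead of the query and the model completes the pattern without any parameter update (`llm` stands in for any completion-style model call).

```python
# Minimal illustration of in-context learning with few-shot demonstrations.
demonstrations = [
    ("The movie was a delight from start to finish.", "positive"),
    ("I want my two hours back.", "negative"),
]
query = "The plot was thin but the acting saved it."

prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in demonstrations)
prompt += f"\nReview: {query}\nSentiment:"

# response = llm(prompt)   # `llm` is a placeholder for any LLM completion call
print(prompt)
```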
When Generative Adversarial Networks Meet Sequence Labeling Challenges
Yu Tong | Ge Chen | Guokai Zheng | Rui Li | Jiang Dazhi
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The current framework for sequence labeling comprises a feature extractor and a sequence tagger. This study introduces a unified framework named SLGAN, which harnesses the capabilities of Generative Adversarial Networks to address the challenges associated with sequence labeling tasks. SLGAN not only mitigates the limitation of GANs in backpropagating loss to discrete data but also exhibits strong adaptability to various sequence labeling tasks. Unlike traditional GANs, the discriminator within SLGAN does not discriminate whether data originates from the ground truth or the generator; instead, it focuses on predicting the correctness of each tag within the tag sequence. We conducted evaluations on six different tasks spanning four languages, including Chinese, Japanese, and Korean Word Segmentation, Chinese and English Named Entity Recognition, and Chinese Part-of-Speech Tagging. Our experimental results illustrate that SLGAN represents a versatile and highly effective solution, consistently achieving state-of-the-art or competitive performance, irrespective of the specific task or language under consideration.
Be a Multitude to Itself: A Prompt Evolution Framework for Red Teaming
Rui Li | Peiyi Wang | Jingyuan Ma | Di Zhang | Lei Sha | Zhifang Sui
Findings of the Association for Computational Linguistics: EMNLP 2024
Large Language Models (LLMs) have gained increasing attention for their remarkable capacity, alongside concerns about safety arising from their potential to produce harmful content. Red teaming aims to find prompts that could elicit harmful responses from LLMs, and is essential to discover and mitigate safety risks before real-world deployment. However, manual red teaming is both time-consuming and expensive, rendering it unscalable. In this paper, we propose RTPE, a scalable evolution framework to evolve red teaming prompts across both breadth and depth dimensions, facilitating the automatic generation of numerous high-quality and diverse red teaming prompts. Specifically, in-breadth evolving employs a novel enhanced in-context learning method to create a multitude of quality prompts, whereas in-depth evolving applies customized transformation operations to enhance both content and form of prompts, thereby increasing diversity. Extensive experiments demonstrate that RTPE surpasses existing representative automatic red teaming methods on both attack success rate and diversity. In addition, based on 4,800 red teaming prompts created by RTPE, we further provide a systematic analysis of 8 representative LLMs across 8 sensitive topics.
ShieldLM: Empowering LLMs as Aligned, Customizable and Explainable Safety Detectors
Zhexin Zhang | Yida Lu | Jingyuan Ma | Di Zhang | Rui Li | Pei Ke | Hao Sun | Lei Sha | Zhifang Sui | Hongning Wang | Minlie Huang
Findings of the Association for Computational Linguistics: EMNLP 2024
The safety of Large Language Models (LLMs) has gained increasing attention in recent years, but a comprehensive approach for detecting safety issues within LLMs’ responses in an aligned, customizable, and explainable manner is still lacking. In this paper, we propose ShieldLM, an LLM-based safety detector, which aligns with common safety standards, supports customizable detection rules, and provides explanations for its decisions. To train ShieldLM, we compile a large bilingual dataset comprising 14,387 query-response pairs, annotating the safety of responses based on various safety standards. Through extensive experiments, we demonstrate that ShieldLM surpasses strong baselines across four test sets, showcasing remarkable customizability and explainability. Besides performing well on standard detection datasets, ShieldLM has also been shown to be effective as a safety evaluator for advanced LLMs. ShieldLM is released at https://github.com/thu-coai/ShieldLM to support accurate and explainable safety detection under various safety standards.
Breakthrough from Nuance and Inconsistency: Enhancing Multimodal Sarcasm Detection with Context-Aware Self-Attention Fusion and Word Weight Calculation.
Hongfei Xue | Linyan Xu | Yu Tong | Rui Li | Jiali Lin | Dazhi Jiang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Multimodal sarcasm detection has received considerable attention due to its unique role in social networks. Existing methods often rely on feature concatenation to fuse different modalities or model the inconsistencies among modalities. However, sarcasm is often embodied subtly in local and momentary nuances, which makes it difficult to detect. To effectively incorporate these nuances, this paper presents Context-Aware Self-Attention Fusion (CAAF) to integrate local and momentary multimodal information into specific words. Furthermore, due to the instantaneous nature of sarcasm, the connotative meanings of words after multimodal integration generally deviate from their denotative meanings. Therefore, Word Weight Calculation (WWC) is presented to compute the weight of specific words based on CAAF’s fused nuances, capturing the inconsistency between connotation and denotation. We evaluate our method on the MUStARD dataset, achieving an accuracy of 76.9 and an F1 score of 76.1, surpassing the current state-of-the-art IWAN model by 1.7 and 1.6 points, respectively.
Feature Structure Matching for Multi-source Sentiment Analysis with Efficient Adaptive Tuning
Rui Li | Cheng Liu | Yu Tong | Jiang Dazhi
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Recently, fine-tuning large pre-trained language models on labeled sentiment datasets has achieved appealing performance. However, the obtained model may not generalize well to other domains due to domain shift, and it is expensive to update all the parameters of such large models. Although some existing domain matching methods have been proposed to alleviate these issues, in practice there are often multiple relevant source domains, which makes the whole training more costly and complicated. To this end, we focus on the efficient unsupervised multi-source sentiment adaptation task, which is more challenging and beneficial for real-world applications. Specifically, we propose to extract multi-layer features from the large pre-trained model and design a dynamic parameter fusion module to exploit these features for both efficient and adaptive tuning. Furthermore, we propose a novel feature structure matching constraint, which enforces similar feature-wise correlations across different domains. Compared with traditional domain matching methods, which tend to pull all feature instances close, we show that the proposed feature structure matching is more robust and generalizable in the multi-source scenario. Extensive experiments on several multi-source sentiment analysis benchmarks demonstrate the effectiveness and superiority of our proposed framework.
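One way to picture a feature-structure matching constraint is a loss that penalizes differences between feature-wise correlation matrices computed on batches from two domains; the PyTorch sketch below is an assumption-laden illustration, not the paper's exact formulation.

```python
# Hedged sketch of a feature-structure matching loss between two domains.
import torch

def correlation_matrix(feats: torch.Tensor) -> torch.Tensor:
    """feats: (batch, dim) -> (dim, dim) feature-wise correlation matrix."""
    centered = feats - feats.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (feats.size(0) - 1)
    std = centered.std(dim=0).clamp_min(1e-6)
    return cov / (std.unsqueeze(0) * std.unsqueeze(1))

def structure_matching_loss(source_feats: torch.Tensor,
                            target_feats: torch.Tensor) -> torch.Tensor:
    diff = correlation_matrix(source_feats) - correlation_matrix(target_feats)
    return (diff ** 2).mean()

# Example: add the constraint to a task loss during adaptive tuning.
src = torch.randn(32, 256)
tgt = torch.randn(32, 256)
loss = structure_matching_loss(src, tgt)
```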
2023
To Copy Rather Than Memorize: A Vertical Learning Paradigm for Knowledge Graph Completion
Rui Li | Xu Chen | Chaozhuo Li | Yanming Shen | Jianan Zhao | Yujing Wang | Weihao Han | Hao Sun | Weiwei Deng | Qi Zhang | Xing Xie
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Embedding models have shown great power on the knowledge graph completion (KGC) task. By learning structural constraints for each training triple, these methods implicitly memorize intrinsic relation rules to infer missing links. However, this paper points out that multi-hop relation rules are hard to memorize reliably due to the inherent deficiencies of such an implicit memorization strategy, making embedding models underperform in predicting links between distant entity pairs. To alleviate this problem, we present the Vertical Learning Paradigm (VLP), which extends embedding models by allowing them to explicitly copy target information from related factual triples for more accurate prediction. Rather than solely relying on implicit memory, VLP directly provides additional cues to improve the generalization ability of embedding models, especially making distant link prediction significantly easier. Moreover, we also propose a novel relative-distance-based negative sampling technique (ReD) for more effective optimization. Experiments demonstrate the validity and generality of our proposals on two standard benchmarks. Our code is available at https://github.com/rui9812/VLP.
Learning More from Mixed Emotions: A Label Refinement Method for Emotion Recognition in Conversations
Jintao Wen | Geng Tu | Rui Li | Dazhi Jiang | Wenhua Zhu
Transactions of the Association for Computational Linguistics, Volume 11
One-hot labels are commonly employed as ground truth in Emotion Recognition in Conversations (ERC). However, this approach may not fully encompass all the emotions conveyed in a single utterance, leading to suboptimal performance. Regrettably, current ERC datasets lack comprehensive emotionally distributed labels. To address this issue, we propose the Emotion Label Refinement (EmoLR) method, which utilizes context- and speaker-sensitive information to infer mixed emotional labels. EmoLR comprises an Emotion Predictor (EP) module and a Label Refinement (LR) module. The EP module recognizes emotions and provides context/speaker states for the LR module. Subsequently, the LR module calculates the similarity between these states and ground-truth labels, generating a refined label distribution (RLD). The RLD captures a more comprehensive range of emotions than the original one-hot labels. These refined labels are then used for model training in place of the one-hot labels. Experimental results on three public conversational datasets demonstrate that our EmoLR achieves state-of-the-art performance.
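A hedged sketch of the label-refinement step: mix the one-hot label with a similarity-based distribution over emotion label embeddings to obtain a soft training target. The embedding sources and the mixing weight are assumptions for illustration, not EmoLR's exact computation.

```python
# Illustrative sketch of refining a one-hot emotion label into a distribution.
import torch
import torch.nn.functional as F

emotions = ["happy", "sad", "angry", "neutral"]
label_embeds = torch.randn(len(emotions), 128)   # placeholder label embeddings

def refine_label(state: torch.Tensor, gold_index: int, alpha: float = 0.7):
    """state: (128,) context/speaker state for the utterance."""
    sims = F.cosine_similarity(state.unsqueeze(0), label_embeds, dim=1)
    soft = F.softmax(sims, dim=0)                # similarity-based distribution
    one_hot = F.one_hot(torch.tensor(gold_index), len(emotions)).float()
    return alpha * one_hot + (1 - alpha) * soft  # refined label distribution

state = torch.randn(128)
rld = refine_label(state, gold_index=1)
# Train against the refined distribution instead of the hard label, e.g.:
# loss = F.kl_div(log_probs, rld, reduction="batchmean")
```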
2022
Asymmetric Mutual Learning for Multi-source Unsupervised Sentiment Adaptation with Dynamic Feature Network
Rui Li | Cheng Liu | Dazhi Jiang
Proceedings of the 29th International Conference on Computational Linguistics
Recently, fine-tuning a pre-trained language model (PrLM) on labeled sentiment datasets has demonstrated impressive performance. However, collecting labeled sentiment datasets is time-consuming, and fine-tuning the whole PrLM brings considerable computation cost. To this end, we focus on the multi-source unsupervised sentiment adaptation problem with pre-trained features, which is more practical and challenging. We first design a dynamic feature network to fully exploit the extracted pre-trained features for efficient domain adaptation. Meanwhile, unlike traditional source-target domain alignment methods, we propose a novel asymmetric mutual learning strategy, which can robustly estimate the pseudo-labels of the target domain with knowledge from all the other source models. Experiments on multiple sentiment benchmarks show that our method outperforms recent state-of-the-art approaches, and we also conduct extensive ablation studies to verify the effectiveness of each proposed module.
2021
Treasures Outside Contexts: Improving Event Detection via Global Statistics
Rui Li | Wenlin Zhao | Cheng Yang | Sen Su
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Event detection (ED) aims at identifying event instances of specified types in given texts and has been formalized as a sequence labeling task. As far as we know, existing neural ED models make decisions relying entirely on the contextual semantic features of each word in the input text, which we find are easily confused by the varied contexts seen at test time. To this end, we introduce a set of statistical features derived from word-event co-occurrence frequencies over the entire training set to cooperate with contextual features. Specifically, we propose a Semantic and Statistic-Joint Discriminative Network (SS-JDN) consisting of a semantic feature extractor, a statistical feature extractor, and a joint event discriminator. In experiments, SS-JDN effectively exceeds ten recent strong baselines on the ACE2005 and KBP2015 datasets. Further, we perform extensive experiments to comprehensively probe SS-JDN.
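The statistical side of this idea can be sketched as a word-event co-occurrence table built from the training set, whose normalized counts serve as per-word features alongside contextual ones; the toy data and feature format below are illustrative assumptions, not SS-JDN's actual extractor.

```python
# Illustrative sketch: word-event co-occurrence statistics from training data.
from collections import Counter, defaultdict

# Hypothetical training data: (tokens, event_type) pairs.
train = [
    (["troops", "attacked", "the", "village"], "Conflict.Attack"),
    (["the", "company", "hired", "new", "staff"], "Personnel.Start-Position"),
]

cooc = defaultdict(Counter)
for tokens, event in train:
    for tok in tokens:
        cooc[tok][event] += 1

def event_distribution(word: str) -> dict:
    """Normalized co-occurrence counts, usable as a statistical feature vector."""
    counts = cooc.get(word)
    if not counts:
        return {}
    total = sum(counts.values())
    return {event: c / total for event, c in counts.items()}

print(event_distribution("attacked"))   # {'Conflict.Attack': 1.0}
```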
2016
Multi-Granularity Chinese Word Embedding
Rongchao Yin | Quan Wang | Peng Li | Rui Li | Bin Wang
Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing
2012
Annotation Schemes to Encode Domain Knowledge in Medical Narratives
Wilson McCoy | Cecilia Ovesdotter Alm | Cara Calvelli | Rui Li | Jeff B. Pelz | Pengcheng Shi | Anne Haake
Proceedings of the Sixth Linguistic Annotation Workshop