2025
pdf
bib
abs
UAlign: Leveraging Uncertainty Estimations for Factuality Alignment on Large Language Models
Boyang Xue
|
Fei Mi
|
Qi Zhu
|
Hongru Wang
|
Rui Wang
|
Sheng Wang
|
Erxin Yu
|
Xuming Hu
|
Kam-Fai Wong
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite demonstrating impressive capabilities, Large Language Models (LLMs) still often struggle to accurately express the factual knowledge they possess, especially in cases where the LLMs’ knowledge boundaries are ambiguous. To improve LLMs’ factual expressions, we propose the UAlign framework, which leverages Uncertainty estimations to represent knowledge boundaries, and then explicitly incorporates these representations as input features into prompts for LLMs to Align with factual knowledge. First, we prepare the dataset on knowledge question-answering (QA) samples by calculating two uncertainty estimations, including confidence score and semantic entropy, to represent the knowledge boundaries for LLMs. Subsequently, using the prepared dataset, we train a reward model that incorporates uncertainty estimations and then employ the Proximal Policy Optimization (PPO) algorithm for factuality alignment on LLMs. Experimental results indicate that, by integrating uncertainty representations in LLM alignment, the proposed UAlign can significantly enhance the LLMs’ capacities to confidently answer known questions and refuse unknown questions on both in-domain and out-of-domain tasks, showing reliability improvements and good generalizability over various prompt- and training-based baselines.
pdf
bib
abs
Can LLMs Evaluate Complex Attribution in QA? Automatic Benchmarking using Knowledge Graphs
Nan Hu
|
Jiaoyan Chen
|
Yike Wu
|
Guilin Qi
|
Hongru Wang
|
Sheng Bi
|
Yongrui Chen
|
Tongtong Wu
|
Jeff Z. Pan
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Attributed Question Answering (AQA) has attracted wide attention, but there are still several limitations in evaluating the attributions, including lacking fine-grained attribution categories, relying on manual annotations, and failing to compare attributions with only subtle differences. To bridge these gaps, we introduce Complex Attributed Question Answering (CAQA), a large-scale benchmark containing comprehensive attribution categories, automatically generated using Knowledge Graphs (KGs), and complex attribution scenarios. We have conducted extensive experiments to verify the effectiveness of CAQA, including the benchmarking of 25 automatic evaluators, their comparison with human evaluators, the testing of LLM evaluators fine-tuned by CAQA and so on. These experiments also lead to a series of important findings that can benefit the future research of AQA.
pdf
bib
abs
ImPart: Importance-Aware Delta-Sparsification for Improved Model Compression and Merging in LLMs
Yan Yang
|
Yixia Li
|
Hongru Wang
|
Xuetao Wei
|
James Jianqiao Yu
|
Yun Chen
|
Guanhua Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the proliferation of task-specific large language models, delta compression has emerged as a method to mitigate the resource challenges of deploying numerous such models by effectively compressing the delta model parameters. Previous delta-sparsification methods either remove parameters randomly or truncate singular vectors directly after singular value decomposition (SVD). However, these methods either disregard parameter importance entirely or evaluate it with too coarse a granularity. In this work, we introduce ImPart, a novel importance-aware delta sparsification approach. Leveraging SVD, it dynamically adjusts sparsity ratios of different singular vectors based on their importance, effectively retaining crucial task-specific knowledge even at high sparsity ratios. Experiments show that ImPart achieves state-of-the-art delta sparsification performance, demonstrating 2× higher compression ratio than baselines at the same performance level. When integrated with existing methods, ImPart sets a new state-of-the-art on delta quantization and model merging.
pdf
bib
abs
NILE: Internal Consistency Alignment in Large Language Models
Minda Hu
|
Qiyuan Zhang
|
Yufei Wang
|
Bowei He
|
Hongru Wang
|
Jingyan Zhou
|
Liangyou Li
|
Yasheng Wang
|
Chen Ma
|
Irwin King
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Recent advances show that the world knowledge in the Instruction Fine-Tuning (IFT) dataset, which is incompatible with LLMs’ internal knowledge, can greatly hurt the IFT performance. However, the effective integration and balancing of the internal knowledge of LLMs, acquired during pre-training, with existing IFT datasets remains a largely underexplored area of research. To address this gap, this work introduces NILE, a novel framework to optimize the effectiveness of IFT by adjusting IFT datasets through carefully aligning the world and internal knowledge. NILE employs a three-stage pipeline to effectively quantify and adjust consistency with the internal knowledge of target LLMs. Our analysis provides compelling evidence that balancing such consistency with pre-trained internal knowledge is pivotal for unleashing LLM potential, and confirms that NILE can systematically contribute to these substantial performance improvements. Experimental results demonstrate that NILE-aligned IFT datasets sharply boost LLM performance across multiple LLM ability evaluation datasets, achieving up to 66.6% gain on Arena-Hard and 68.5% on Alpaca-Eval V2.
pdf
bib
abs
MlingConf: A Comprehensive Study of Multilingual Confidence Estimation on Large Language Models
Boyang Xue
|
Hongru Wang
|
Rui Wang
|
Sheng Wang
|
Zezhong Wang
|
Yiming Du
|
Bin Liang
|
Wenxuan Zhang
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025
The tendency of Large Language Models (LLMs) to generate hallucinations raises concerns regarding their reliability. Therefore, confidence estimations indicating the extent of trustworthiness of the generations become essential. However, current LLM confidence estimations in languages other than English remain underexplored. This paper addresses this gap by introducing a comprehensive investigation of Multilingual Confidence estimation (MlingConf) on LLMs, focusing on both language-agnostic (LA) and language-specific (LS) tasks to explore the performance and language dominance effects of multilingual confidence estimations on different tasks. The benchmark comprises four meticulously checked and human-evaluated high-quality multilingual datasets for LA tasks and one for the LS task tailored to specific social, cultural, and geographical contexts of a language. Our experiments reveal that on LA tasks English exhibits notable linguistic dominance in confidence estimations than other languages, while on LS tasks, using question-related language to prompt LLMs demonstrates better linguistic dominance in multilingual confidence estimations. The phenomena inspire a simple yet effective native-tone prompting strategy by employing language-specific prompts for LS tasks, effectively improving LLMs’ reliability and accuracy in LS scenarios.
pdf
bib
abs
SMART: Self-Aware Agent for Tool Overuse Mitigation
Cheng Qian
|
Emre Can Acikgoz
|
Hongru Wang
|
Xiusi Chen
|
Avirup Sil
|
Dilek Hakkani-Tür
|
Gokhan Tur
|
Heng Ji
Findings of the Association for Computational Linguistics: ACL 2025
Current Large Language Model (LLM) agents demonstrate strong reasoning and tool use capabilities, but often lack self-awareness, failing to balance these approaches effectively. This imbalance leads to **Tool Overuse**, where models unnecessarily rely on external tools for tasks solvable with parametric knowledge, increasing computational overhead. Inspired by human metacognition, we introduce **SMART** (Strategic Model-Aware Reasoning with Tools), a paradigm that enhances an agent’s self-awareness to optimize task handling and reduce tool overuse. To support this paradigm, we introduce **SMART-ER**, a dataset spanning three domains, where reasoning alternates between parametric knowledge and tool-dependent steps, with each step enriched by rationales explaining when tools are necessary. Through supervised training, we develop **SMARTAgent**, a family of models that dynamically balance parametric knowledge and tool use. Evaluations show that SMARTAgent reduces tool use by 24% while improving performance by over 37%, enabling 7B-scale models to match its 70B counterpart and GPT-4. Additionally, SMARTAgent generalizes to out-of-distribution test data like GSM8K and MINTQA, maintaining accuracy with just one-fifth the tool calls. These highlight the potential of strategic tool use to enhance reasoning, mitigate overuse, and bridge the gap between model size and performance, advancing intelligent and resource-efficient agent designs.
pdf
bib
abs
Rethinking Stateful Tool Use in Multi-Turn Dialogues: Benchmarks and Challenges
Hongru Wang
|
Wenyu Huang
|
Yufei Wang
|
Yuanhao Xi
|
Jianqiao Lu
|
Huan Zhang
|
Nan Hu
|
Zeming Liu
|
Jeff Z. Pan
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025
Existing benchmarks that assess Language Models (LMs) as Language Agents (LAs) for tool use primarily focus on stateless, single-turn interactions or partial evaluations, such as tool selection in a single turn, overlooking the inherent stateful nature of interactions in multi-turn applications. To fulfill this gap, we propose DialogTool, a multi-turn dialogue dataset with stateful tool interactions considering the whole life cycle of tool use, across six key tasks in three stages: 1) tool creation; 2) tool utilization: tool awareness, tool selection, tool execution; and 3) role-consistent response: response generation and role play. Furthermore, we build VirtualMobile – an embodied virtual mobile evaluation environment to simulate API calls and assess the robustness of the created APIs. Taking advantage of these artifacts, we conduct comprehensive evaluation on 13 distinct open- and closed-source LLMs and provide detailed analysis at each stage, revealing that the existing state-of-the-art LLMs still cannot perform well to use tools over long horizons .
pdf
bib
abs
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Hongru Wang
|
Deng Cai
|
Wanjun Zhong
|
Shijue Huang
|
Jeff Z. Pan
|
Zeming Liu
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025
Inference-time scaling has attracted much attention which significantly enhance the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition such as reflection and decomposition, being difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model’s initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings more improvements with more times of sampling during inference, such as absolute +7.89 average improvement with 64 sampling times, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline .
pdf
bib
abs
Self-Reasoning Language Models: Unfold Hidden Reasoning Chains with Few Reasoning Catalyst
Hongru Wang
|
Deng Cai
|
Wanjun Zhong
|
Shijue Huang
|
Jeff Z. Pan
|
Zeming Liu
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2025
Inference-time scaling has attracted much attention which significantly enhance the performance of Large Language Models (LLMs) in complex reasoning tasks by increasing the length of Chain-of-Thought. These longer intermediate reasoning rationales embody various meta-reasoning skills in human cognition such as reflection and decomposition, being difficult to create and acquire. In this work, we introduce Self-Reasoning Language Model (SRLM), where the model itself can synthesize longer CoT data and iteratively improve performance through self-training. By incorporating a few demonstration examples (i.e., 1,000 samples) on how to unfold hidden reasoning chains from existing responses, which act as a reasoning catalyst, we demonstrate that SRLM not only enhances the model’s initial performance but also ensures more stable and consistent improvements in subsequent iterations. Our proposed SRLM achieves an average absolute improvement of more than +2.5 points across five reasoning tasks: MMLU, GSM8K, ARC-C, HellaSwag, and BBH on two backbone models. Moreover, it brings more improvements with more times of sampling during inference, such as absolute +7.89 average improvement with 64 sampling times, revealing the in-depth, diverse and creative reasoning paths in SRLM against the strong baseline .
pdf
bib
abs
TransBench: Breaking Barriers for Transferable Graphical User Interface Agents in Dynamic Digital Environments
Yuheng Lu
|
Qian Yu
|
Hongru Wang
|
Zeming Liu
|
Wei Su
|
Yanping Liu
|
Yuhang Guo
|
Maocheng Liang
|
Yunhong Wang
|
Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2025
Graphical User Interface (GUI) agents, which autonomously operate on digital interfaces through natural language instructions, hold transformative potential for accessibility, automation, and user experience. A critical aspect of their functionality is grounding — the ability to map linguistic intents to visual and structural interface elements. However, existing GUI agents often struggle to adapt to the dynamic and interconnected nature of real-world digital environments, where tasks frequently span multiple platforms and applications while also being impacted by version updates. To address this, we introduce TransBench, the first benchmark designed to systematically evaluate and enhance the transferability of GUI agents across three key dimensions: cross-version transferability (adapting to version updates), cross-platform transferability (generalizing across platforms like iOS, Android, and Web), and cross-application transferability (handling tasks spanning functionally distinct apps). TransBench includes 15 app categories with diverse functionalities, capturing essential pages across versions and platforms to enable robust evaluation. Our experiments demonstrate significant improvements in grounding accuracy, showcasing the practical utility of GUI agents in dynamic, real-world environments. Our code and data will be publicly available at GitHub.
pdf
bib
abs
ToolSpectrum: Towards Personalized Tool Utilization for Large Language Models
Zihao Cheng
|
Hongru Wang
|
Zeming Liu
|
Yuhang Guo
|
Yuanfang Guo
|
Yunhong Wang
|
Haifeng Wang
Findings of the Association for Computational Linguistics: ACL 2025
While integrating external tools into large language models (LLMs) enhances their ability to access real-time information and domain-specific services, existing approaches focus narrowly on functional tool selection following user instructions while overlooking the critical role of context-aware personalization in tool selection. This oversight leads to suboptimal user satisfaction and inefficient tool utilization, particularly when overlapping toolsets require nuanced selection based on contextual factors. To bridge this gap, we introduce ToolSpectrum, a benchmark designed to evaluate LLMs’ capabilities in personalized tool utilization. Specifically, we formalize two key dimensions of personalization, user profile and environmental factors, and analyze their individual and synergistic impacts on tool selection. Through extensive experiments on ToolSpectrum, we demonstrate that personalized tool selection significantly improves user experience across diverse scenarios. However, even state-of-the-art LLMs exhibit the limited ability to reason jointly about user profiles and environmental factors, often prioritizing one dimension at the expense of the other. Our findings underscore the necessity of context-aware personalization in tool-augmented LLMs and reveal critical limitations for current models. Our data and code will be released soon.
pdf
bib
abs
ModelingAgent: Bridging LLMs and Mathematical Modeling for Real-World Challenges
Cheng Qian
|
Hongyi Du
|
Hongru Wang
|
Xiusi Chen
|
Yuji Zhang
|
Avirup Sil
|
ChengXiang Zhai
|
Kathleen McKeown
|
Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent progress in large language models (LLMs) has enabled substantial advances in solving mathematical problems. However, existing benchmarks often fail to reflect real-world complexity, which demand open-ended, interdisciplinary reasoning and integration of computational tools. To address this gap, we introduce **ModelingBench**, a novel benchmark featuring real-world-inspired, open-ended problems from math modeling competitions across diverse domains, ranging from urban traffic optimization to ecosystem resource planning. These tasks require translating natural language into formal mathematical formulations, applying appropriate tools, and producing structured, defensible reports. ModelingBench supports multiple valid solutions, capturing the ambiguity and creativity of practical modeling. To solve these challenges, we present **ModelingAgent**, a multi-agent framework that coordinates tool use, supports structured workflows, and enables iterative self-refinement to generate well-grounded, creative solutions. Empirical results show that ModelingAgent substantially outperforms strong baselines and often produces solutions indistinguishable from those of human experts. Together, our work provides a comprehensive framework for evaluating and advancing real-world problem-solving in open-ended, interdisciplinary modeling challenges. All the codes are released for future research.
pdf
bib
abs
DecisionFlow: Advancing Large Language Model as Principled Decision Maker
Xiusi Chen
|
Shanyong Wang
|
Cheng Qian
|
Hongru Wang
|
Peixuan Han
|
Heng Ji
Findings of the Association for Computational Linguistics: EMNLP 2025
In high-stakes domains such as healthcare and finance, effective decision-making demands not just accurate outcomes but transparent and explainable reasoning. However, current language models often lack the structured deliberation needed for such tasks, instead generating decisions and justifications in a disconnected, post-hoc manner. To address this, we propose DecisionFlow, a novel decision modeling framework that guides models to reason over structured representations of actions, attributes, and constraints. Rather than predicting answers directly from prompts, DecisionFlow builds a semantically grounded decision space and infers a latent utility function to evaluate trade-offs in a transparent, utility-driven manner. This process produces decisions tightly coupled with interpretable rationales reflecting the model’s reasoning. Empirical results on two high-stakes benchmarks show that DecisionFlow not only achieves up to 30% accuracy gains over strong prompting baselines but also enhances alignment in outcomes. Our work is a critical step toward integrating symbolic reasoning with LLMs, enabling more accountable, explainable, and reliable LLM decision support systems. Code and data are at https://github.com/xiusic/DecisionFlow.
pdf
bib
abs
SafeToolBench: Pioneering a Prospective Benchmark to Evaluating Tool Utilization Safety in LLMs
Hongfei Xia
|
Hongru Wang
|
Zeming Liu
|
Qian Yu
|
Yuhang Guo
|
Haifeng Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have exhibited great performance in autonomously calling various tools in external environments, leading to better problems solving and task automation capabilities. However, these external tools also amplify potential risks such as financial loss or privacy leaking with ambiguous or malicious user instructions. Compared to previous studies, which mainly assess the safety awareness of LLMs after obtaining the tool execution results (i.e., retrospective evaluation), this paper focuses on prospective ways to assess the safety of LLM tool utilization, aiming to avoid irreversible harm caused by directly executing tools. To this end, we propose SafeToolBench, the first benchmark to comprehensively assess tool utilization security in a prospective manner, covering malicious user instructions and diverse practical toolsets. Additionally, we propose a novel framework, SafeInstructTool, which aims to enhance LLMs’ awareness of tool utilization security through three perspectives (i.e., User Instruction, Tool Itself, and Joint Instruction-Tool), leading to nine detailed dimensions in total. We experiment with four LLMs using different methods, revealing that existing approaches fail to fully capture all risks in tool utilization. In contrast, our framework significantly enhances LLMs’ self-awareness, enabling a more safer and trustworthy tool utilization.
pdf
bib
abs
SeqAR: Jailbreak LLMs with Sequential Auto-Generated Characters
Yan Yang
|
Zeguan Xiao
|
Xin Lu
|
Hongru Wang
|
Xuetao Wei
|
Hailiang Huang
|
Guanhua Chen
|
Yun Chen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
The widespread applications of large language models (LLMs) have brought about concerns regarding their potential misuse. Although aligned with human preference data before release, LLMs remain vulnerable to various malicious attacks. In this paper, we adopt a red-teaming strategy to enhance LLM safety and introduce SeqAR, a simple yet effective framework to design jailbreak prompts automatically. The SeqAR framework generates and optimizes multiple jailbreak characters and then applies sequential jailbreak characters in a single query to bypass the guardrails of the target LLM. Different from previous work which relies on proprietary LLMs or seed jailbreak templates crafted by human expertise, SeqAR can generate and optimize the jailbreak prompt in a cold-start scenario using open-sourced LLMs without any seed jailbreak templates. Experimental results show that SeqAR achieves attack success rates of 88% and 60% in bypassing the safety alignment of GPT-3.5-1106 and GPT-4, respectively. Furthermore, we extensively evaluate the transferability of the generated templates across different LLMs and held-out malicious requests, while also exploring defense strategies against the jailbreak attack designed by SeqAR.
pdf
bib
abs
Steering Knowledge Selection Behaviours in LLMs via SAE-Based Representation Engineering
Yu Zhao
|
Alessio Devoto
|
Giwon Hong
|
Xiaotang Du
|
Aryo Pradipta Gema
|
Hongru Wang
|
Xuanli He
|
Kam-Fai Wong
|
Pasquale Minervini
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Large language models (LLMs) can store a significant amount of factual knowledge in their parameters. However, their parametric knowledge may conflict with the information provided in the context—this phenomenon, known as context-memory knowledge conflicts, can lead to undesirable model behaviour, such as reliance on outdated or incorrect information. Analysing the internal activations of LLMs, we find that they can internally register the signals of knowledge conflict at mid-layers. Such signals allow us to detect whether a knowledge conflict occurs and use inference-time intervention strategies to resolve it. In this work, we propose SpARE, a training-free representation engineering method that uses pre-trained sparse auto-encoders (SAEs) to control the knowledge selection behaviour of LLMs. SpARE identifies the functional features that control the knowledge selection behaviours and applies them to edit the internal activations of LLMs at inference time. Our experimental results show that SpARE can effectively control the usage of either knowledge source to resolve knowledge conflict in open-domain question-answering tasks, surpassing existing representation engineering methods (+10%) as well as contrastive decoding methods (+15%).
pdf
bib
abs
Self-DC: When to Reason and When to Act? Self Divide-and-Conquer for Compositional Unknown Questions
Hongru Wang
|
Boyang Xue
|
Baohang Zhou
|
Tianhua Zhang
|
Cunxiang Wang
|
Huimin Wang
|
Guanhua Chen
|
Kam-Fai Wong
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Previous research has typically concentrated on leveraging the internal knowledge of Large Language Models (LLMs) to answer known questions (i.e., internal reasoning such as generate-then-read). In contrast, for questions that fall outside their known scope, these models rely on external knowledge retrieval to provide accurate responses (i.e., external acting such as retrieve-then-read). However, few previous works consider the compositional questions, which consist of several known and unknown sub-questions, necessitating the dynamic combination of previous two methods (i.e., internal reasoning and external acting) to achieve a better trade-off between effectiveness and efficiency. To this end, we introduce a Self Divide-and-Conquer (Self-DC) framework, accompanying with the first Compositional unknown Question-Answering dataset (CuQA). This framework enables LLMs to adaptively choose between using internal knowledge and retrieving external knowledge as needed, resulting in a better trade-off between effectiveness and efficiency. Experimental results on two datasets demonstrate that Self-DC can achieve comparable or even better performance with much fewer external calls compared with several strong baselines.
2024
pdf
bib
abs
FollowBench: A Multi-level Fine-grained Constraints Following Benchmark for Large Language Models
Yuxin Jiang
|
Yufei Wang
|
Xingshan Zeng
|
Wanjun Zhong
|
Liangyou Li
|
Fei Mi
|
Lifeng Shang
|
Xin Jiang
|
Qun Liu
|
Wei Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The ability to follow instructions is crucial for Large Language Models (LLMs) to handle various real-world applications. Existing benchmarks primarily focus on evaluating pure response quality, rather than assessing whether the response follows constraints stated in the instruction. To fill this research gap, in this paper, we propose FollowBench, a Multi-level Fine-grained Constraints Following Benchmark for LLMs. FollowBench comprehensively includes five different types (i.e., Content, Situation, Style, Format, and Example) of fine-grained constraints. To enable a precise constraint following estimation on diverse difficulties, we introduce a Multi-level mechanism that incrementally adds a single constraint to the initial instruction at each increased level. To assess whether LLMs’ outputs have satisfied every individual constraint, we propose to prompt strong LLMs with constraint-evolution paths to handle challenging open-ended instructions. By evaluating 13 closed-source and open-source popular LLMs on FollowBench, we highlight the weaknesses of LLMs in instruction following and point towards potential avenues for future work. The data and code are publicly available at https://github.com/YJiangcm/FollowBench.
pdf
bib
abs
Learning to Edit: Aligning LLMs with Knowledge Editing
Yuxin Jiang
|
Yufei Wang
|
Chuhan Wu
|
Wanjun Zhong
|
Xingshan Zeng
|
Jiahui Gao
|
Liangyou Li
|
Xin Jiang
|
Lifeng Shang
|
Ruiming Tang
|
Qun Liu
|
Wei Wang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge editing techniques, aiming to efficiently modify a minor proportion of knowledge in large language models (LLMs) without negatively impacting performance across other inputs, have garnered widespread attention. However, existing methods predominantly rely on memorizing the updated knowledge, impeding LLMs from effectively combining the new knowledge with their inherent knowledge when answering questions. To this end, we propose a Learning to Edit (LTE) framework, focusing on teaching LLMs to apply updated knowledge into input questions, inspired by the philosophy of “Teach a man to fish.” LTE features a two-phase process: (i) the Alignment Phase, which fine-tunes LLMs on a meticulously curated parallel dataset to make reliable, in-scope edits while preserving out-of-scope information and linguistic proficiency; and (ii) the Inference Phase, which employs a retrieval-based mechanism for real-time and mass knowledge editing. By comparing our approach with seven advanced baselines across four popular knowledge editing benchmarks and two LLM architectures, we demonstrate LTE’s superiority in knowledge editing performance, robustness in both batch and sequential editing, minimal interference on general tasks, and rapid editing speeds. The data and code are publicly available at https://github.com/YJiangcm/LTE.
pdf
bib
abs
Knowledge Conflicts for LLMs: A Survey
Rongwu Xu
|
Zehan Qi
|
Zhijiang Guo
|
Cunxiang Wang
|
Hongru Wang
|
Yue Zhang
|
Wei Xu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
This survey provides an in-depth analysis of knowledge conflicts for large language models (LLMs), highlighting the complex challenges they encounter when blending contextual and parametric knowledge. Our focus is on three categories of knowledge conflicts: context-memory, inter-context, and intra-memory conflict. These conflicts can significantly impact the trustworthiness and performance of LLMs, especially in real-world applications where noise and misinformation are common. By categorizing these conflicts, exploring the causes, examining the behaviors of LLMs under such conflicts, and reviewing available solutions, this survey aims to shed light on strategies for improving the robustness of LLMs, thereby serving as a valuable resource for advancing research in this evolving area.
pdf
bib
abs
VLEU: a Method for Automatic Evaluation for Generalizability of Text-to-Image Models
Jingtao Cao
|
Zhang Zheng
|
Hongru Wang
|
Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Progress in Text-to-Image (T2I) models has significantly advanced the generation of images from textual descriptions. Existing metrics, such as CLIP, effectively measure the semantic alignment between single prompts and their corresponding images. However, they fall short in evaluating a model’s ability to generalize across a broad spectrum of textual inputs. To address this gap, we propose the VLEU (
Visual
Language
Evaluation
Understudy) metric. VLEU leverages the power of Large Language Models (LLMs) to sample from the visual text domain, encompassing the entire range of potential inputs for the T2I task, to generate a wide variety of visual text. The images generated by T2I models from these prompts are then assessed for their alignment with the input text using the CLIP model. VLEU quantitatively measures a model’s generalizability by computing the Kullback-Leibler (KL) divergence between the visual text marginal distribution and the conditional distribution over the images generated by the model. This provides a comprehensive metric for comparing the overall generalizability of T2I models, beyond single-prompt evaluations, and offers valuable insights during the finetuning process. Our experimental results demonstrate VLEU’s effectiveness in evaluating the generalizability of various T2I models, positioning it as an essential metric for future research and development in image synthesis from text prompts. Our code and data will be publicly available at
https://github.com/mio7690/VLEU.
pdf
bib
abs
AppBench: Planning of Multiple APIs from Various APPs for Complex User Instruction
Hongru Wang
|
Rui Wang
|
Boyang Xue
|
Heming Xia
|
Jingtao Cao
|
Zeming Liu
|
Jeff Z. Pan
|
Kam-Fai Wong
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large Language Models (LLMs) can interact with the real world by connecting with versatile external APIs, resulting in better problem-solving and task automation capabilities. Previous research primarily either focuses on APIs with limited arguments from a single source or overlooks the complex dependency relationship between different APIs. However, it is essential to utilize multiple APIs collaboratively from various sources, especially for complex user instructions. In this paper, we introduce MetaBench, the first benchmark to evaluate LLMs’ ability to plan and execute multiple APIs from various sources in order to complete the user’s task. Specifically, we consider two significant challenges in multiple APIs: 1) graph structures: some APIs can be executed independently while others need to be executed one by one, resulting in graph-like execution order; and 2) permission constraints: which source is authorized to execute the API call. We have experimental results on 9 distinct LLMs; e.g., GPT-4o achieves only a 2.0% success rate at the most complex instruction, revealing that the existing state-of-the-art LLMs still cannot perform well in this situation even with the help of in-context learning and finetuning. Our code and data are publicly available at
https://github.com/ruleGreen/AppBench.
pdf
bib
abs
Role Prompting Guided Domain Adaptation with General Capability Preserve for Large Language Models
Rui Wang
|
Fei Mi
|
Yi Chen
|
Boyang Xue
|
Hongru Wang
|
Qi Zhu
|
Kam-Fai Wong
|
Ruifeng Xu
Findings of the Association for Computational Linguistics: NAACL 2024
The growing interest in Large Language Models (LLMs) for specialized applications has revealed a significant challenge: when tailored to specific domains, LLMs tend to experience catastrophic forgetting, compromising their general capabilities and leading to a suboptimal user experience. Additionally, crafting a versatile model for multiple domains simultaneously often results in a decline in overall performance due to confusion between domains. In response to these issues, we present the RolE Prompting Guided Multi-Domain Adaptation (REGA) strategy. This novel approach effectively manages multi-domain LLM adaptation through three key components: 1) Self-Distillation constructs and replays general-domain exemplars to alleviate catastrophic forgetting. 2) Role Prompting assigns a central prompt to the general domain and a unique role prompt to each specific domain to minimize inter-domain confusion during training. 3) Role Integration reuses and integrates a small portion of domain-specific data to the general-domain data, which are trained under the guidance of the central prompt. The central prompt is used for a streamlined inference process, removing the necessity to switch prompts for different domains.Empirical results demonstrate that REGA effectively alleviates catastrophic forgetting and inter-domain confusion. This leads to improved domain-specific performance compared to standard fine-tuned models, while still preserving robust general capabilities.
pdf
bib
abs
AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models
Wanjun Zhong
|
Ruixiang Cui
|
Yiduo Guo
|
Yaobo Liang
|
Shuai Lu
|
Yanlin Wang
|
Amin Saied
|
Weizhu Chen
|
Nan Duan
Findings of the Association for Computational Linguistics: NAACL 2024
Assessing foundation models’ abilities for human-level tasks is crucial for Artificial General Intelligence (AGI) development.Traditional benchmarks, which rely on artificial datasets, may not accurately represent these capabilities. In this paper, we introduce AGIEval, a novel bilingual benchmark designed to assess foundation models in the context of human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. We evaluate several state-of-the-art foundation models on our benchmark. Impressively, we show that GPT-4 exceeds the average human performance in SAT, LSAT, and math contests, with 95% accuracy on SAT Math and 92.5% on the Chinese college entrance English exam. This demonstrates the exceptional performance of contemporary foundation models. In contrast, we also find that GPT-4 is less proficient in tasks requiring complex reasoning or specific domain knowledge. Our comprehensive analyses of model capabilities (understanding, knowledge, reasoning, and calculation) reveal their strengths and limitations, providing valuable insights into future directions for enhancing general capabilities. By concentrating on tasks pertinent to human cognition and decision-making, our benchmark delivers a meaningful and robust evaluation of foundation models’ performance in real-world scenarios.
pdf
bib
abs
DPDLLM: A Black-box Framework for Detecting Pre-training Data from Large Language Models
Baohang Zhou
|
Zezhong Wang
|
Lingzhi Wang
|
Hongru Wang
|
Ying Zhang
|
Kehui Song
|
Xuhui Sui
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2024
The success of large language models (LLM) benefits from large-scale model parameters and large amounts of pre-training data. However, the textual data for training LLM can not be confirmed to be legal because they are crawled from different web sites. For example, there are copyrighted articles, personal reviews and information in the pre-training data for LLM which are illegal. To address the above issue and develop legal LLM, we propose to detect the pre-training data from LLM in a pure black-box way because the existing LLM services only return the generated text. The previous most related works are the membership inference attack (MIA) on machine learning models to detect the training data from them. But the existing methods are based on analyzing the output probabilities of models which are unrealistic to LLM services. To tackle the problem, we firstly construct the benchmark datasets by collecting textual data from different domains as the seen and unseen pre-training data for LLMs. Then, we investigate a black-box framework named DPDLLM, with the only access to the generated texts from LLM for detecting textual data whether was used to train it. In the proposed framework, we exploit GPT-2 as the reference model to fit the textual data and feed the generated text from LLM into it to acquire sequence probabilities as the significant feature for detection. The experimental results on the benchmark datasets demonstrate that DPDLLM is effective on different popular LLMs and outperforms the existing methods.
pdf
bib
abs
Medical Dialogue System: A Survey of Categories, Methods, Evaluation and Challenges
Xiaoming Shi
|
Zeming Liu
|
Li Du
|
Yuxuan Wang
|
Hongru Wang
|
Yuhang Guo
|
Tong Ruan
|
Jie Xu
|
Xiaofan Zhang
|
Shaoting Zhang
Findings of the Association for Computational Linguistics: ACL 2024
This paper surveys and organizes research works of medical dialog systems, which is an important yet challenging task. Although these systems have been surveyed in the medical community from an application perspective, a systematic review from a rigorous technical perspective has to date remained noticeably absent. As a result, an overview of the categories, methods, evaluation of medical dialogue systems remain limited and underspecified, hindering the further improvement of this area. To fill this gap, we investigate an initial pool of 325 papers from well-known computer science, natural language processing conferences and journals, and make an overview. Recently, large language models have shown strong model capacity on downstream tasks, which also reshape medical dialog systems’ foundation.Despite the alluring practical application value, current medical dialogue systems still suffer from problems. To this end, this paper lists grand challenges of medical dialog systems, especially of large language models.
pdf
bib
abs
Planning, Creation, Usage: Benchmarking LLMs for Comprehensive Tool Utilization in Real-World Complex Scenarios
Shijue Huang
|
Wanjun Zhong
|
Jianqiao Lu
|
Qi Zhu
|
Jiahui Gao
|
Weiwen Liu
|
Yutai Hou
|
Xingshan Zeng
|
Yasheng Wang
|
Lifeng Shang
|
Xin Jiang
|
Ruifeng Xu
|
Qun Liu
Findings of the Association for Computational Linguistics: ACL 2024
The recent trend of using Large Language Models (LLMs) as tool agents in real-world applications underscores the necessity for comprehensive evaluations of their capabilities, particularly in complex scenarios involving planning, creating, and using tools. However, existing benchmarks typically focus on simple synthesized queries that do not reflect real-world complexity, thereby offering limited perspectives in evaluating tool utilization. To address this issue, we present UltraTool, a novel benchmark designed to improve and evaluate LLMs’ ability in tool utilization within real-world scenarios. UltraTool focuses on the entire process of using tools - from planning and creating to applying them in complex tasks. It emphasizes real-world complexities, demanding accurate, multi-step planning for effective problem-solving. A key feature of UltraTool is its independent evaluation of planning with natural language, which happens before tool usage and simplifies the task solving by mapping out the intermediate steps. Thus, unlike previous work, it eliminates the restriction of pre-defined toolset. Through extensive experiments on various LLMs, we offer novel insights into the evaluation of capabilities of LLMs in tool utilization, thereby contributing a fresh perspective to this rapidly evolving field. The benchmark is publicly available at https://github.com/JoeYing1019/UltraTool.
pdf
bib
abs
Abstract Meaning Representation-Based Logic-Driven Data Augmentation for Logical Reasoning
Qiming Bao
|
Alex Yuxuan Peng
|
Zhenyun Deng
|
Wanjun Zhong
|
Gaël Gendron
|
Timothy Pistotti
|
Neşet Tan
|
Nathan Young
|
Yang Chen
|
Yonghua Zhu
|
Paul Denny
|
Michael Witbrock
|
Jiamou Liu
Findings of the Association for Computational Linguistics: ACL 2024
Combining large language models with logical reasoning enhances their capacity to address problems in a robust and reliable manner. Nevertheless, the intricate nature of logical reasoning poses challenges when gathering reliable data from the web to build comprehensive training datasets, subsequently affecting performance on downstream tasks. To address this, we introduce a novel logic-driven data augmentation approach, AMR-LDA. AMR-LDA converts the original text into an Abstract Meaning Representation (AMR) graph, a structured semantic representation that encapsulates the logical structure of the sentence, upon which operations are performed to generate logically modified AMR graphs. The modified AMR graphs are subsequently converted back into text to create augmented data. Notably, our methodology is architecture-agnostic and enhances both generative large language models, such as GPT-3.5 and GPT-4, through prompt augmentation, and discriminative large language models through contrastive learning with logic-driven data augmentation. Empirical evidence underscores the efficacy of our proposed method with improvement in performance across seven downstream tasks, such as reading comprehension requiring logical reasoning, textual entailment, and natural language inference. Furthermore, our method leads on the ReClor leaderboard at https://eval.ai/web/challenges/challenge-page/503/leaderboard/1347. The source code and data are publicly available at https://github.com/Strong-AI-Lab/Logical-Equivalence-driven-AMR-Data-Augmentation-for-Representation-Learning.
pdf
bib
abs
Concise and Precise Context Compression for Tool-Using Language Models
Yang Xu
|
Yunlong Feng
|
Honglin Mu
|
Yutai Hou
|
Yitong Li
|
Xinghao Wang
|
Wanjun Zhong
|
Zhongyang Li
|
Dandan Tu
|
Qingfu Zhu
|
Min Zhang
|
Wanxiang Che
Findings of the Association for Computational Linguistics: ACL 2024
Through reading the documentation in the context, tool-using language models can dynamically extend their capability using external tools. The cost is that we have to input lengthy documentation every time the model needs to use the tool, occupying the input window as well as slowing down the decoding process.Given the progress in general-purpose compression, soft context compression is a suitable approach to alleviate the problem. However, when compressing tool documentation, existing methods suffer from the weaknesses of key information loss (specifically, tool/parameter name errors) and difficulty in adjusting the length of compressed sequences based on documentation lengths.To address these problems, we propose two strategies for compressing tool documentation into concise and precise summary sequences for tool-using language models. 1) Selective compression strategy mitigates key information loss by deliberately retaining key information as raw text tokens. 2) Block compression strategy involves dividing tool documentation into short chunks and then employing a fixed-length compression model to achieve variable-length compression. This strategy facilitates the flexible adjustment of the compression ratio.Results on API-Bank and APIBench show that our approach reaches a performance comparable to the upper-bound baseline under up to 16x compression ratio.
pdf
bib
abs
SeRTS: Self-Rewarding Tree Search for Biomedical Retrieval-Augmented Generation
Minda Hu
|
Licheng Zong
|
Hongru Wang
|
Jingyan Zhou
|
Jingjing Li
|
Yichen Gao
|
Kam-Fai Wong
|
Yu Li
|
Irwin King
Findings of the Association for Computational Linguistics: EMNLP 2024
Large Language Models (LLMs) have shown great potential in the biomedical domain with the advancement of retrieval-augmented generation (RAG). However, existing retrieval-augmented approaches face challenges in addressing diverse queries and documents, particularly for medical knowledge queries, resulting in sub-optimal performance. To address these limitations, we propose a novel plug-and-play LLM-based retrieval method called Self-Rewarding Tree Search (SeRTS) based on Monte Carlo Tree Search (MCTS) and a self-rewarding paradigm. By combining the reasoning capabilities of LLMs with the effectiveness of tree search, SeRTS boosts the zero-shot performance of retrieving high-quality and informative results for RAG. We further enhance retrieval performance by fine-tuning LLMs with Proximal Policy Optimization (PPO) objectives using the trajectories collected by SeRTS as feedback. Controlled experiments using the BioASQ-QA dataset with GPT-3.5-Turbo and LLama2-7b demonstrate that our method significantly improves the performance of the BM25 retriever and surpasses the strong baseline of self-reflection in both efficiency and scalability. Moreover, SeRTS generates higher-quality feedback for PPO training than self-reflection. Our proposed method effectively adapts LLMs to document retrieval tasks, enhancing their ability to retrieve highly relevant documents for RAG in the context of medical knowledge queries. This work presents a significant step forward in leveraging LLMs for accurate and comprehensive biomedical question answering.
pdf
bib
abs
CLongEval: A Chinese Benchmark for Evaluating Long-Context Large Language Models
Zexuan Qiu
|
Jingjing Li
|
Shijue Huang
|
Xiaoqi Jiao
|
Wanjun Zhong
|
Irwin King
Findings of the Association for Computational Linguistics: EMNLP 2024
Developing Large Language Models (LLMs) with robust long-context capabilities has been the recent research focus, resulting in the emergence of long-context LLMs proficient in Chinese. However, the evaluation of these models remains underdeveloped due to a lack of benchmarks. To address this gap, we present CLongEval, a comprehensive Chinese benchmark for evaluating long-context LLMs. CLongEval is characterized by three key features: (1) Sufficient data volume, comprising 7 distinct tasks and 7,267 examples; (2) Broad applicability, accommodating to models with context windows size from 1K to 100K; (3) High quality, with over 2,000 manually annotated question-answer pairs in addition to the automatically constructed labels. With CLongEval, we undertake a comprehensive assessment of 6 open-source long-context LLMs and 2 leading commercial counterparts that feature both long-context abilities and proficiency in Chinese. We also provide in-depth analysis based on the empirical results, trying to shed light on the critical capabilities that present challenges in long-context settings. The dataset, evaluation scripts, and model outputs will be released.
pdf
bib
abs
Less is More: Making Smaller Language Models Competent Subgraph Retrievers for Multi-hop KGQA
Wenyu Huang
|
Guancheng Zhou
|
Hongru Wang
|
Pavlos Vougiouklis
|
Mirella Lapata
|
Jeff Z. Pan
Findings of the Association for Computational Linguistics: EMNLP 2024
Retrieval-Augmented Generation (RAG) is widely used to inject external non-parametric knowledge into large language models (LLMs). Recent works suggest that Knowledge Graphs (KGs) contain valuable external knowledge for LLMs. Retrieving information from KGs differs from extracting it from document sets. Most existing approaches seek to directly retrieve relevant subgraphs, thereby eliminating the need for extensive SPARQL annotations, traditionally required by semantic parsing methods. In this paper, we model the subgraph retrieval task as a conditional generation task handled by small language models. Specifically, we define a subgraph identifier as a sequence of relations, each represented as a special token stored in the language models. Our base generative subgraph retrieval model, consisting of only 220M parameters, achieves competitive retrieval performance compared to state-of-the-art models relying on 7B parameters, demonstrating that small language models are capable of performing the subgraph retrieval task. Furthermore, our largest 3B model, when plugged with an LLM reader, sets new SOTA end-to-end performance on both the WebQSP and CWQ benchmarks. Our model and data will be made available online: https://github.com/hwy9855/GSR.
pdf
bib
abs
JoTR: A Joint Transformer and Reinforcement Learning Framework for Dialogue Policy Learning
Wai-Chung Kwan
|
Huimin Wang
|
Hongru Wang
|
Zezhong Wang
|
Bin Liang
|
Xian Wu
|
Yefeng Zheng
|
Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Dialogue policy learning (DPL) aims to determine an abstract representation (also known as action) to guide what the response should be. Typically, DPL is cast as a sequential decision problem across a series of predefined action candidates. However, such static and narrow actions can limit response diversity and impede the dialogue agent’s adaptability to new scenarios and edge cases. To overcome these challenges, we introduce a novel Joint Transformer Reinforcement Learning framework, coined as JoTR, where a text-to-text Transformer-based model is employed to directly generate dialogue actions. More concretely, JoTR formulates a token-grained policy, facilitating more dynamic and adaptable dialogue action generation without the need for predefined action candidates. This method not only enhances the diversity of responses but also significantly improves the system’s capability to manage unfamiliar scenarios. Furthermore, JoTR utilizes Reinforcement Learning with a reward-shaping mechanism to efficiently fine-tune the token-grained policy. This allows the model to evolve through interactions, thereby enhancing its performance over time. Our extensive evaluation demonstrates that JoTR surpasses previous state-of-the-art models, showing improvements of 9% and 13% in success rate, and 34% and 37% in the diversity of dialogue actions across two benchmark dialogue modeling tasks respectively. These results have been validated by both user simulators and human evaluators. Code and data are available at ://github.com/KwanWaiChung/JoTR.
pdf
bib
abs
MCIL: Multimodal Counterfactual Instance Learning for Low-resource Entity-based Multimodal Information Extraction
Baohang Zhou
|
Ying Zhang
|
Kehui Song
|
Hongru Wang
|
Yu Zhao
|
Xuhui Sui
|
Xiaojie Yuan
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Multimodal information extraction (MIE) is a challenging task which aims to extract the structural information in free text coupled with the image for constructing the multimodal knowledge graph. The entity-based MIE tasks are based on the entity information to complete the specific tasks. However, the existing methods only investigated the entity-based MIE tasks under supervised learning with adequate labeled data. In the real-world scenario, collecting enough data and annotating the entity-based samples are time-consuming, and impractical. Therefore, we propose to investigate the entity-based MIE tasks under the low-resource settings. The conventional models are prone to overfitting on limited labeled data, which can result in poor performance. This is because the models tend to learn the bias existing in the limited samples, which can lead them to model the spurious correlations between multimodal features and task labels. To provide a more comprehensive understanding of the bias inherent in multimodal features of MIE samples, we decompose the features into image, entity, and context factors. Furthermore, we investigate the causal relationships between these factors and model performance, leveraging the structural causal model to delve into the correlations between the input features and output labels. Based on this, we propose the multimodal counterfactual instance learning framework to generate the counterfactual instances by the interventions on the limited observational samples. In the framework, we analyze the causal effect of the counterfactual instances and exploit it as a supervisory signal to maximize the effect for reducing the bias and improving the generalization of the model. Empirically, we evaluate the proposed method on the two public MIE benchmark datasets and the experimental results verify the effectiveness of it.
pdf
bib
abs
UniRetriever: Multi-task Candidates Selection for Various Context-Adaptive Conversational Retrieval
Hongru Wang
|
Boyang Xue
|
Baohang Zhou
|
Rui Wang
|
Fei Mi
|
Weichao Wang
|
Yasheng Wang
|
Kam-Fai Wong
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Conversational retrieval refers to an information retrieval system that operates in an iterative and interactive manner, requiring the retrieval of various external resources, such as persona, knowledge, and even response, to effectively engage with the user and successfully complete the dialogue. However, most previous work trained independent retrievers for each specific resource, resulting in sub-optimal performance and low efficiency. Thus, we propose a multi-task framework function as a universal retriever for three dominant retrieval tasks during the conversation: persona selection, knowledge selection, and response selection. To this end, we design a dual-encoder architecture consisting of a context-adaptive dialogue encoder and a candidate encoder, aiming to attention to the relevant context from the long dialogue and retrieve suitable candidates by simply a dot product. Furthermore, we introduce two loss constraints to capture the subtle relationship between dialogue context and different candidates by regarding historically selected candidates as hard negatives. Extensive experiments and analysis establish state-of-the-art retrieval quality both within and outside its training domain, revealing the promising potential and generalization capability of our model to serve as a universal retriever for different candidate selection tasks simultaneously.
pdf
bib
abs
SELF-GUARD: Empower the LLM to Safeguard Itself
Zezhong Wang
|
Fangkai Yang
|
Lu Wang
|
Pu Zhao
|
Hongru Wang
|
Liang Chen
|
Qingwei Lin
|
Kam-Fai Wong
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
With the increasing risk posed by jailbreak attacks, recent studies have investigated various methods to improve the safety of large language models (LLMs), mainly falling into two strategies: safety training and safeguards. Safety training involves fine-tuning the LLM with adversarial samples, which activate the LLM’s capabilities against jailbreak. However, it is not always effective in countering new attacks and often leads to potential performance degradation. Safeguards, on the other hand, are methods using additional models to filter harmful content from the LLM’s response. Nevertheless, they can only reduce a limited amount of harmful output and introduce extra computational costs. Given the distinct strengths and weaknesses of both, we combine them to balance out their flaws and propose a more effective method called Self-Guard.Specifically, we train the LLM to review its responses for any harmful content and append a [harmful] or [harmless] tag to the end of the response. In this way, Self-Guard possesses the advantages of safety training, leveraging the powerful capabilities of the LLMs themselves to detect harmfulness. Besides that, it gains flexibility like safeguards, making the safety check target the output side, which makes the system less vulnerable to attack updates. Experimental results indicate that our Self-Guard can effectively defend against jailbreak attacks and will not cause LLMs’ performance degradation.
pdf
bib
abs
Enhancing Large Language Models Against Inductive Instructions with Dual-critique Prompting
Rui Wang
|
Hongru Wang
|
Fei Mi
|
Boyang Xue
|
Yi Chen
|
Kam-Fai Wong
|
Ruifeng Xu
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Numerous works are proposed to align large language models (LLMs) with human intents to better fulfill instructions, ensuring they are trustful and helpful.Nevertheless, some human instructions are often malicious or misleading and following them will lead to untruthful and unsafe responses.Previous work rarely focused on understanding how LLMs manage instructions based on counterfactual premises, referred to here as inductive instructions, which may stem from users’ false beliefs or malicious intents.In this paper, we aim to reveal the behaviors of LLMs towards inductive instructions and enhance their truthfulness and helpfulness accordingly. Specifically, we first introduce a benchmark of Inductive Instructions (INDust), where the false knowledge is incorporated into instructions in multiple different styles. After extensive human and automatic evaluations, we uncovered a universal vulnerability among LLMs in processing inductive instructions.Additionally, we identified that different inductive styles affect the models’ ability to identify the same underlying errors,and the complexity of the underlying assumptions also influences the model’s performance.Motivated by these results, we propose Dual-critique prompting to improve LLM robustness against inductive instructions.Our experiments demonstrate that Dual-critique prompting significantly bolsters the robustness of a diverse array of LLMs, even when confronted with varying degrees of inductive instruction complexity and differing inductive styles.
pdf
bib
abs
Fine-tuning after Prompting: an Explainable Way for Classification
Zezhong Wang
|
Luyao Ye
|
Hongru Wang
|
Boyang Xue
|
Yiming Du
|
Bin Liang
|
Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
Prompting is an alternative approach for utilizing pre-trained language models (PLMs) in classification tasks. In contrast to fine-tuning, prompting is more understandable for humans because it utilizes natural language to interact with the PLM, but it often falls short in terms of accuracy. While current research primarily focuses on enhancing the performance of prompting methods to compete with fine-tuning, we believe that these two approaches are not mutually exclusive, each having its strengths and weaknesses. In our study, we depart from the competitive view of prompting versus fine-tuning and instead combine them, introducing a novel method called F&P. This approach enables us to harness the advantages of Fine-tuning for accuracy and the explainability of Prompting simultaneously. Specifically, we reformulate the sample into a prompt and subsequently fine-tune a linear classifier on top of the PLM. Following this, we extract verbalizers according to the weight of this classifier. During the inference phase, we reformulate the sample in the same way and query the PLM. The PLM generates a word, which is then subject to a dictionary lookup by the verbalizer to obtain the prediction. Experiments show that keeping only 30 keywords for each class can achieve comparable performance as fine-tuning. On the other hand, both the prompt and verbalizers are constructed in natural language, making them fully understandable to humans. Hence, the F&P method offers an effective and transparent way to employ a PLM for classification tasks.
pdf
bib
abs
PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering
Yiming Du
|
Hongru Wang
|
Zhengyi Zhao
|
Bin Liang
|
Baojun Wang
|
Wanjun Zhong
|
Zezhong Wang
|
Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
In conversational AI, effectively employing long-term memory improves personalized and consistent response generation. Existing work only concentrated on a single type of long-term memory, such as preferences, dialogue history, or social relationships, overlooking their interaction in real-world contexts. To this end, inspired by the concept of semantic memory and episodic memory from cognitive psychology, we create a new and more comprehensive Chinese dataset, coined as PerLTQA, in which world knowledge, profiles, social relationships, events, and dialogues are considered to leverage the interaction between different types of long-term memory for question answering (QA) in conversation. Further, based on PerLTQA, we propose a novel framework for memory integration in QA, consisting of three subtasks: Memory Classification, Memory Retrieval, and Memory Fusion, which provides a comprehensive paradigm for memory modeling, enabling consistent and personalized memory utilization. This essentially allows the exploitation of more accurate memory information for better responses in QA. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate the importance of personal long-term memory in the QA task
pdf
bib
abs
PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering
Yiming Du
|
Hongru Wang
|
Zhengyi Zhao
|
Bin Liang
|
Baojun Wang
|
Wanjun Zhong
|
Zezhong Wang
|
Kam-Fai Wong
Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10)
In conversational AI, effectively employing long-term memory improves personalized and consistent response generation. Existing work only concentrated on a single type of long-term memory, such as preferences, dialogue history, or social relationships, overlooking their interaction in real-world contexts. To this end, inspired by the concept of semantic memory and episodic memory from cognitive psychology, we create a new and more comprehensive Chinese dataset, coined as PerLTQA, in which world knowledge, profiles, social relationships, events, and dialogues are considered to leverage the interaction between different types of long-term memory for question answering (QA) in conversation. Further, based on PerLTQA, we propose a novel framework for memory integration in QA, consisting of three subtasks: Memory Classification, Memory Retrieval, and Memory Fusion, which provides a comprehensive paradigm for memory modeling, enabling consistent and personalized memory utilization. This essentially allows the exploitation of more accurate memory information for better responses in QA. We evaluate this framework using five LLMs and three retrievers. Experimental results demonstrate the importance of personal long-term memory in the QA task
2023
pdf
bib
abs
Retrieval-free Knowledge Injection through Multi-Document Traversal for Dialogue Models
Rui Wang
|
Jianzhu Bao
|
Fei Mi
|
Yi Chen
|
Hongru Wang
|
Yasheng Wang
|
Yitong Li
|
Lifeng Shang
|
Kam-Fai Wong
|
Ruifeng Xu
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dialogue models are often enriched with extensive external knowledge to provide informative responses through a retrieval-augmented pipeline. Nevertheless, retrieval-augmented approaches rely on finely annotated retrieval training data and knowledge-grounded response generation data, making it costly to transfer. To tackle this challenge, this paper proposed a retrieval-free approach, KiDG, by automatically turning knowledge documents into simulated multi-turn dialogues through a Multi-Document Traversal algorithm. The simulated knowledge-intensive dialogues constructed by KiDG in one domain can be easily used to train and enhance pre-trained dialogue models’ knowledge w.r.t. this domain without costly annotation. We conduct extensive experiments comparing retrieval-augmented models and a variety of retrieval-free models. We found that dialogue models enhanced with data simulated with KiDG largely outperform state-of-the-art retrieval-free methods, and it achieves comparable performance compared to retrieval-augmented methods while being better, and cheaper at domain transfer.
pdf
bib
abs
CONE: An Efficient COarse-to-fiNE Alignment Framework for Long Video Temporal Grounding
Zhijian Hou
|
Wanjun Zhong
|
Lei Ji
|
Difei Gao
|
Kun Yan
|
W.k. Chan
|
Chong-Wah Ngo
|
Mike Zheng Shou
|
Nan Duan
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
This paper tackles an emerging and challenging problem of long video temporal grounding (VTG) that localizes video moments related to a natural language (NL) query. Compared with short videos, long videos are also highly demanded but less explored, which brings new challenges in higher inference computation cost and weaker multi-modal alignment. To address these challenges, we propose CONE, an efficient COarse-to-fiNE alignment framework. CONE is a plug-and-play framework on top of existing VTG models to handle long videos through a sliding window mechanism. Specifically, CONE (1) introduces a query-guided window selection strategy to speed up inference, and (2) proposes a coarse-to-fine mechanism via a novel incorporation of contrastive learning to enhance multi-modal alignment for long videos. Extensive experiments on two large-scale long VTG benchmarks consistently show both substantial performance gains (e.g., from 3.13 to 6.87% on MAD) and state-of-the-art results. Analyses also reveal higher efficiency as the query-guided window selection mechanism accelerates inference time by 2x on Ego4D-NLQ and 15x on MAD while keeping SOTA results. Codes have been released at
https://github.com/houzhijian/CONE.
pdf
bib
abs
Towards Robust Personalized Dialogue Generation via Order-Insensitive Representation Regularization
Liang Chen
|
Hongru Wang
|
Yang Deng
|
Wai Chung Kwan
|
Zezhong Wang
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: ACL 2023
Generating persona consistent dialogue response is important for developing an intelligent conversational agent. Recent works typically fine-tune large-scale pre-trained models on this task by concatenating persona texts and dialogue history as a single input sequence to generate the target response. While simple and effective, our analysis shows that this popular practice is seriously affected by order sensitivity where different input orders of persona sentences significantly impact the quality and consistency of generated response, resulting in severe performance fluctuations (i.e., 29.4% on GPT2 and 83.2% on BART). To mitigate the order sensitivity problem, we propose a model-agnostic framework, ORder Insensitive Generation (ORIG), which enables dialogue models to learn robust representation under different persona orders and improve the consistency of response generation. Experiments on the Persona-Chat dataset justify the effectiveness and superiority of our method with two dominant pre-trained models (GPT2 and BART).
pdf
bib
abs
Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers
Wanjun Zhong
|
Tingting Ma
|
Jiahai Wang
|
Jian Yin
|
Tiejun Zhao
|
Chin-Yew Lin
|
Nan Duan
Findings of the Association for Computational Linguistics: ACL 2023
This paper presents ReasonFormer, a unified reasoning framework for mirroring the modular and compositional reasoning process of humans in complex decision-making. Inspired by dual-process theory in cognitive science, the representation module (automatic thinking) and reasoning modules (controlled thinking) are decoupled to capture different levels of cognition. Upon the top of the representation module, the pre-trained reasoning modules are modular and professional in specific and fundamental reasoning skills (e.g., logic, simple QA, etc). To mimic the controlled compositional thinking process, different reasoning modules are dynamically activated and composed in both parallel and cascaded manners to control what reasoning skills are activated and how deep the reasoning process will be reached to solve the current problems. The unified reasoning framework solves multiple tasks with a single model, and is trained and inferred in an end-to-end manner. Evaluated on 11 datasets requiring different reasoning skills and complexity, ReasonFormer demonstrates substantial performance boosts, revealing the compositional reasoning ability. Few-shot experiments exhibit better generalization ability by learning to compose pre-trained skills for new tasks with limited data, and decoupling the representation module and the reasoning modules. Further analysis shows the modularity of reasoning modules as different tasks activate distinct reasoning skills at different reasoning depths.
pdf
bib
abs
ReadPrompt: A Readable Prompting Method for Reliable Knowledge Probing
Zezhong Wang
|
Luyao Ye
|
Hongru Wang
|
Wai-Chung Kwan
|
David Ho
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023
Knowledge probing is a task to assess the knowledge encoded within pre-trained language models (PLMs) by having the PLM complete prompts such as “Italy is located in __,”. The model’s prediction precision serves as a lower bound for the amount of knowledge it contains. Subsequent works explore training a series of vectors as prompts to guide PLMs towards more accurate predictions. However, these methods compromise the readability of the prompts. We cannot directly understand these prompts from their literal meaning, making it difficult to verify whether they are correct. Consequently, the credibility of probing results derived from these prompts is diminished. To address the issue, we propose a novel method called ReadPrompt, which aims to identify meaningful sentences to serve as prompts. Experiments show that ReadPrompt achieves state-of-the-art performance on the current knowledge probing benchmark. Moreover, since the prompt is readable, we discovered a misalignment between constructed prompts and knowledge, which is also present in current prompting methods verified by an attack experiment. We claim that the probing outcomes of the current prompting methods are unreliable that overestimate the knowledge contained within PLMs.
pdf
bib
abs
Improving Factual Consistency for Knowledge-Grounded Dialogue Systems via Knowledge Enhancement and Alignment
Boyang Xue
|
Weichao Wang
|
Hongru Wang
|
Fei Mi
|
Rui Wang
|
Yasheng Wang
|
Lifeng Shang
|
Xin Jiang
|
Qun Liu
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023
Pretrained language models (PLMs) based knowledge-grounded dialogue systems are prone to generate responses that are factually inconsistent with the provided knowledge source. In such inconsistent responses, the dialogue models fail to accurately express the external factual knowledge they rely upon. Inspired by previous work which identified that feedforward networks (FFNs) within Transformers are responsible for factual knowledge expressions, we investigate two methods to efficiently improve the factual expression capability of FFNs by knowledge enhancement and alignment respectively. We first propose K-Dial, which explicitly introduces extended FFNs in Transformers to enhance factual knowledge expressions given the specific patterns of knowledge-grounded dialogue inputs. Additionally, we apply the reinforcement learning for factual consistency (RLFC) method to implicitly adjust FFNs’ expressions in responses by aligning with gold knowledge for the factual consistency preference. To comprehensively assess the factual consistency and dialogue quality of responses, we employ extensive automatic measures and human evaluations including sophisticated fine-grained NLI-based metrics. Experimental results on WoW and CMU_DoG datasets demonstrate that our methods efficiently enhance the ability of the FFN module to convey factual knowledge, validating the efficacy of improving factual consistency for knowledge-grounded dialogue systems.
pdf
bib
abs
Large Language Models as Source Planner for Personalized Knowledge-grounded Dialogues
Hongru Wang
|
Minda Hu
|
Yang Deng
|
Rui Wang
|
Fei Mi
|
Weichao Wang
|
Yasheng Wang
|
Wai-Chung Kwan
|
Irwin King
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023
Open-domain dialogue system usually requires different sources of knowledge to generate more informative and evidential responses. However, existing knowledge-grounded dialogue systems either focus on a single knowledge source or overlook the dependency between multiple sources of knowledge, which may result in generating inconsistent or even paradoxical responses. To incorporate multiple knowledge sources and dependencies between them, we propose SAFARI, a novel framework that leverages the exceptional capabilities of large language models (LLMs) in planning, understanding, and incorporating under both supervised and unsupervised settings. Specifically, SAFARI decouples the knowledge grounding into multiple sources and response generation, which allows easy extension to various knowledge sources including the possibility of not using any sources. To study the problem, we construct a personalized knowledge-grounded dialogue dataset Knowledge Behind Persona (KBP), which is the first to consider the dependency between persona and implicit knowledge. Experimental results on the KBP dataset demonstrate that the SAFARI framework can effectively produce persona-consistent and knowledge-enhanced responses.
pdf
bib
abs
Prompting and Evaluating Large Language Models for Proactive Dialogues: Clarification, Target-guided, and Non-collaboration
Yang Deng
|
Lizi Liao
|
Liang Chen
|
Hongru Wang
|
Wenqiang Lei
|
Tat-Seng Chua
Findings of the Association for Computational Linguistics: EMNLP 2023
Conversational systems based on Large Language Models (LLMs), such as ChatGPT, show exceptional proficiency in context understanding and response generation. However, they still possess limitations, such as failing to ask clarifying questions to ambiguous queries or refuse users’ unreasonable requests, both of which are considered as key aspects of a conversational agent’s proactivity. This raises the question of whether LLM-based conversational systems are equipped to handle proactive dialogue problems. In this work, we conduct a comprehensive analysis of LLM-based conversational systems, specifically focusing on three key aspects of proactive dialogues: clarification, target-guided, and non-collaborative dialogues. To trigger the proactivity of LLMs, we propose the Proactive Chain-of-Thought prompting scheme, which augments LLMs with the goal planning capability over descriptive reasoning chains. Empirical findings are discussed to promote future studies on LLM-based proactive dialogue systems.
pdf
bib
abs
Cue-CoT: Chain-of-thought Prompting for Responding to In-depth Dialogue Questions with LLMs
Hongru Wang
|
Rui Wang
|
Fei Mi
|
Yang Deng
|
Zezhong Wang
|
Bin Liang
|
Ruifeng Xu
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2023
Large Language Models (LLMs), such as ChatGPT, greatly empower dialogue systems with strong language understanding and generation capabilities. However, most of the previous works prompt the LLMs to directly generate a response based on the dialogue context, overlooking the underlying linguistic cues about the user status exhibited in the context. Such in-depth dialogue scenarios are challenging for existing LLMs to figure out the user’s hidden needs and respond satisfactorily through a single-step inference. To this end, we propose a novel linguistic cue-based chain-of-thoughts (Cue-CoT), which enhances the LLMs inference with an intermediate reasoning step to find cues exhibited in the dialogue, aiming to provide a more personalized and engaging response. To evaluate the approach, we build a benchmark with in-depth dialogue questions, consisting of 6 datasets in both Chinese and English, targeting 3 major linguistic cues during the conversation: personality, emotion, and psychology. We conducted experiments on the proposed benchmark with 5 LLMs under both zero-shot and one-shot settings. Empirical results demonstrate our proposed Cue-CoT method outperforms standard prompting methods in terms of both helpfulness and acceptability on all datasets.
pdf
bib
MCML: A Novel Memory-based Contrastive Meta-Learning Method for Few Shot Slot Tagging
Hongru Wang
|
Zezhong Wang
|
Wai Chung Kwan
|
Kam-Fai Wong
Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics (Volume 1: Long Papers)
2022
pdf
bib
abs
Logic-Driven Context Extension and Data Augmentation for Logical Reasoning of Text
Siyuan Wang
|
Wanjun Zhong
|
Duyu Tang
|
Zhongyu Wei
|
Zhihao Fan
|
Daxin Jiang
|
Ming Zhou
|
Nan Duan
Findings of the Association for Computational Linguistics: ACL 2022
Logical reasoning of text requires identifying critical logical structures in the text and performing inference over them. Existing methods for logical reasoning mainly focus on contextual semantics of text while struggling to explicitly model the logical inference process. In this paper, we not only put forward a logic-driven context extension framework but also propose a logic-driven data augmentation algorithm. The former follows a three-step reasoning paradigm, and each step is respectively to extract logical expressions as elementary reasoning units, symbolically infer the implicit expressions following equivalence laws and extend the context to validate the options. The latter augments literally similar but logically different instances and incorporates contrastive learning to better capture logical information, especially logical negative and conditional relationships. We conduct experiments on two benchmark datasets, ReClor and LogiQA. The results show that our method achieves state-of-the-art performance on both datasets, and even surpasses human performance on the ReClor dataset.
pdf
bib
abs
Analytical Reasoning of Text
Wanjun Zhong
|
Siyuan Wang
|
Duyu Tang
|
Zenan Xu
|
Daya Guo
|
Yining Chen
|
Jiahai Wang
|
Jian Yin
|
Ming Zhou
|
Nan Duan
Findings of the Association for Computational Linguistics: NAACL 2022
Analytical reasoning is an essential and challenging task that requires a system to analyze a scenario involving a set of particular circumstances and perform reasoning over it to make conclusions. However, current neural models with implicit reasoning ability struggle to solve this task. In this paper, we study the challenge of analytical reasoning of text and collect a new dataset consisting of questions from the Law School Admission Test from 1991 to 2016. We analyze what knowledge understanding and reasoning abilities are required to do well on this task, and present an approach dubbed ARM. It extracts knowledge such as participants and facts from the context. Such knowledge are applied to an inference engine to deduce legitimate solutions for drawing conclusions. In our experiments, we find that ubiquitous pre-trained models struggle to deal with this task as their performance is close to random guess. Results show that ARM outperforms pre-trained models significantly. Moreover, we demonstrate that ARM has better explicit interpretable reasoning ability.
pdf
bib
abs
Mixed-modality Representation Learning and Pre-training for Joint Table-and-Text Retrieval in OpenQA
Junjie Huang
|
Wanjun Zhong
|
Qian Liu
|
Ming Gong
|
Daxin Jiang
|
Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022
Retrieving evidences from tabular and textual resources is essential for open-domain question answering (OpenQA), which provides more comprehensive information. However, training an effective dense table-text retriever is difficult due to the challenges of table-text discrepancy and data sparsity problem. To address the above challenges, we introduce an optimized OpenQA Table-Text Retriever (OTTeR) to jointly retrieve tabular and textual evidences. Firstly, we propose to enhance mixed-modality representation learning via two mechanisms: modality-enhanced representation and mixed-modality negative sampling strategy. Secondly, to alleviate data sparsity problem and enhance the general retrieval ability, we conduct retrieval-centric mixed-modality synthetic pre-training. Experimental results demonstrate that OTTeR substantially improves the performance of table-and-text retrieval on the OTT-QA dataset. Comprehensive analyses examine the effectiveness of all the proposed mechanisms. Besides, equipped with OTTeR, our OpenQA system achieves the state-of-the-art result on the downstream QA task, with 10.1% absolute improvement in terms of the exact match over the previous best system.
pdf
bib
abs
DIGAT: Modeling News Recommendation with Dual-Graph Interaction
Zhiming Mao
|
Jian Li
|
Hongru Wang
|
Xingshan Zeng
|
Kam-Fai Wong
Findings of the Association for Computational Linguistics: EMNLP 2022
News recommendation (NR) is essential for online news services. Existing NR methods typically adopt a news-user representation learning framework, facing two potential limitations. First, in news encoder, single candidate news encoding suffers from an insufficient semantic information problem. Second, existing graph-based NR methods are promising but lack effective news-user feature interaction, rendering the graph-based recommendation suboptimal. To overcome these limitations, we propose dual-interactive graph attention networks (DIGAT) consisting of news- and user-graph channels. In the news-graph channel, we enrich the semantics of single candidate news by incorporating the semantically relevant news information with a semantic-augmented graph (SAG). In the user-graph channel, multi-level user interests are represented with a news-topic graph. Most notably, we design a dual-graph interaction process to perform effective feature interaction between the news and user graphs, which facilitates accurate news-user representation matching. Experiment results on the benchmark dataset MIND show that DIGAT outperforms existing news recommendation methods. Further ablation studies and analyses validate the effectiveness of (1) semantic-augmented news graph modeling and (2) dual-graph interaction.
pdf
bib
TopicRefine: Joint Topic Prediction and Dialogue Response Generation for Multi-turn End-to-End Dialogue System
Hongru Wang
|
Mingyu Cui
|
Zimo Zhou
|
Kam-Fai Wong
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)
pdf
bib
Prior Omission of Dissimilar Source Domain(s) for Cost-Effective Few-Shot Learning
Zezhong Wang
|
Hongru Wang
|
Wai Chung Kwan
|
Kam-Fai Wong
Proceedings of the 5th International Conference on Natural Language and Speech Processing (ICNLSP 2022)
pdf
bib
abs
ProQA: Structural Prompt-based Pre-training for Unified Question Answering
Wanjun Zhong
|
Yifan Gao
|
Ning Ding
|
Yujia Qin
|
Zhiyuan Liu
|
Ming Zhou
|
Jiahai Wang
|
Jian Yin
|
Nan Duan
Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies
Question Answering (QA) is a longstanding challenge in natural language processing. Existing QA works mostly focus on specific question types, knowledge domains, or reasoning skills. The specialty in QA research hinders systems from modeling commonalities between tasks and generalization for wider applications. To address this issue, we present ProQA, a unified QA paradigm that solves various tasks through a single model. ProQA takes a unified structural prompt as the bridge and improves the QA-centric ability by structural prompt-based pre-training. Through a structurally designed prompt-based input schema, ProQA concurrently models the knowledge generalization for all QA tasks while keeping the knowledge customization for every specific QA task. Furthermore, ProQA is pre-trained with structural prompt-formatted large-scale synthesized corpus, which empowers the model with the commonly-required QA ability. Experimental results on 11 QA benchmarks demonstrate that ProQA consistently boosts performance on both full data fine-tuning, few-shot learning, and zero-shot testing scenarios. Furthermore, ProQA exhibits strong ability in both continual learning and transfer learning by taking the advantages of the structural prompt.
2021
pdf
bib
abs
Compare to The Knowledge: Graph Neural Fake News Detection with External Knowledge
Linmei Hu
|
Tianchi Yang
|
Luhao Zhang
|
Wanjun Zhong
|
Duyu Tang
|
Chuan Shi
|
Nan Duan
|
Ming Zhou
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Nowadays, fake news detection, which aims to verify whether a news document is trusted or fake, has become urgent and important. Most existing methods rely heavily on linguistic and semantic features from the news content, and fail to effectively exploit external knowledge which could help determine whether the news document is trusted. In this paper, we propose a novel end-to-end graph neural model called CompareNet, which compares the news to the knowledge base (KB) through entities for fake news detection. Considering that fake news detection is correlated with topics, we also incorporate topics to enrich the news representation. Specifically, we first construct a directed heterogeneous document graph for each news incorporating topics and entities. Based on the graph, we develop a heterogeneous graph attention network for learning the topic-enriched news representation as well as the contextual entity representations that encode the semantics of the news content. The contextual entity representations are then compared to the corresponding KB-based entity representations through a carefully designed entity comparison network, to capture the consistency between the news content and KB. Finally, the topic-enriched news representation combining the entity comparison features is fed into a fake news classifier. Experimental results on two benchmark datasets demonstrate that CompareNet significantly outperforms state-of-the-art methods.
pdf
bib
abs
Syntax-Enhanced Pre-trained Model
Zenan Xu
|
Daya Guo
|
Duyu Tang
|
Qinliang Su
|
Linjun Shou
|
Ming Gong
|
Wanjun Zhong
|
Xiaojun Quan
|
Daxin Jiang
|
Nan Duan
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
We study the problem of leveraging the syntactic structure of text to enhance pre-trained models such as BERT and RoBERTa. Existing methods utilize syntax of text either in the pre-training stage or in the fine-tuning stage, so that they suffer from discrepancy between the two stages. Such a problem would lead to the necessity of having human-annotated syntactic information, which limits the application of existing methods to broader scenarios. To address this, we present a model that utilizes the syntax of text in both pre-training and fine-tuning stages. Our model is based on Transformer with a syntax-aware attention layer that considers the dependency tree of the text. We further introduce a new pre-training task of predicting the syntactic distance among tokens in the dependency tree. We evaluate the model on three downstream tasks, including relation classification, entity typing, and question answering. Results show that our model achieves state-of-the-art performance on six public benchmark datasets. We have two major findings. First, we demonstrate that infusing automatically produced syntax of text improves pre-trained models. Second, global syntactic distances among tokens bring larger performance gains compared to local head relations between contiguous tokens.
pdf
bib
UserAdapter: Few-Shot User Learning in Sentiment Analysis
Wanjun Zhong
|
Duyu Tang
|
Jiahai Wang
|
Jian Yin
|
Nan Duan
Findings of the Association for Computational Linguistics: ACL-IJCNLP 2021
pdf
bib
abs
WhiteningBERT: An Easy Unsupervised Sentence Embedding Approach
Junjie Huang
|
Duyu Tang
|
Wanjun Zhong
|
Shuai Lu
|
Linjun Shou
|
Ming Gong
|
Daxin Jiang
|
Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2021
Producing the embedding of a sentence in anunsupervised way is valuable to natural language matching and retrieval problems in practice. In this work, we conduct a thorough examination of pretrained model based unsupervised sentence embeddings. We study on fourpretrained models and conduct massive experiments on seven datasets regarding sentence semantics. We have three main findings. First, averaging all tokens is better than only using [CLS] vector. Second, combining both topand bottom layers is better than only using toplayers. Lastly, an easy whitening-based vector normalization strategy with less than 10 linesof code consistently boosts the performance. The whole project including codes and data is publicly available at
https://github.com/Jun-jie-Huang/WhiteningBERT.
2020
pdf
bib
abs
LogicalFactChecker: Leveraging Logical Operations for Fact Checking with Graph Module Network
Wanjun Zhong
|
Duyu Tang
|
Zhangyin Feng
|
Nan Duan
|
Ming Zhou
|
Ming Gong
|
Linjun Shou
|
Daxin Jiang
|
Jiahai Wang
|
Jian Yin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Verifying the correctness of a textual statement requires not only semantic reasoning about the meaning of words, but also symbolic reasoning about logical operations like count, superlative, aggregation, etc. In this work, we propose LogicalFactChecker, a neural network approach capable of leveraging logical operations for fact checking. It achieves the state-of-the-art performance on TABFACT, a large-scale, benchmark dataset built for verifying a textual statement with semi-structured tables. This is achieved by a graph module network built upon the Transformer-based architecture. With a textual statement and a table as the input, LogicalFactChecker automatically derives a program (a.k.a. logical form) of the statement in a semantic parsing manner. A heterogeneous graph is then constructed to capture not only the structures of the table and the program, but also the connections between inputs with different modalities. Such a graph reveals the related contexts of each word in the statement, the table and the program. The graph is used to obtain graph-enhanced contextual representations of words in Transformer-based architecture. After that, a program-driven module network is further introduced to exploit the hierarchical structure of the program, where semantic compositionality is dynamically modeled along the program structure with a set of function-specific modules. Ablation experiments suggest that both the heterogeneous graph and the module network are important to obtain strong results.
pdf
bib
abs
Reasoning Over Semantic-Level Graph for Fact Checking
Wanjun Zhong
|
Jingjing Xu
|
Duyu Tang
|
Zenan Xu
|
Nan Duan
|
Ming Zhou
|
Jiahai Wang
|
Jian Yin
Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics
Fact checking is a challenging task because verifying the truthfulness of a claim requires reasoning about multiple retrievable evidence. In this work, we present a method suitable for reasoning about the semantic-level structure of evidence. Unlike most previous works, which typically represent evidence sentences with either string concatenation or fusing the features of isolated evidence sentences, our approach operates on rich semantic structures of evidence obtained by semantic role labeling. We propose two mechanisms to exploit the structure of evidence while leveraging the advances of pre-trained models like BERT, GPT or XLNet. Specifically, using XLNet as the backbone, we first utilize the graph structure to re-define the relative distances of words, with the intuition that semantically related words should have short distances. Then, we adopt graph convolutional network and graph attention network to propagate and aggregate information from neighboring nodes on the graph. We evaluate our system on FEVER, a benchmark dataset for fact checking, and find that rich structural information is helpful and both our graph-based mechanisms improve the accuracy. Our model is the state-of-the-art system in terms of both official evaluation metrics, namely claim verification accuracy and FEVER score.
pdf
bib
abs
Neural Deepfake Detection with Factual Structure of Text
Wanjun Zhong
|
Duyu Tang
|
Zenan Xu
|
Ruize Wang
|
Nan Duan
|
Ming Zhou
|
Jiahai Wang
|
Jian Yin
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
Deepfake detection, the task of automatically discriminating machine-generated text, is increasingly critical with recent advances in natural language generative models. Existing approaches to deepfake detection typically represent documents with coarse-grained representations. However, they struggle to capture factual structures of documents, which is a discriminative factor between machine-generated and human-written text according to our statistical analysis. To address this, we propose a graph-based model that utilizes the factual structure of a document for deepfake detection of text. Our approach represents the factual structure of a given document as an entity graph, which is further utilized to learn sentence representations with a graph neural network. Sentence representations are then composed to a document representation for making predictions, where consistent relations between neighboring sentences are sequentially modeled. Results of experiments on two public deepfake datasets show that our approach significantly improves strong base models built with RoBERTa. Model analysis further indicates that our model can distinguish the difference in the factual structure between machine-generated text and human-written text.
pdf
bib
abs
Leveraging Declarative Knowledge in Text and First-Order Logic for Fine-Grained Propaganda Detection
Ruize Wang
|
Duyu Tang
|
Nan Duan
|
Wanjun Zhong
|
Zhongyu Wei
|
Xuanjing Huang
|
Daxin Jiang
|
Ming Zhou
Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP)
We study the detection of propagandistic text fragments in news articles. Instead of merely learning from input-output datapoints in training data, we introduce an approach to inject declarative knowledge of fine-grained propaganda techniques. Specifically, we leverage the declarative knowledge expressed in both first-order logic and natural language. The former refers to the logical consistency between coarse- and fine-grained predictions, which is used to regularize the training process with propositional Boolean expressions. The latter refers to the literal definition of each propaganda technique, which is utilized to get class representations for regularizing the model parameters. We conduct experiments on Propaganda Techniques Corpus, a large manually annotated dataset for fine-grained propaganda detection. Experiments show that our method achieves superior performance, demonstrating that leveraging declarative knowledge can help the model to make more accurate predictions.
pdf
bib
abs
CUHK at SemEval-2020 Task 4: CommonSense Explanation, Reasoning and Prediction with Multi-task Learning
Hongru Wang
|
Xiangru Tang
|
Sunny Lai
|
Kwong Sak Leung
|
Jia Zhu
|
Gabriel Pui Cheong Fung
|
Kam-Fai Wong
Proceedings of the Fourteenth Workshop on Semantic Evaluation
This paper describes our system submitted to task 4 of SemEval 2020: Commonsense Validation and Explanation (ComVE) which consists of three sub-tasks. The task is to directly validate the given sentence whether or not to make sense and require the model to explain it. Based on BERT architecture with the multi-task setting, we propose an effective and interpretable “Explain, Reason and Predict” (ERP) system to solve the three sub-tasks about commonsense: (a) Validation, (b) Reasoning, and (c) Explanation. Inspired by cognitive studies of common sense, our system first generates a reason or understanding of the sentences and then choose which one statement makes sense, which is achieved by multi-task learning. During the post-evaluation, our system has reached 92.9% accuracy in subtask A (rank 11), 89.7% accuracy in subtask B (rank 9), and BLEU score of 12.9 in subtask C (rank 8).