2025
EvoWiki: Evaluating LLMs on Evolving Knowledge
Wei Tang
|
Yixin Cao
|
Yang Deng
|
Jiahao Ying
|
Bo Wang
|
Yizhe Yang
|
Yuyue Zhao
|
Qi Zhang
|
Xuanjing Huang
|
Yu-Gang Jiang
|
Yong Liao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Knowledge utilization is a critical aspect of LLMs, and understanding how they adapt to evolving knowledge is essential for their effective deployment. However, existing benchmarks are predominantly static, failing to capture the evolving nature of LLMs and knowledge, leading to inaccuracies and vulnerabilities such as contamination. In this paper, we introduce EvoWiki, an evolving dataset designed to reflect knowledge evolution by categorizing information into stable, evolved, and uncharted states. EvoWiki is fully auto-updatable, enabling precise evaluation of continuously changing knowledge and newly released LLMs. Through experiments with Retrieval-Augmented Generation (RAG) and Continual Learning (CL), we evaluate how effectively LLMs adapt to evolving knowledge. Our results indicate that current models often struggle with evolved knowledge, frequently providing outdated or incorrect responses. Moreover, the dataset highlights a synergistic effect between RAG and CL, demonstrating their potential to better adapt to evolving knowledge. EvoWiki provides a robust benchmark for advancing future research on the knowledge evolution capabilities of large language models.
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Jianlyu Chen
|
Nan Wang
|
Chaofan Li
|
Bo Wang
|
Shitao Xiao
|
Han Xiao
|
Hao Liao
|
Defu Lian
|
Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
Decoding Knowledge Attribution in Mixture-of-Experts: A Framework of Basic-Refinement Collaboration and Efficiency Analysis
Junzhuo Li
|
Bo Wang
|
Xiuze Zhou
|
Peijie Jiang
|
Jia Liu
|
Xuming Hu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The interpretability of Mixture-of-Experts (MoE) models, especially those with heterogeneous designs, remains underexplored. Existing attribution methods for dense models fail to capture dynamic routing-expert interactions in sparse MoE architectures. To address this issue, we propose a cross-level attribution algorithm to analyze sparse MoE architectures (Qwen 1.5-MoE, OLMoE, Mixtral-8x7B) against dense models (Qwen 1.5-7B, Llama-7B, Mistral-7B). Results show MoE models achieve 31% higher per-layer efficiency via a “mid-activation, late-amplification” pattern: early layers screen experts, while late layers refine knowledge collaboratively. Ablation studies reveal a “basic-refinement” framework—shared experts handle general tasks (entity recognition), while routed experts specialize in domain-specific processing (geographic attributes). Semantic-driven routing is evidenced by strong correlations between attention heads and experts (r=0.68), enabling task-aware coordination. Notably, architectural depth dictates robustness: deep Qwen-MoE mitigates expert failures (e.g., 43% MRR drop in geographic tasks when blocking top-10 experts) through shared expert redundancy, whereas shallow OLMoE suffers severe degradation (76% drop). Task sensitivity further guides design: core-sensitive tasks (geography) require concentrated expertise, while distributed-tolerant tasks (object attributes) leverage broader participation. These insights advance MoE interpretability, offering principles to balance efficiency, specialization, and robustness.
Unveil: Unified Visual-Textual Integration and Distillation for Multi-modal Document Retrieval
Hao Sun
|
Yingyan Hou
|
Jiayan Guo
|
Bo Wang
|
Chunyu Yang
|
Jinsong Ni
|
Yan Zhang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Document retrieval in real-world scenarios faces significant challenges due to diverse document formats and modalities. Traditional text-based approaches rely on tailored parsing techniques that disregard layout information and are prone to errors, while recent parsing-free visual methods often struggle to capture fine-grained textual semantics in text-rich scenarios. To address these limitations, we propose Unveil, a novel visual-textual embedding framework that effectively integrates textual and visual features for robust document representation. Through knowledge distillation, we transfer the semantic understanding capabilities from the visual-textual embedding model to a purely visual model, enabling efficient parsing-free retrieval while preserving semantic fidelity. Experimental results demonstrate that our visual-textual embedding method surpasses existing approaches, while knowledge distillation successfully bridges the performance gap between visual-textual and visual-only methods, improving both retrieval accuracy and efficiency.
Unveiling Privacy Risks in LLM Agent Memory
Bo Wang
|
Weiyi He
|
Shenglai Zeng
|
Zhen Xiang
|
Yue Xing
|
Jiliang Tang
|
Pengfei He
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Model (LLM) agents have become increasingly prevalent across various real-world applications. They enhance decision-making by storing private user-agent interactions in the memory module for demonstrations, introducing new privacy risks for LLM agents. In this work, we systematically investigate the vulnerability of LLM agents to our proposed Memory EXTRaction Attack (MEXTRA) under a black-box setting. To extract private information from memory, we propose an effective attacking prompt design and an automated prompt generation method based on different levels of knowledge about the LLM agent. Experiments on two representative agents demonstrate the effectiveness of MEXTRA. Moreover, we explore key factors influencing memory leakage from both the agent designer’s and the attacker’s perspectives. Our findings highlight the urgent need for effective memory safeguards in LLM agent design and deployment.
DualRAG: A Dual-Process Approach to Integrate Reasoning and Retrieval for Multi-Hop Question Answering
Rong Cheng
|
Jinyi Liu
|
Yan Zheng
|
Fei Ni
|
Jiazhen Du
|
Hangyu Mao
|
Fuzheng Zhang
|
Bo Wang
|
Jianye Hao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multi-Hop Question Answering (MHQA) tasks permeate real-world applications, posing challenges in orchestrating multi-step reasoning across diverse knowledge domains. While existing approaches have been improved with iterative retrieval, they still struggle to identify and organize dynamic knowledge. To address this, we propose DualRAG, a synergistic dual-process framework that seamlessly integrates reasoning and retrieval. DualRAG operates through two tightly coupled processes: Reasoning-augmented Querying (RaQ) and progressive Knowledge Aggregation (pKA). They work in concert: as RaQ navigates the reasoning path and generates targeted queries, pKA ensures that newly acquired knowledge is systematically integrated to support coherent reasoning. This creates a virtuous cycle of knowledge enrichment and reasoning refinement. Through targeted fine-tuning, DualRAG preserves its sophisticated reasoning and retrieval capabilities even in smaller-scale models, demonstrating its versatility and core advantages across different scales. Extensive experiments demonstrate that this dual-process approach substantially improves answer accuracy and coherence, approaching, and in some cases surpassing, the performance achieved with oracle knowledge access. These results establish DualRAG as a robust and efficient solution for complex multi-hop reasoning tasks.
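To make the dual-process description above concrete, here is a compressed sketch of a reason-then-retrieve loop in that spirit: one step reasons over what is known and either answers or emits the next targeted query, while the other folds newly retrieved passages into an aggregated knowledge list. This is an illustration only, not the authors' implementation; `llm`, `retrieve`, the prompt wording, and the hop limit are all assumptions.

```python
# Hypothetical sketch of a reasoning-augmented querying / knowledge-aggregation loop
# in the spirit of DualRAG; `llm` and `retrieve` are assumed stand-ins.
def dual_rag(question, llm, retrieve, max_hops=4):
    knowledge = []  # progressively aggregated evidence (the pKA side)
    for _ in range(max_hops):
        # Reasoning-augmented querying (the RaQ side): reason over known facts
        # and either answer or emit the next targeted sub-query.
        step = llm(
            f"Question: {question}\n"
            f"Known facts: {knowledge}\n"
            "If the facts suffice, reply 'ANSWER: <answer>'. "
            "Otherwise reply 'QUERY: <next retrieval query>'."
        )
        if step.startswith("ANSWER:"):
            return step.removeprefix("ANSWER:").strip()
        query = step.removeprefix("QUERY:").strip()
        # Aggregation: fold newly retrieved passages into the running state.
        knowledge.extend(retrieve(query, k=3))
    return llm(f"Question: {question}\nKnown facts: {knowledge}\nAnswer concisely.")
```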
ECC: Synergizing Emotion, Cause and Commonsense for Empathetic Dialogue Generation
Xu Wang
|
Bo Wang
|
Yihong Tang
|
Dongming Zhao
|
Jing Liu
|
Ruifang He
|
Yuexian Hou
Proceedings of the 31st International Conference on Computational Linguistics
Empathy improves human-machine dialogue systems by enhancing the user’s experience. While traditional models have aimed to detect and express users’ emotions from dialogue history, they neglect the crucial and complex interactions among emotion, emotion causes, and commonsense. To address this, we introduce the ECC (Emotion, Cause, and Commonsense) framework, which leverages specialized encoders to capture the key features of emotion, cause, and commonsense and collaboratively models these through a Conditional Variational Auto-Encoder. ECC further employs novel loss functions to refine the interplay of three factors and generates empathetic responses using an energy-based model supported by ODE sampling. Empirical results on the EmpatheticDialogues dataset demonstrate that ECC outperforms existing baselines, offering a robust solution for empathetic dialogue generation.
RoleBreak: Character Hallucination as a Jailbreak Attack in Role-Playing Systems
Yihong Tang
|
Bo Wang
|
Xu Wang
|
Dongming Zhao
|
Jing Liu
|
Ruifang He
|
Yuexian Hou
Proceedings of the 31st International Conference on Computational Linguistics
Role-playing systems powered by large language models (LLMs) have become increasingly influential in emotional communication applications. However, these systems are susceptible to character hallucinations, where the model deviates from predefined character roles and generates responses that are inconsistent with the intended persona. This paper presents the first systematic analysis of character hallucination from an attack perspective, introducing the RoleBreak framework. Our framework identifies two core mechanisms—query sparsity and role-query conflict—as key factors driving character hallucination. Leveraging these insights, we construct a novel dataset, RoleBreakEval, to evaluate existing hallucination mitigation techniques. Our experiments reveal that even enhanced models trained to minimize hallucination remain vulnerable to attacks. To address these vulnerabilities, we propose a novel defence strategy, the Narrator Mode, which generates supplemental context through narration to mitigate role-query conflicts and improve query generalization. Experimental results demonstrate that Narrator Mode significantly outperforms traditional refusal-based strategies by reducing hallucinations, enhancing fidelity to character roles and queries, and improving overall narrative coherence.
Mitigating the Discrepancy Between Video and Text Temporal Sequences: A Time-Perception Enhanced Video Grounding method for LLM
Xuefen Li
|
Bo Wang
|
Ge Shi
|
Chong Feng
|
Jiahao Teng
Proceedings of the 31st International Conference on Computational Linguistics
Existing video LLMs typically excel at capturing the overall description of a video but lack the ability to demonstrate an understanding of temporal dynamics and a fine-grained grasp of localized content within the video. In this paper, we propose a Time-Perception Enhanced Video Grounding via Boundary Perception and Temporal Reasoning aimed at mitigating LLMs’ difficulties in understanding the discrepancies between video and text temporality. Specifically, to address the inherent biases in current datasets, we design a series of boundary-perception tasks to enable LLMs to capture accurate video temporality. To tackle LLMs’ insufficient understanding of temporal information, we develop specialized tasks for boundary perception and temporal relationship reasoning to deepen LLMs’ perception of video temporality. Our experimental results show significant improvements across three datasets: ActivityNet, Charades, and DiDeMo (achieving up to 11.2% improvement on R@0.3), demonstrating the effectiveness of our proposed temporal awareness-enhanced data construction method.
Explicit vs. Implicit: Investigating Social Bias in Large Language Models through Self-Reflection
Yachao Zhao
|
Bo Wang
|
Yan Wang
|
Dongming Zhao
|
Ruifang He
|
Yuexian Hou
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have been shown to exhibit various biases and stereotypes in their generated content. While extensive research has investigated biases in LLMs, prior work has predominantly focused on explicit bias, with minimal attention to implicit bias and the relation between these two forms of bias. This paper presents a systematic framework grounded in social psychology theories to investigate and compare explicit and implicit biases in LLMs. We propose a novel self-reflection-based evaluation framework that operates in two phases: first measuring implicit bias through simulated psychological assessment methods, then evaluating explicit bias by prompting LLMs to analyze their own generated content. Through extensive experiments on advanced LLMs across multiple social dimensions, we demonstrate that LLMs exhibit a substantial inconsistency between explicit and implicit biases: while explicit bias manifests as mild stereotypes, implicit bias exhibits strong stereotypes. We further investigate the underlying factors contributing to this explicit-implicit bias inconsistency, examining the effects of training data scale, model size, and alignment techniques. Experimental results indicate that while explicit bias declines with increased training data and model size, implicit bias exhibits a contrasting upward trend. Moreover, contemporary alignment methods effectively suppress explicit bias but show limited efficacy in mitigating implicit bias.
The Rise of Darkness: Safety-Utility Trade-Offs in Role-Playing Dialogue Agents
Yihong Tang
|
Kehai Chen
|
Xuefeng Bai
|
Zheng-Yu Niu
|
Bo Wang
|
Jie Liu
|
Min Zhang
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) have made remarkable advances in role-playing dialogue agents, demonstrating their utility in character simulations. However, it remains challenging for these agents to balance character portrayal utility with content safety because this essential character simulation often comes with the risk of generating unsafe content. To address this issue, we first conduct a systematic exploration of the safety-utility trade-off across multiple LLMs. Our analysis reveals that risk scenarios created by villain characters and user queries (referred to as risk coupling) contribute to this trade-off. Building on this, we propose a novel Adaptive Dynamic Multi-Preference (ADMP) method, which dynamically adjusts safety-utility preferences based on the degree of risk coupling and guides the model to generate responses biased toward utility or safety. We further introduce Coupling Margin Sampling (CMS) into coupling detection to enhance the model’s ability to handle high-risk scenarios. Experimental results demonstrate that our approach improves safety metrics while maintaining utility.
Dynamic Personality in LLM Agents: A Framework for Evolutionary Modeling and Behavioral Analysis in the Prisoner’s Dilemma
Weiqi Zeng
|
Bo Wang
|
Dongming Zhao
|
Zongfeng Qu
|
Ruifang He
|
Yuexian Hou
|
Qinghua Hu
Findings of the Association for Computational Linguistics: ACL 2025
Using Large Language Model agents to simulate human game behaviors offers valuable insights for human social psychology in anthropomorphic AI research. While current models rely on static personality traits, real-world evidence shows personality evolves through environmental feedback. Recent work introduced dynamic personality traits but lacked natural selection processes and direct psychological metrics, failing to accurately capture authentic dynamic personality variations. To address these limitations, we propose an enhanced framework within the Prisoner’s Dilemma, a socially significant scenario. By using game payoffs as environmental feedback, we drive adaptive personality evolution and analyze correlations between personality metrics and behavior. Our framework reveals new behavioral patterns of agents and evaluates personality-behavior relationships, advancing agent-based social simulations and human-AI symbiosis research.
Cognitive Mirroring for DocRE: A Self-Supervised Iterative Reflection Framework with Triplet-Centric Explicit and Implicit Feedback
Xu Han
|
Bo Wang
|
Yueheng Sun
|
Dongming Zhao
|
Zongfeng Qu
|
Ruifang He
|
Yuexian Hou
|
Qinghua Hu
Proceedings of the 1st Joint Workshop on Large Language Models and Structure Modeling (XLLM 2025)
Large language models (LLMs) have advanced document-level relation extraction (DocRE), but DocRE is more complex than sentence-level relation extraction (SentRE), facing challenges like diverse relation types, coreference resolution and long-distance dependencies. Traditional pipeline methods, which detect relations before generating triplets, often propagate errors and harm performance. Meanwhile, fine-tuning methods require extensive human-annotated data, and in-context learning (ICL) underperforms compared to supervised approaches. We propose an iterative reflection framework for DocRE, inspired by human non-linear reading cognition. The framework leverages explicit and implicit relations between triplets to provide feedback for LLMs refinement. Explicit feedback uses logical rules-based reasoning, while implicit feedback reconstructs triplets into documents for comparison. This dual-process iteration mimics human semantic cognition, enabling dynamic optimization through self-generated supervision. For the first time, this achieves zero-shot performance comparable to fully supervised models. Experiments show our method surpasses existing LLM-based approaches and matches state-of-the-art BERT-based methods.
2024
WangLab at MEDIQA-CORR 2024: Optimized LLM-based Programs for Medical Error Detection and Correction
Augustin Toma
|
Ronald Xie
|
Steven Palayew
|
Patrick Lawler
|
Bo Wang
Proceedings of the 6th Clinical Natural Language Processing Workshop
Medical errors in clinical text pose significant risks to patient safety. The MEDIQA-CORR 2024 shared task focuses on detecting and correcting these errors across three subtasks: identifying the presence of an error, extracting the erroneous sentence, and generating a corrected sentence. In this paper, we present our approach that achieved top performance in all three subtasks. For the MS dataset, which contains subtle errors, we developed a retrieval-based system leveraging external medical question-answering datasets. For the UW dataset, reflecting more realistic clinical notes, we created a pipeline of modules to detect, localize, and correct errors. Both approaches utilized the DSPy framework for optimizing prompts and few-shot examples in large language model (LLM) based programs. Our results demonstrate the effectiveness of LLM based programs for medical error correction. However, our approach has limitations in addressing the full diversity of potential errors in medical documentation. We discuss the implications of our work and highlight future research directions to advance the robustness and applicability of medical error detection and correction systems.
WangLab at MEDIQA-M3G 2024: Multimodal Medical Answer Generation using Large Language Models
Ronald Xie
|
Steven Palayew
|
Augustin Toma
|
Gary Bader
|
Bo Wang
Proceedings of the 6th Clinical Natural Language Processing Workshop
This paper outlines our submission to the MEDIQA2024 Multilingual and Multimodal Medical Answer Generation (M3G) shared task. We report results for two standalone solutions under the English category of the task, the first involving two consecutive API calls to the Claude 3 Opus API and the second involving training an image-disease label joint embedding in the style of CLIP for image classification. These two solutions scored 1st and 2nd place respectively on the competition leaderboard, substantially outperforming the next best solution. Additionally, we discuss insights gained from post-competition experiments. While the performance of these two described solutions has significant room for improvement due to the difficulty of the shared task and the challenging nature of medical visual question answering in general, we identify the multi-stage LLM approach and the CLIP image classification approach as promising avenues for further investigation.
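For readers unfamiliar with "joint embedding in the style of CLIP", the sketch below shows the standard symmetric contrastive objective over paired image and disease-label embeddings. It is a generic illustration under the assumption of batched, paired embeddings, not the submission's actual training code.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(image_emb, label_emb, temperature=0.07):
    """Symmetric CLIP-style contrastive loss for paired image / disease-label
    embeddings of shape (B, d); a generic sketch, not the WangLab code."""
    image_emb = F.normalize(image_emb, dim=-1)
    label_emb = F.normalize(label_emb, dim=-1)
    logits = image_emb @ label_emb.T / temperature   # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Matching pairs lie on the diagonal; contrast in both directions.
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2
```

At inference time, classification reduces to embedding an image and picking the disease label whose embedding is most similar.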
MORPHEUS: Modeling Role from Personalized Dialogue History by Exploring and Utilizing Latent Space
Yihong Tang
|
Bo Wang
|
Dongming Zhao
|
Jinxiaojia Jinxiaojia
|
Zhangjijun Zhangjijun
|
Ruifang He
|
Yuexian Hou
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Personalized Dialogue Generation (PDG) aims to create coherent responses according to roles or personas. Traditional PDG relies on external role data, which can be scarce and raise privacy concerns. Other approaches address these issues by extracting role information from dialogue history, but they often fail to model roles generically in continuous space. To overcome these limitations, we introduce a novel framework that Models Roles from Personalized Dialogue History by Exploring and Utilizing Latent Space (MORPHEUS) through a three-stage training process. Specifically, we create a persona codebook to represent roles in latent space compactly, and this codebook is used to construct a posterior distribution of role information. This method enables the model to generalize across roles, allowing the generation of personalized dialogues even for unseen roles. Experiments on both Chinese and English datasets demonstrate that MORPHEUS enhances the extraction of role information and improves response generation without external role data. Additionally, MORPHEUS can be considered an efficient fine-tuning method for large language models.
AdaSwitch: Adaptive Switching between Small and Large Agents for Effective Cloud-Local Collaborative Learning
Hao Sun
|
Jiayi Wu
|
Hengyi Cai
|
Xiaochi Wei
|
Yue Feng
|
Bo Wang
|
Shuaiqiang Wang
|
Yan Zhang
|
Dawei Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advancements in large language models (LLMs) have been remarkable. Users face a choice between using cloud-based LLMs for generation quality and deploying local LLMs for lower computational cost. The former option is typically costly and inefficient, while the latter usually fails to deliver satisfactory performance for reasoning steps requiring deliberate thought processes. In this work, we propose a novel LLM utilization paradigm that facilitates the collaborative operation of large cloud-based LLMs and smaller locally deployed LLMs. Our framework comprises two primary modules: the local agent instantiated with a relatively smaller LLM, handling less complex reasoning steps, and the cloud agent equipped with a larger LLM, managing more intricate reasoning steps. This collaborative processing is enabled through an adaptive mechanism where the local agent introspectively identifies errors and proactively seeks assistance from the cloud agent, thereby effectively integrating the strengths of both locally deployed and cloud-based LLMs, resulting in significant enhancements in task completion performance and efficiency. We evaluate AdaSwitch across 7 benchmarks, ranging from mathematical reasoning to complex question answering, using various types of LLMs to instantiate the local and cloud agents. The empirical results show that AdaSwitch effectively improves the performance of the local agent, and sometimes achieves competitive results compared to the cloud agent while utilizing much less computational overhead.
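The adaptive hand-off described above can be summarized in a few lines: the small model drafts, introspects on its own draft, and escalates to the larger model only when it judges the draft flawed. The sketch below is an illustration under assumed callables (`local_llm`, `cloud_llm`) and made-up prompts, not the AdaSwitch implementation.

```python
# Illustrative local/cloud switching in the spirit of AdaSwitch.
def adaswitch_answer(question, local_llm, cloud_llm):
    draft = local_llm(f"Solve step by step:\n{question}")
    # Introspective check: the small model judges its own reasoning.
    verdict = local_llm(
        f"Question: {question}\nDraft solution:\n{draft}\n"
        "Does the draft contain a reasoning error? Reply YES or NO."
    )
    if verdict.strip().upper().startswith("YES"):
        # Escalate only the hard cases to the larger cloud model.
        return cloud_llm(
            f"Question: {question}\nA smaller model drafted:\n{draft}\n"
            "The draft may be flawed; produce a corrected, complete solution."
        )
    return draft
```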
Retrieved In-Context Principles from Previous Mistakes
Hao Sun
|
Yong Jiang
|
Bo Wang
|
Yingyan Hou
|
Yan Zhang
|
Pengjun Xie
|
Fei Huang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
In-context learning (ICL) has been instrumental in adapting large language models (LLMs) to downstream tasks using correct input-output examples. Recent advances have attempted to improve model performance through principles derived from mistakes, yet these approaches suffer from lack of customization and inadequate error coverage. To address these limitations, we propose Retrieved In-Context Principles (RICP), a novel teacher-student framework. In RICP, the teacher model analyzes mistakes from the student model to generate reasons and insights for preventing similar mistakes. These mistakes are clustered based on their underlying reasons for developing task-level principles, enhancing the error coverage of principles. During inference, the most relevant mistakes for each question are retrieved to create question-level principles, improving the customization of the provided guidance. RICP is orthogonal to existing prompting methods and does not require intervention from the teacher model during inference. Experimental results across seven reasoning benchmarks reveal that RICP effectively enhances performance when applied to various prompting strategies.
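The "question-level principles" step above amounts to nearest-neighbor retrieval over previously analyzed mistakes. The toy sketch below shows that retrieval under assumptions: `embed` is a stand-in sentence-embedding function and `mistake_bank` a list of (question, principle) pairs produced by the teacher model; it is not the RICP codebase.

```python
import numpy as np

def question_level_principles(question, mistake_bank, embed, top_k=3):
    """Retrieve principles attached to the prior mistakes most similar to the
    new question (cosine similarity); an illustrative sketch only."""
    q_vec = embed(question)
    m_vecs = np.stack([embed(q) for q, _ in mistake_bank])
    sims = m_vecs @ q_vec / (np.linalg.norm(m_vecs, axis=1) * np.linalg.norm(q_vec))
    best = np.argsort(-sims)[:top_k]
    return [mistake_bank[i][1] for i in best]
```

The retrieved principles would then be prepended to the prompt alongside the clustered task-level principles.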
Towards Verifiable Text Generation with Evolving Memory and Self-Reflection
Hao Sun
|
Hengyi Cai
|
Bo Wang
|
Yingyan Hou
|
Xiaochi Wei
|
Shuaiqiang Wang
|
Yan Zhang
|
Dawei Yin
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Despite the remarkable ability of large language models (LLMs) in language comprehension and generation, they often suffer from producing factually incorrect information, also known as hallucination. A promising solution to this issue is verifiable text generation, which prompts LLMs to generate content with citations for accuracy verification. However, verifiable text generation is non-trivial due to the focus-shifting phenomenon, the intricate reasoning needed to align the claim with correct citations, and the dilemma between the precision and breadth of retrieved documents. In this paper, we present VTG, an innovative framework for Verifiable Text Generation with evolving memory and self-reflection. VTG introduces evolving long short-term memory to retain both valuable documents and recent documents. A two-tier verifier equipped with an evidence finder is proposed to rethink and reflect on the relationship between the claim and citations. Furthermore, active retrieval and diverse query generation are utilized to enhance both the precision and breadth of the retrieved documents. We conduct extensive experiments on five datasets across three knowledge-intensive tasks and the results reveal that VTG significantly outperforms baselines.
Memorize Step by Step: Efficient Long-Context Prefilling with Incremental Memory and Decremental Chunk
Zhiyuan Zeng
|
Qipeng Guo
|
Xiaoran Liu
|
Zhangyue Yin
|
Wentao Shu
|
Mianqiu Huang
|
Bo Wang
|
Yunhua Zhou
|
Linlin Li
|
Qun Liu
|
Xipeng Qiu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
The evolution of Large Language Models (LLMs) has led to significant advancements, with models like Claude and Gemini capable of processing contexts up to 1 million tokens. However, efficiently handling long sequences remains challenging, particularly during the prefilling stage when input lengths exceed GPU memory capacity. Traditional methods often segment the sequence into chunks and compress them iteratively with fixed-size memory. However, our empirical analysis shows that the fixed-size memory results in wasted computational and GPU memory resources. Therefore, we introduce Incremental Memory (IM), a method that starts with a small memory size and gradually increases it, optimizing computational efficiency. Additionally, we propose Decremental Chunk based on Incremental Memory (IMDC), which reduces chunk size while increasing memory size, ensuring stable and lower GPU memory usage. Our experiments demonstrate that IMDC is consistently faster (1.45x) and reduces GPU memory consumption by 23.3% compared to fixed-size memory, achieving comparable performance on the LongBench Benchmark.
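The incremental-memory / decremental-chunk idea is essentially a schedule in which memory grows while chunk size shrinks so that their sum stays near a fixed token budget. The sketch below illustrates such a schedule with made-up sizes; the actual IMDC sizing rules are not reproduced here.

```python
def imdc_schedule(seq_len, budget=8192, start_mem=512, step=512):
    """Toy prefill plan: memory tokens grow each step while chunk tokens shrink,
    keeping memory + chunk roughly at the budget. Sizes are illustrative only."""
    plan, consumed, mem = [], 0, start_mem
    while consumed < seq_len:
        chunk = max(budget - mem, step)          # chunk shrinks as memory grows
        chunk = min(chunk, seq_len - consumed)   # do not overshoot the input
        plan.append({"memory_tokens": mem, "chunk_tokens": chunk})
        consumed += chunk
        mem = min(mem + step, budget - step)     # memory grows each iteration
    return plan

# e.g. imdc_schedule(20_000) yields three steps whose memory+chunk stays near 8192.
```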
SparkRA: A Retrieval-Augmented Knowledge Service System Based on Spark Large Language Model
Dayong Wu
|
Jiaqi Li
|
Baoxin Wang
|
Honghong Zhao
|
Siyuan Xue
|
Yanjie Yang
|
Zhijun Chang
|
Rui Zhang
|
Li Qian
|
Bo Wang
|
Shijin Wang
|
Zhixiong Zhang
|
Guoping Hu
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
Large language models (LLMs) have shown remarkable achievements across various language tasks. To enhance the performance of LLMs in scientific literature services, we developed the scientific literature LLM (SciLit-LLM) through pre-training and supervised fine-tuning on scientific literature, building upon the iFLYTEK Spark LLM. Furthermore, we present a knowledge service system Spark Research Assistant (SparkRA) based on our SciLit-LLM. SparkRA is accessible online and provides three primary functions: literature investigation, paper reading, and academic writing. As of July 30, 2024, SparkRA has garnered over 50,000 registered users, with a total usage count exceeding 1.3 million.
A + B: A General Generator-Reader Framework for Optimizing LLMs to Unleash Synergy Potential
Wei Tang
|
Yixin Cao
|
Jiahao Ying
|
Bo Wang
|
Yuyue Zhao
|
Yong Liao
|
Peng Zhou
Findings of the Association for Computational Linguistics: ACL 2024
Retrieval-Augmented Generation (RAG) is an effective solution to supplement necessary knowledge to large language models (LLMs). Targeting its bottleneck of retriever performance, the “generate-then-read” pipeline has been proposed to replace the retrieval stage with generation from the LLM itself. Although promising, this research direction is underexplored and still cannot work in scenarios where source knowledge is given. In this paper, we formalize a general “A + B” framework with varying combinations of foundation models and types for systematic investigation. We explore the efficacy of the base and chat versions of LLMs and find that their different functionalities are suitable for generator A and reader B, respectively. Their combinations consistently outperform single models, especially in complex scenarios. Furthermore, we extend the application of the “A + B” framework to scenarios involving source documents through continuous learning, enabling the direct integration of external knowledge into LLMs. This approach not only facilitates effective acquisition of new knowledge but also addresses the challenges of safety and helpfulness post-adaptation. The paper underscores the versatility of the “A + B” framework, demonstrating its potential to enhance the practical application of LLMs across various domains.
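A generate-then-read "A + B" combination can be sketched in a few lines: model A generates a background passage instead of retrieving one, and model B answers while reading that passage. The code below is an assumed minimal pipeline with placeholder callables and prompts, not the paper's exact setup or models.

```python
# Minimal generate-then-read sketch: base model A generates background knowledge,
# chat model B reads it to answer. `base_llm` and `chat_llm` are assumed callables.
def a_plus_b(question, base_llm, chat_llm):
    # A (generator): elicit a relevant background passage instead of retrieving one.
    background = base_llm(
        "Write a short background passage containing the facts needed to answer "
        f"the question.\nQuestion: {question}\nPassage:"
    )
    # B (reader): answer strictly from the generated passage.
    return chat_llm(
        f"Passage: {background}\nQuestion: {question}\n"
        "Answer using only information from the passage."
    )
```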
Reinforcement Tuning for Detecting Stances and Debunking Rumors Jointly with Large Language Models
Ruichao Yang
|
Wei Gao
|
Jing Ma
|
Hongzhan Lin
|
Bo Wang
Findings of the Association for Computational Linguistics: ACL 2024
Learning multi-task models for jointly detecting stance and verifying rumors poses challenges due to the need for training data of stance at post level and rumor veracity at claim level, which are difficult to obtain. To address this issue, we leverage large language models (LLMs) as the foundation annotators for the joint stance detection (SD) and rumor verification (RV) tasks, dubbed as JSDRV. We introduce a novel reinforcement tuning framework to enhance the joint predictive capabilities of LLM-based SD and RV components. Specifically, we devise a policy for selecting LLM-annotated data at the two levels, employing a hybrid reward mechanism to choose high-quality labels for effective LLM fine-tuning on both tasks. Results demonstrate that JSDRV improves the capabilities of LLMs in the joint tasks, not only outperforming state-of-the-art methods but also generalizing to non-LLMs accommodated as task models.
QRMeM: Unleash the Length Limitation through Question then Reflection Memory Mechanism
Bo Wang
|
Heyan Huang
|
Yixin Cao
|
Jiahao Ying
|
Wei Tang
|
Chong Feng
Findings of the Association for Computational Linguistics: EMNLP 2024
While LLMs have made notable advancements in natural language processing, they continue to struggle with processing extensive text. Memory mechanisms offer a flexible solution for managing long contexts, utilizing techniques such as compression, summarization, and structuring to facilitate nuanced and efficient handling of large volumes of text. However, existing techniques face challenges with static knowledge integration, leading to insufficient adaptation to task-specific needs and missing multi-segmentation relationships, which hinders the dynamic reorganization and logical combination of relevant segments during the response process. To address these issues, we introduce a novel strategy, Question then Reflection Memory Mechanism (QRMeM), which incorporates a dual-structured memory pool. This pool synergizes static textual content with structured graph guidance, fostering a reflective trial-and-error approach for navigating and identifying relevant segments. Our evaluation across multiple-choice questions (MCQ) and multi-document question answering (Multi-doc QA) benchmarks showcases QRMeM’s enhanced performance compared to existing approaches.
LLMs-as-Instructors: Learning from Errors Toward Automating Model Improvement
Jiahao Ying
|
Mingbao Lin
|
Yixin Cao
|
Wei Tang
|
Bo Wang
|
Qianru Sun
|
Xuanjing Huang
|
Shuicheng Yan
Findings of the Association for Computational Linguistics: EMNLP 2024
This paper introduces the innovative “LLMs-as-Instructors” framework, which leverages the advanced Large Language Models (LLMs) to autonomously enhance the training of smaller target models. Inspired by the theory of “Learning from Errors”, this framework employs an instructor LLM to meticulously analyze the specific errors within a target model, facilitating targeted and efficient training cycles. Within this framework, we implement two strategies: “Learning from Error,” which focuses solely on incorrect responses to tailor training data, and “Learning from Error by Contrast,” which uses contrastive learning to analyze both correct and incorrect responses for a deeper understanding of errors. Our empirical studies, conducted with several open-source models, demonstrate significant improvements across multiple benchmarks, including mathematical reasoning, coding abilities, and factual knowledge. Notably, the refined Llama-3-8b-Instruction has outperformed ChatGPT, illustrating the effectiveness of our approach. By leveraging the strengths of both strategies, we have attained a more balanced performance improvement on both in-domain and out-of-domain benchmarks.
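The "Learning from Error" strategy described above can be pictured as a simple data-construction loop: run the target model over an evaluation set, keep the items it gets wrong, and have the instructor model turn each error into new training material. The sketch below is an assumed illustration (placeholder callables and prompt), not the framework's actual pipeline.

```python
# Rough sketch of error-driven training-data construction in the spirit of
# "Learning from Error"; `target_model` and `instructor_llm` are assumed callables.
def build_training_data(eval_set, target_model, instructor_llm):
    new_examples = []
    for question, gold in eval_set:
        prediction = target_model(question)
        if prediction.strip() != gold.strip():   # an error for the instructor to analyze
            analysis = instructor_llm(
                f"Question: {question}\nGold answer: {gold}\n"
                f"Student answer: {prediction}\n"
                "Explain the student's error and write one new practice question "
                "with a worked solution that targets this weakness."
            )
            new_examples.append({"question": question, "instruction": analysis})
    return new_examples
```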
A Comparative Study of Explicit and Implicit Gender Biases in Large Language Models via Self-evaluation
Yachao Zhao
|
Bo Wang
|
Yan Wang
|
Dongming Zhao
|
Xiaojia Jin
|
Jijun Zhang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
While extensive work has examined the explicit and implicit biases in large language models (LLMs), little research explores the relation between these two types of biases. This paper presents a comparative study of the explicit and implicit biases in LLMs grounded in social psychology. Social psychology distinguishes between explicit and implicit biases by whether the bias can be self-recognized by individuals. Aligning with this conceptualization, we propose a self-evaluation-based two-stage measurement of explicit and implicit biases within LLMs. First, the LLM is prompted to automatically fill templates with social targets to measure implicit bias toward these targets, where the bias is less likely to be self-recognized by the LLM. Then, the LLM is prompted to self-evaluate the templates filled by itself to measure explicit bias toward the same targets, where the bias is more likely to be self-recognized by the LLM. Experiments conducted on state-of-the-art LLMs reveal human-like inconsistency between explicit and implicit occupational gender biases. This work bridges a critical gap where prior studies concentrate solely on either explicit or implicit bias. We advocate that future work highlight the relation between explicit and implicit biases in LLMs.
Continuous Relational Diffusion Driven Topic Model with Multi-grained Text for Microblog
Chenhao Wu
|
Ruifang He
|
Chang Liu
|
Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
A topic model is a statistical model that leverages unsupervised learning to mine hidden topics in document collections. The data sparsity and colloquialism of social texts make it difficult to accurately mine the topics. Traditional methods assume that there are only 0/1-state relationships between two parties in social networks, but the relationship status in real life is more complicated, such as continuously changing relationships with different degrees of intimacy. This paper proposes a continuous relational diffusion driven topic model (CRTM) with multi-grained text for microblogs to realize the continuous representation of the relationship state and make up for the context and structural information lost by previous representation methods. Multi-grained text representation learning further distinguishes the impact of formal and informal expression on the topics and alleviates colloquialism problems. Specifically, based on the original social network, the reconstructed social network with continuous relationship status is obtained by using information diffusion technology. The graph convolution model is utilized to learn node embeddings through the new social network. Finally, neural variational inference is applied to generate topics according to continuous relationships. We validate CRTM on three real datasets, and the experimental results show the effectiveness of the scheme.
Emotion Recognition in Conversation via Dynamic Personality
Yan Wang
|
Bo Wang
|
Yachao Zhao
|
Dongming Zhao
|
Xiaojia Jin
|
Jijun Zhang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Emotion recognition in conversation (ERC) is a field that aims to classify the emotion of each utterance within conversational contexts. This presents significant challenges, particularly in handling emotional ambiguity across various speakers and contextual factors. Existing ERC approaches have primarily focused on modeling conversational contexts while incorporating only superficial speaker attributes such as names, memories, and interactions. Recent works introduce personality as an essential deep speaker factor for emotion recognition, but rely on static personality, overlooking dynamic variability during conversations. Advances in personality psychology conceptualize personality as dynamic, proposing that personality states can change across situations. In this paper, we introduce ERC-DP, a novel model considering the dynamic personality of speakers during conversations. ERC-DP accounts for past utterances from the same speaker as the situation impacting dynamic personality. It combines personality modeling with prompt design and fine-grained classification modules. Through a series of comprehensive experiments, ERC-DP demonstrates superior performance on three benchmark conversational datasets.
Global and Local Hierarchical Prompt Tuning Framework for Multi-level Implicit Discourse Relation Recognition
Lei Zeng
|
Ruifang He
|
Haowen Sun
|
Jing Xu
|
Chang Liu
|
Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Multi-level implicit discourse relation recognition (MIDRR) is a challenging task to recognize the hierarchical discourse relations between arguments in the absence of connectives. Recent methods tend to incorporate the static hierarchical structure containing all senses (defined as global hierarchy) into prompt tuning through a path prompt template or hierarchical label refining. However, hierarchical modeling is independent of the verbalizer, resulting in a failure to effectively utilize the output probability distribution information of the verbalizer. Besides, they ignore the utilization of the dynamic hierarchical label sequence for each instance (defined as local hierarchy) in prompt tuning. In this paper, we propose a global and local hierarchical prompt tuning (GLHPT) framework, which utilizes prior knowledge of PLMs while better incorporating hierarchical information from two aspects. We leverage bottom-up propagated probability as the global hierarchy to inject it into the multi-level verbalizer (MLV). Furthermore, we design a local hierarchy-driven contrastive learning (LHCL) to improve the probability distribution of the MLV. Finally, our model achieves competitive results on two benchmarks.
Representation Degeneration Problem in Prompt-based Models for Natural Language Understanding
Qingyan Zhao
|
Ruifang He
|
Jinpeng Zhang
|
Chang Liu
|
Bo Wang
Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)
Prompt-based fine-tuning (PF), by aligning with the training objective of pre-trained language models (PLMs), has shown improved performance on many few-shot natural language understanding (NLU) benchmarks. However, the word embedding space of PLMs exhibits anisotropy, which is called the representation degeneration problem. In this paper, we explore the self-similarity within the same context and identify the anisotropy of the feature embedding space in the PF model. Given that the performance of PF models is dependent on feature embeddings, we pose the hypothesis that this anisotropy limits the performance of the PF models. Based on our experimental findings, we propose CLMA, a Contrastive Learning framework based on the [MASK] token and Answers, to alleviate the anisotropy in the embedding space. By combining our proposed counter-intuitive SSD, a Supervised Signal based on embedding Distance, our approach outperforms mainstream methods on many NLU benchmarks in few-shot experimental settings. In subsequent experiments, we analyze the capability of our method to capture deep semantic cues and the impact of the anisotropy in the feature embedding space on the performance of the PF model.
Jina-ColBERT-v2: A General-Purpose Multilingual Late Interaction Retriever
Rohan Jha
|
Bo Wang
|
Michael Günther
|
Georgios Mastrapas
|
Saba Sturua
|
Isabelle Mohr
|
Andreas Koukounas
|
Mohammad Kalim Wang
|
Nan Wang
|
Han Xiao
Proceedings of the Fourth Workshop on Multilingual Representation Learning (MRL 2024)
Multi-vector dense models, such as ColBERT, have proven highly effective in information retrieval. ColBERT’s late interaction scoring approximates the joint query-document attention seen in cross-encoders while maintaining inference efficiency closer to traditional dense retrieval models, thanks to its bi-encoder architecture and recent optimizations in indexing and search. In this paper, we introduce a novel architecture and a training framework to support long context windows and multilingual retrieval. Leveraging Matryoshka Representation Loss, we further demonstrate that reducing the embedding dimensionality from 128 to 64 has an insignificant impact on the model’s retrieval performance while cutting storage requirements by up to 50%. Our new model, Jina-ColBERT-v2, demonstrates strong performance across a range of English and multilingual retrieval tasks.
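Late interaction scoring, as referenced above, is the standard ColBERT-style MaxSim: every query token vector is matched to its most similar document token vector and the maxima are summed. The sketch below shows that computation in PyTorch under the assumption of pre-computed, L2-normalized token embeddings; it is a generic illustration, not the Jina-ColBERT-v2 code.

```python
import torch

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late-interaction relevance score.
    Shapes: query_vecs (Lq, d), doc_vecs (Ld, d); vectors assumed L2-normalized."""
    sim = query_vecs @ doc_vecs.T          # (Lq, Ld) token-level similarities
    return sim.max(dim=1).values.sum()     # MaxSim per query token, then sum

# Toy usage with random normalized token embeddings (d = 64 as in the reduced setting).
q = torch.nn.functional.normalize(torch.randn(12, 64), dim=-1)
d = torch.nn.functional.normalize(torch.randn(180, 64), dim=-1)
score = maxsim_score(q, d)
```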
2023
Facilitating Multi-turn Emotional Support Conversation with Positive Emotion Elicitation: A Reinforcement Learning Approach
Jinfeng Zhou
|
Zhuang Chen
|
Bo Wang
|
Minlie Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Emotional support conversation (ESC) aims to provide emotional support (ES) to improve one’s mental state. Existing works focus on fitting grounded responses and responding strategies (e.g., questioning), ignoring the effect on ES and lacking explicit goals to guide positive emotional transitions. To this end, we introduce a new paradigm to formalize multi-turn ESC as a process of positive emotion elicitation. Addressing this task requires finely adjusting the elicitation intensity in ES as the conversation progresses while maintaining conversational goals like coherence. In this paper, we propose Supporter, a mixture-of-expert-based reinforcement learning model, and carefully design ES and dialogue coherence rewards to guide the policy’s learning for responding. Experiments verify the superiority of Supporter in achieving positive emotion elicitation during responding while maintaining conversational goals including coherence.
Enhancing Personalized Dialogue Generation with Contrastive Latent Variables: Combining Sparse and Dense Persona
Yihong Tang
|
Bo Wang
|
Miao Fang
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The personalized dialogue explores the consistent relationship between dialogue generation and personality. Existing personalized dialogue agents model persona profiles from three resources: sparse or dense persona descriptions and dialogue histories. However, sparse structured persona attributes are explicit but uninformative, dense persona texts contain rich persona descriptions with much noise, and dialogue history query is both noisy and uninformative for persona modeling. In this work, we combine the advantages of the three resources to obtain a richer and more accurate persona. We design a Contrastive Latent Variable-based model (CLV) that clusters the dense persona descriptions into sparse categories, which are combined with the history query to generate personalized responses. Experimental results on Chinese and English datasets demonstrate our model’s superiority in personalization.
CASE: Aligning Coarse-to-Fine Cognition and Affection for Empathetic Response Generation
Jinfeng Zhou
|
Chujie Zheng
|
Bo Wang
|
Zheng Zhang
|
Minlie Huang
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Empathetic conversation is psychologically supposed to be the result of conscious alignment and interaction between the cognition and affection of empathy. However, existing empathetic dialogue models usually consider only the affective aspect or treat cognition and affection in isolation, which limits the capability of empathetic response generation. In this work, we propose the CASE model for empathetic dialogue generation. It first builds upon a commonsense cognition graph and an emotional concept graph and then aligns the user’s cognition and affection at both the coarse-grained and fine-grained levels. Through automatic and manual evaluation, we demonstrate that CASE outperforms state-of-the-art baselines of empathetic dialogues and can generate more empathetic and informative responses.
WangLab at MEDIQA-Chat 2023: Clinical Note Generation from Doctor-Patient Conversations using Large Language Models
John Giorgi
|
Augustin Toma
|
Ronald Xie
|
Sondra Chen
|
Kevin An
|
Grace Zheng
|
Bo Wang
Proceedings of the 5th Clinical Natural Language Processing Workshop
This paper describes our submission to the MEDIQA-Chat 2023 shared task for automatic clinical note generation from doctor-patient conversations. We report results for two approaches: the first fine-tunes a pre-trained language model (PLM) on the shared task data, and the second uses few-shot in-context learning (ICL) with a large language model (LLM). Both achieve high performance as measured by automatic metrics (e.g. ROUGE, BERTScore) and ranked second and first, respectively, of all submissions to the shared task. Expert human scrutiny indicates that notes generated via the ICL-based approach with GPT-4 are preferred about as often as human-written notes, making it a promising path toward automated note generation from doctor-patient conversations.
MTGP: Multi-turn Target-oriented Dialogue Guided by Generative Global Path with Flexible Turns
Anqi Liu
|
Bo Wang
|
Yue Tan
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Findings of the Association for Computational Linguistics: ACL 2023
Target-oriented dialogue guides the dialogue to a target quickly and smoothly. The latest approaches focus on global planning, which plans toward the target before the conversation instead of adopting a greedy strategy during the conversation. However, the global plan in existing works is fixed to certain turns by generating paths with certain nodes, which limits the optimization of turns and coherence of the target-oriented process. Toward flexible global planning, we propose to generate a global path as a natural language sentence instead of a sequence of nodes. With this path, the dialog is guided to the target in a flexible number of turns. For model training, we also extract target-oriented dialogues from the chit-chat corpus with a knowledge graph. We conduct experiments on three datasets and simulate scenarios with and without user participation. The results show that our method uses fewer turns, produces more coherent semantics, and achieves a higher success rate in reaching the target than baselines.
Guiding Dialogue Agents to Complex Semantic Targets by Dynamically Completing Knowledge Graph
Yue Tan
|
Bo Wang
|
Anqi Liu
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Findings of the Association for Computational Linguistics: ACL 2023
In the target-oriented dialogue, the representation and achievement of targets are two interrelated essential issues. In current approaches, the target is typically supposed to be a single object represented as a word, which makes it relatively easy to achieve the target through dialogue with the help of a knowledge graph (KG). However, when the target has complex semantics, the existing knowledge graph is often incomplete in tracking complex semantic relations. This paper studies target-oriented dialog where the target is a topic sentence. We combine the methods of knowledge retrieval and relationship prediction to construct a context-related dynamic KG. On dynamic KG, we can track the implicit semantic paths in the speaker’s mind that may not exist in the existing KGs. In addition, we also designed a novel metric to evaluate the tracked path automatically. The experimental results show that our method can control the agent more logically and smoothly toward the complex target.
Boosting Event Extraction with Denoised Structure-to-Text Augmentation
Bo Wang
|
Heyan Huang
|
Xiaochi Wei
|
Ge Shi
|
Xiao Liu
|
Chong Feng
|
Tong Zhou
|
Shuaiqiang Wang
|
Dawei Yin
Findings of the Association for Computational Linguistics: ACL 2023
Event extraction aims to recognize pre-defined event triggers and arguments from texts, a task that suffers from a lack of high-quality annotations. In many NLP applications, incorporating large-scale synthetic training data is a practical and effective approach to alleviate the problem of data scarcity. However, when applied to the task of event extraction, recent data augmentation methods often neglect the problems of grammatical incorrectness, structure misalignment, and semantic drifting, leading to unsatisfactory performance. In order to solve these problems, we propose a denoised structure-to-text augmentation framework for event extraction (DAEE), which generates additional training data through a knowledge-based structure-to-text generation model and iteratively selects the effective subset from the generated data with a deep reinforcement learning agent. Experimental results on several datasets demonstrate that the proposed method generates more diverse text representations for event extraction and achieves results comparable to the state-of-the-art.
Open Domain Multi-document Summarization: A Comprehensive Study of Model Brittleness under Retrieval
John Giorgi
|
Luca Soldaini
|
Bo Wang
|
Gary Bader
|
Kyle Lo
|
Lucy Wang
|
Arman Cohan
Findings of the Association for Computational Linguistics: EMNLP 2023
Multi-document summarization (MDS) assumes a set of topic-related documents are provided as input. In practice, this document set is not always available; it would need to be retrieved given an information need, i.e. a question or topic statement, a setting we dub “open-domain’ MDS. We study this more challenging setting by formalizing the task and bootstrapping it using existing datasets, retrievers and summarizers. Via extensive automatic and human evaluation, we determine: (1) state-of-the-art summarizers suffer large reductions in performance when applied to open-domain MDS, (2) additional training in the open-domain setting can reduce this sensitivity to imperfect retrieval, and (3) summarizers are insensitive to the retrieval of duplicate documents and the order of retrieved documents, but highly sensitive to other errors, like the retrieval of irrelevant documents. Based on our results, we provide practical guidelines to enable future work on open-domain MDS, e.g. how to choose the number of retrieved documents to summarize. Our results suggest that new retrieval and summarization methods and annotated resources for training and evaluation are necessary for further progress in the open-domain setting.
Jina Embeddings: A Novel Set of High-Performance Sentence Embedding Models
Michael Günther
|
Louis Milliken
|
Jonathan Geuter
|
Georgios Mastrapas
|
Bo Wang
|
Han Xiao
Proceedings of the 3rd Workshop for Natural Language Processing Open Source Software (NLP-OSS 2023)
Jina Embeddings constitutes a set of high-performance sentence embedding models adept at translating textual inputs into numerical representations, capturing the semantics of the text. These models excel in applications like dense retrieval and semantic textual similarity. This paper details the development of Jina Embeddings, starting with the creation of high-quality pairwise and triplet datasets. It underlines the crucial role of data cleaning in dataset preparation, offers in-depth insights into the model training process, and concludes with a comprehensive performance evaluation using the Massive Text Embedding Benchmark (MTEB). Furthermore, to increase the model’s awareness of grammatical negation, we construct a novel training and evaluation dataset of negated and non-negated statements, which we make publicly available to the community.
2022
Dynamic Prefix-Tuning for Generative Template-based Event Extraction
Xiao Liu
|
Heyan Huang
|
Ge Shi
|
Bo Wang
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We consider event extraction in a generative manner with template-based conditional generation. Although there is a rising trend of casting the task of event extraction as a sequence generation problem with prompts, these generation-based methods face two significant challenges, including using suboptimal prompts and static event type information. In this paper, we propose a generative template-based event extraction method with dynamic prefix (GTEE-DynPref) by integrating context information with type-specific prefixes to learn a context-specific prefix for each context. Experimental results show that our model achieves competitive results with the state-of-the-art classification-based model OneIE on ACE 2005 and achieves the best performances on ERE. Additionally, our model is proven to be portable to new types of events effectively.
pdf
bib
abs
Dataset Debt in Biomedical Language Modeling
Jason Fries
|
Natasha Seelam
|
Gabriel Altay
|
Leon Weber
|
Myungsun Kang
|
Debajyoti Datta
|
Ruisi Su
|
Samuele Garda
|
Bo Wang
|
Simon Ott
|
Matthias Samwald
|
Wojciech Kusa
Proceedings of BigScience Episode #5 -- Workshop on Challenges & Perspectives in Creating Large Language Models
Large-scale language modeling and natural language prompting have demonstrated exciting capabilities for few- and zero-shot learning in NLP. However, translating these successes to specialized domains such as biomedicine remains challenging, due in part to biomedical NLP’s significant dataset debt – the technical costs associated with data that are not consistently documented or easily incorporated into popular machine learning frameworks at scale. To assess this debt, we crowdsourced curation of datasheets for 167 biomedical datasets. We find that only 13% of datasets are available via programmatic access and 30% lack any documentation on licensing and permitted reuse. Our dataset catalog is available at https://tinyurl.com/bigbio22.
pdf
bib
abs
A sequence-to-sequence approach for document-level relation extraction
John Giorgi
|
Gary Bader
|
Bo Wang
Proceedings of the 21st Workshop on Biomedical Language Processing
Motivated by the fact that many relations cross the sentence boundary, there has been increasing interest in document-level relation extraction (DocRE). DocRE requires integrating information within and across sentences, capturing complex interactions between mentions of entities. Most existing methods are pipeline-based, requiring entities as input. However, jointly learning to extract entities and relations can improve performance and be more efficient due to shared parameters and training steps. In this paper, we develop a sequence-to-sequence approach, seq2rel, that can learn the subtasks of DocRE (entity extraction, coreference resolution and relation extraction) end-to-end, replacing a pipeline of task-specific components. Using a simple strategy we call entity hinting, we compare our approach to existing pipeline-based methods on several popular biomedical datasets, in some cases exceeding their performance. We also report the first end-to-end results on these datasets for future comparison. Finally, we demonstrate that, under our model, an end-to-end approach outperforms a pipeline-based approach. Our code, data and trained models are available at https://github.com/johngiorgi/seq2rel. An online demo is available at https://share.streamlit.io/johngiorgi/seq2rel/main/demo.py.
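To make the seq2seq formulation concrete, the snippet below shows one possible way to linearize document-level relations into a target string; the tag format and the grouping of coreferent mentions here are illustrative guesses, not the schema defined in the paper.

```python
# Illustrative only: a made-up linearization of document-level relations into a
# target string for a seq2seq model. The real seq2rel schema may differ.
def linearize(relations):
    """relations: list of (head_mentions, tail_mentions, relation_type) triples."""
    parts = []
    for heads, tails, rel in relations:
        head_str = " ; ".join(heads)   # coreferent mentions grouped together
        tail_str = " ; ".join(tails)
        parts.append(f"{head_str} @HEAD@ {tail_str} @TAIL@ @{rel}@")
    return " ".join(parts)

target = linearize([
    (["metformin"], ["type 2 diabetes", "T2D"], "TREATS"),
])
print(target)
# metformin @HEAD@ type 2 diabetes ; T2D @TAIL@ @TREATS@
```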
pdf
bib
abs
CR-GIS: Improving Conversational Recommendation via Goal-aware Interest Sequence Modeling
Jinfeng Zhou
|
Bo Wang
|
Zhitong Yang
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 29th International Conference on Computational Linguistics
Conversational recommendation systems (CRS) aim to determine a goal item by sequentially tracking users’ interests through multi-turn conversation. In CRS, implicit patterns in the user interest sequence guide the smooth transition of dialog utterances toward the goal item. However, given the convenient explicit knowledge in knowledge graphs (KGs), existing KG-based CRS methods over-rely on separate explicit KG links to model user interests and ignore the rich goal-aware implicit interest sequence patterns in a dialog. In addition, the interest sequence is not fully used to generate smoothly transitioned utterances. We propose CR-GIS with a parallel star framework. First, an interest-level star graph is designed to model the goal-aware implicit user interest sequence. Second, a hierarchical Star Transformer is designed to guide multi-turn utterance generation with the interest-level star graph. Extensive experiments verify the effectiveness of CR-GIS in recommending more accurate items with more fluent and coherent dialog utterances.
pdf
bib
abs
TopKG: Target-oriented Dialog via Global Planning on Knowledge Graph
Zhitong Yang
|
Bo Wang
|
Jinfeng Zhou
|
Yue Tan
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 29th International Conference on Computational Linguistics
Target-oriented dialog aims to reach a global target through multi-turn conversation. The key to the task is global planning towards the target, which flexibly guides the dialog with respect to the context. However, existing target-oriented dialog works take a local and greedy strategy for response generation, in which global planning is absent. In this work, we propose global planning for target-oriented dialog on a commonsense knowledge graph (KG). We design a global reinforcement learning method over the planned paths to flexibly adjust the local response generation model towards the global target. We also propose a KG-based method to collect target-oriented samples automatically from chit-chat corpora for model training. Experiments show that our method reaches the target with a higher success rate, fewer turns, and more coherent responses.
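The planning component can be pictured as path finding over a concept graph. The toy sketch below shows only that step, finding a path from a concept in the current dialog to the global target on a tiny hand-made graph with networkx; the paper's reinforcement learning planner and its KG are not reproduced here.

```python
# Toy global-planning step on a tiny commonsense-style KG (not the paper's RL planner):
# find a path from a concept mentioned in the dialog to the global target concept.
import networkx as nx

kg = nx.Graph()
kg.add_edges_from([
    ("weekend", "relax"), ("relax", "music"),
    ("music", "concert"), ("concert", "buy tickets"),
    ("weekend", "hiking"), ("hiking", "mountains"),
])

def plan(current_concept: str, target_concept: str):
    """Return a concept path to steer the dialog toward the target, or None."""
    try:
        return nx.shortest_path(kg, current_concept, target_concept)
    except nx.NetworkXNoPath:
        return None

print(plan("weekend", "buy tickets"))
# ['weekend', 'relax', 'music', 'concert', 'buy tickets']
```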
pdf
bib
abs
Multi-Attribute Controlled Text Generation with Contrastive-Generator and External-Discriminator
Guisheng Liu
|
Yi Li
|
Yanqing Guo
|
Xiangyang Luo
|
Bo Wang
Proceedings of the 29th International Conference on Computational Linguistics
Though existing research has achieved impressive results in controlled text generation, it focuses mainly on single-attribute control. However, in applications like automatic comments, the topic and sentiment need to be controlled simultaneously. In this work, we propose a new framework for multi-attribute controlled text generation. To achieve this, we design a contrastive-generator that can effectively generate texts with multiple attributes. To make the generated text converge on the desired attributes, we adopt an external-discriminator to distinguish whether the generated text holds the desired attributes. Moreover, we propose top-n weighted decoding to further improve the relevance of texts to attributes. Automated and human evaluations show that our framework achieves remarkable controllability in multi-attribute generation while keeping the text fluent and diverse. It also yields promising performance on zero-shot generation.
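Weighted decoding in general re-scores the language model's candidate next tokens with an external attribute signal. The sketch below shows that generic idea over the top-n candidates with a stand-in attribute scorer; it is not the paper's top-n weighted decoding or its external-discriminator, and the blending weight is arbitrary.

```python
# Generic top-n weighted decoding sketch: re-score the LM's top-n candidate tokens
# with an external attribute score. The scorer here is a stand-in, not the paper's
# external-discriminator.
import torch

def weighted_decode_step(lm_logits, attribute_score, n=10, alpha=2.0):
    """lm_logits: (vocab,) next-token logits; attribute_score(token_id) -> float in [0, 1]."""
    log_probs = torch.log_softmax(lm_logits, dim=-1)
    top = torch.topk(log_probs, n)
    scores = []
    for logp, tok in zip(top.values, top.indices):
        scores.append(logp + alpha * attribute_score(int(tok)))  # blend fluency and attribute
    best = int(torch.stack(scores).argmax())
    return int(top.indices[best])

vocab_size = 100
fake_logits = torch.randn(vocab_size)
# Stand-in scorer: pretend even token ids carry the desired attribute.
next_token = weighted_decode_step(fake_logits, lambda t: 1.0 if t % 2 == 0 else 0.0)
print(next_token)
```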
pdf
bib
abs
Aligning Recommendation and Conversation via Dual Imitation
Jinfeng Zhou
|
Bo Wang
|
Minlie Huang
|
Dongming Zhao
|
Kun Huang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Human recommendation conversations naturally involve shifts of interest, which can align the recommendation actions with the conversation process to make accurate recommendations with rich explanations. However, existing conversational recommendation systems (CRS) ignore the advantage of user interest shifts in connecting recommendation and conversation, which leads to an ineffective, loosely coupled CRS structure. To address this issue, by modeling the recommendation actions as recommendation paths in a knowledge graph (KG), we propose DICR (Dual Imitation for Conversational Recommendation), which designs a dual imitation to explicitly align the recommendation paths and user interest shift paths in a recommendation module and a conversation module, respectively. By exchanging alignment signals, DICR achieves bidirectional promotion between the recommendation and conversation modules and generates high-quality responses with accurate recommendations and coherent explanations. Experiments demonstrate that DICR outperforms state-of-the-art models on recommendation and conversation performance with automatic, human, and novel explainability metrics.
pdf
bib
abs
CodeExp: Explanatory Code Document Generation
Haotian Cui
|
Chenglong Wang
|
Junjie Huang
|
Jeevana Priya Inala
|
Todd Mytkowicz
|
Bo Wang
|
Jianfeng Gao
|
Nan Duan
Findings of the Association for Computational Linguistics: EMNLP 2022
Developing models that can automatically generate detailed code explanations can greatly benefit software maintenance and programming education. However, existing code-to-text generation models often produce only high-level summaries of code that do not capture implementation-level choices essential for these scenarios. To fill in this gap, we propose the code explanation generation task. We first conducted a human study to identify the criteria for high-quality explanatory docstrings for code. Based on that, we collected and refined a large-scale code docstring corpus and formulated automatic evaluation metrics that best match human assessments. Finally, we present a multi-stage fine-tuning strategy and baseline models for the task. Our experiments show that (1) our refined training dataset lets models achieve better performance in the explanation generation tasks compared to larger-scale unrefined data (15x larger), and (2) fine-tuned models can generate well-structured long docstrings comparable to human-written ones. We envision that our training dataset, human-evaluation protocol, recommended metrics, and fine-tuning strategy will boost future code explanation research. The code and annotated data are available at https://github.com/subercui/CodeExp.
pdf
bib
abs
Template-based Abstractive Microblog Opinion Summarization
Iman Munire Bilal
|
Bo Wang
|
Adam Tsakalidis
|
Dong Nguyen
|
Rob Procter
|
Maria Liakata
Transactions of the Association for Computational Linguistics, Volume 10
We introduce the task of microblog opinion summarization (MOS) and share a dataset of 3100 gold-standard opinion summaries to facilitate research in this domain. The dataset contains summaries of tweets spanning a 2-year period and covers more topics than any other public Twitter summarization dataset. Summaries are abstractive in nature and have been created by journalists skilled in summarizing news articles following a template separating factual information (main story) from author opinions. Our method differs from previous work on generating gold-standard summaries from social media, which usually involves selecting representative posts and thus favors extractive summarization models. To showcase the dataset’s utility and challenges, we benchmark a range of abstractive and extractive state-of-the-art summarization models and achieve good performance, with the former outperforming the latter. We also show that fine-tuning is necessary to improve performance and investigate the benefits of using different sample sizes.
2021
pdf
bib
abs
DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations
John Giorgi
|
Osvald Nitski
|
Bo Wang
|
Gary Bader
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Sentence embeddings are an important component of many natural language processing (NLP) systems. Like word embeddings, sentence embeddings are typically learned on large text corpora and then transferred to various downstream tasks, such as clustering and retrieval. Unlike word embeddings, the highest-performing solutions for learning sentence embeddings require labelled data, limiting their usefulness to languages and domains where labelled data is abundant. In this paper, we present DeCLUTR: Deep Contrastive Learning for Unsupervised Textual Representations. Inspired by recent advances in deep metric learning (DML), we carefully design a self-supervised objective for learning universal sentence embeddings that does not require labelled training data. When used to extend the pretraining of transformer-based language models, our approach closes the performance gap between unsupervised and supervised pretraining for universal sentence encoders. Importantly, our experiments suggest that the quality of the learned embeddings scales with both the number of trainable parameters and the amount of unlabelled training data. Our code and pretrained models are publicly available and can be easily adapted to new domains or used to embed unseen text.
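Since the objective is contrastive, a minimal InfoNCE-style loss over anchor and positive span embeddings with in-batch negatives conveys the flavor of such self-supervised training; the sketch below is generic and does not reproduce DeCLUTR's span sampling or exact objective.

```python
# Minimal InfoNCE-style contrastive loss with in-batch negatives, as a sketch of the
# kind of self-supervised objective used for learning sentence embeddings.
import torch
import torch.nn.functional as F

def info_nce(anchors: torch.Tensor, positives: torch.Tensor, temperature: float = 0.05):
    """anchors, positives: (batch, dim) embeddings; row i of each forms a positive pair."""
    anchors = F.normalize(anchors, dim=-1)
    positives = F.normalize(positives, dim=-1)
    logits = anchors @ positives.t() / temperature      # (batch, batch) similarities
    labels = torch.arange(anchors.size(0))              # diagonal entries are positives
    return F.cross_entropy(logits, labels)

a = torch.randn(8, 128)   # stand-ins for encoded anchor spans
p = torch.randn(8, 128)   # stand-ins for encoded positive spans
print(float(info_nce(a, p)))
```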
pdf
bib
abs
Evaluation of Thematic Coherence in Microblogs
Iman Munire Bilal
|
Bo Wang
|
Maria Liakata
|
Rob Procter
|
Adam Tsakalidis
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers)
Collecting together microblogs representing opinions about the same topics within the same timeframe is useful to a number of different tasks and practitioners. A major question is how to evaluate the quality of such thematic clusters. Here we create a corpus of microblog clusters from three different domains and time windows and define the task of evaluating thematic coherence. We provide annotation guidelines and human annotations of thematic coherence by journalist experts. We subsequently investigate the efficacy of different automated evaluation metrics for the task. We consider a range of metrics including surface level metrics, ones for topic model coherence and text generation metrics (TGMs). While surface level metrics perform well, outperforming topic coherence metrics, they are not as consistent as TGMs. TGMs are more reliable than all other metrics considered for capturing thematic coherence in microblog clusters due to being less sensitive to the effect of time windows.
pdf
bib
abs
CRFR: Improving Conversational Recommender Systems via Flexible Fragments Reasoning on Knowledge Graphs
Jinfeng Zhou
|
Bo Wang
|
Ruifang He
|
Yuexian Hou
Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing
Although paths of user interest shifts in knowledge graphs (KGs) can benefit conversational recommender systems (CRS), explicit reasoning on KGs has not been well considered in CRS due to the complexity of high-order and incomplete paths. We propose CRFR, which effectively performs explicit multi-hop reasoning on KGs with a conversational context-based reinforcement learning model. Considering the incompleteness of KGs, instead of learning a single complete reasoning path, CRFR flexibly learns multiple reasoning fragments that are likely contained in the complete paths of interest shifts. A fragments-aware unified model is then designed to fuse the fragment information from item-oriented and concept-oriented KGs to enhance the CRS response with entities and words from the fragments. Extensive experiments demonstrate CRFR’s state-of-the-art performance on recommendation, conversation, and conversation interpretability.
pdf
bib
abs
Eliminating Sentiment Bias for Aspect-Level Sentiment Classification with Unsupervised Opinion Extraction
Bo Wang
|
Tao Shen
|
Guodong Long
|
Tianyi Zhou
|
Yi Chang
Findings of the Association for Computational Linguistics: EMNLP 2021
Aspect-level sentiment classification (ALSC) aims at identifying the sentiment polarity of a specified aspect in a sentence. ALSC is a practical setting in aspect-based sentiment analysis because it requires no opinion term labeling, but it fails to interpret why a sentiment polarity is derived for the aspect. To address this problem, recent works fine-tune pre-trained Transformer encoders for ALSC to extract an aspect-centric dependency tree that can locate the opinion words. However, the induced opinion words provide only an intuitive cue far below human-level interpretability. Besides, the pre-trained encoder tends to internalize an aspect’s intrinsic sentiment, causing sentiment bias and thus affecting model performance. In this paper, we propose a span-based anti-bias aspect representation learning framework. It first eliminates the sentiment bias in the aspect embedding by adversarial learning against aspects’ prior sentiment. Then, it aligns the distilled opinion candidates with the aspect by span-based dependency modeling to highlight the interpretable opinion terms. Our method achieves new state-of-the-art performance on five benchmarks, with the capability of unsupervised opinion extraction.
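Adversarially removing an attribute from a representation is often implemented with a gradient reversal layer. The sketch below shows that generic mechanism applied to stand-in aspect embeddings and a sentiment adversary; it illustrates the idea, not the paper's exact anti-bias framework.

```python
# Generic gradient-reversal layer: the forward pass is the identity, the backward pass
# flips gradients, so a classifier trained on top pushes the encoder to *remove* the
# attribute (here, an aspect's prior sentiment). Not the paper's exact framework.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lambd: float):
        ctx.lambd = lambd
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -ctx.lambd * grad_output, None

def grad_reverse(x, lambd: float = 1.0):
    return GradReverse.apply(x, lambd)

aspect_repr = torch.randn(4, 64, requires_grad=True)   # stand-in aspect embeddings
sentiment_head = torch.nn.Linear(64, 3)                # adversary predicting prior sentiment
loss = torch.nn.functional.cross_entropy(
    sentiment_head(grad_reverse(aspect_repr)), torch.tensor([0, 1, 2, 1])
)
loss.backward()                                        # gradients into aspect_repr are reversed
print(aspect_repr.grad.shape)
```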
2020
pdf
bib
abs
Information Extraction from Swedish Medical Prescriptions with Sig-Transformer Encoder
John Pougué Biyong
|
Bo Wang
|
Terry Lyons
|
Alejo Nevado-Holgado
Proceedings of the 3rd Clinical Natural Language Processing Workshop
Relying on large pretrained language models such as Bidirectional Encoder Representations from Transformers (BERT) for encoding, and adding a simple prediction layer, has led to impressive performance in many clinical natural language processing (NLP) tasks. In this work, we present a novel extension to the Transformer architecture by incorporating the signature transform with the self-attention model. This architecture is added between the embedding and prediction layers. Experiments on new Swedish prescription data show the proposed architecture to be superior to baseline models on two of the three information extraction tasks. Finally, we compare two embedding approaches: applying Multilingual BERT directly, and translating the Swedish text to English and then encoding it with a BERT model pretrained on clinical notes.
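For readers unfamiliar with the signature transform, it summarizes a multivariate path by its iterated integrals. The sketch below computes a depth-2 signature in plain numpy (total increments plus a left-point discretization of the second-level iterated integrals); dedicated signature libraries exist, and the Sig-Transformer encoder itself is more involved.

```python
# Hedged sketch: depth-2 path signature of a multivariate time series in plain numpy.
# Level 1 = total increments; level 2 = left-point discretization of the iterated
# integrals S[i, j] = integral of (X_i(t) - X_i(0)) dX_j(t). This only illustrates
# the kind of features a signature-based encoder consumes.
import numpy as np

def signature_depth2(path: np.ndarray):
    """path: (T, d) array of a d-dimensional series observed at T steps."""
    centered = path - path[0]                 # X(t) - X(0)
    increments = np.diff(path, axis=0)        # dX(t), shape (T-1, d)
    level1 = centered[-1]                     # (d,)
    level2 = centered[:-1].T @ increments     # (d, d), left-point Riemann sum
    return level1, level2

series = np.cumsum(np.random.randn(50, 3), axis=0)   # toy 3-dimensional path
lvl1, lvl2 = signature_depth2(series)
print(lvl1.shape, lvl2.shape)   # (3,) (3, 3)
```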
2019
pdf
bib
abs
DeepGeneMD: A Joint Deep Learning Model for Extracting Gene Mutation-Disease Knowledge from PubMed Literature
Feifan Liu
|
Xiaoyu Zheng
|
Bo Wang
|
Catarina Kiefe
Proceedings of the 5th Workshop on BioNLP Open Shared Tasks
Understanding the pathogenesis of genetic diseases through different gene activities and their relations to relevant diseases is important for new drug discovery and drug repositioning. In this paper, we present a joint deep learning model in a multi-task learning paradigm for gene mutation-disease knowledge extraction, DeepGeneMD, which adapts the state-of-the-art hierarchical multi-task learning framework for joint inference on named entity recognition (NER) and relation extraction (RE) in the context of the AGAC (Active Gene Annotation Corpus) track at the 2019 BioNLP Open Shared Tasks (BioNLP-OST). It simultaneously extracts gene mutation-related activities, diseases, and their relations from the published scientific literature. In DeepGeneMD, we explore task decomposition to create auxiliary subtasks so that more interactions between different learning subtasks can be leveraged in model training. Our model achieves an average F1 score of 0.45 on recognizing gene activities and disease entities, ranking 2nd in the AGAC NER task, and an average F1 score of 0.35 on extracting relations, ranking 1st in the AGAC RE task.
2018
pdf
bib
abs
OpenNMT System Description for WNMT 2018: 800 words/sec on a single-core CPU
Jean Senellart
|
Dakun Zhang
|
Bo Wang
|
Guillaume Klein
|
Jean-Pierre Ramatchandirin
|
Josep Crego
|
Alexander Rush
Proceedings of the 2nd Workshop on Neural Machine Translation and Generation
We present a system description of the OpenNMT Neural Machine Translation entry for the WNMT 2018 evaluation. In this work, we developed a heavily optimized NMT inference model targeting a high-performance CPU system. The final system uses a combination of four techniques, all of which lead to significant speed-ups in combination: (a) sequence distillation, (b) architecture modifications, (c) precomputation, particularly of vocabulary, and (d) CPU-targeted quantization. This work achieves the fastest performance of the shared task and led to the development of new features that have been integrated into OpenNMT and are available to the community.
2017
pdf
bib
abs
TDParse: Multi-target-specific sentiment recognition on Twitter
Bo Wang
|
Maria Liakata
|
Arkaitz Zubiaga
|
Rob Procter
Proceedings of the 15th Conference of the European Chapter of the Association for Computational Linguistics: Volume 1, Long Papers
Existing target-specific sentiment recognition methods consider only a single target per tweet and have been shown to miss nearly half of the actual targets mentioned. We present a corpus of UK election tweets, with an average of 3.09 entities per tweet and more than one type of sentiment in half of the tweets. This requires a method for multi-target-specific sentiment recognition, which we develop by using the context around a target as well as syntactic dependencies involving the target. We present results of our method on both a benchmark corpus of single targets and the multi-target election corpus, showing state-of-the-art performance in both corpora and outperforming previous approaches to the multi-target sentiment task as well as deep learning models for single-target sentiment.
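The target-dependent context idea can be illustrated by pooling word embeddings separately for the left context, right context, and full tweet around each target. The sketch below does only that, with a deterministic stand-in for pretrained word vectors and without the syntactic-dependency contexts that TDParse also exploits.

```python
# Toy sketch of target-dependent context pooling for multi-target sentiment:
# pool word embeddings for the left context, right context, and full tweet around
# each target. The embedding function is a deterministic stand-in, and the
# dependency-based contexts used by TDParse are omitted here.
import hashlib
import numpy as np

DIM = 50

def embed(word: str) -> np.ndarray:
    """Deterministic stand-in for pretrained word vectors."""
    seed = int(hashlib.md5(word.lower().encode()).hexdigest(), 16) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def pool(words):
    return np.mean([embed(w) for w in words], axis=0) if words else np.zeros(DIM)

def target_features(tokens, target_index):
    left, right = tokens[:target_index], tokens[target_index + 1:]
    return np.concatenate([pool(left), pool(right), pool(tokens)])   # (3 * DIM,)

tweet = "the NHS needs funding but the government ignores it".split()
features_nhs = target_features(tweet, tweet.index("NHS"))
features_gov = target_features(tweet, tweet.index("government"))
print(features_nhs.shape, features_gov.shape)   # (150,) (150,)
```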
pdf
bib
abs
TOTEMSS: Topic-based, Temporal Sentiment Summarisation for Twitter
Bo Wang
|
Maria Liakata
|
Adam Tsakalidis
|
Spiros Georgakopoulos Kolaitis
|
Symeon Papadopoulos
|
Lazaros Apostolidis
|
Arkaitz Zubiaga
|
Rob Procter
|
Yiannis Kompatsiaris
Proceedings of the IJCNLP 2017, System Demonstrations
We present a system for time sensitive, topic based summarisation of the sentiment around target entities and topics in collections of tweets. We describe the main elements of the system and illustrate its functionality with two examples of sentiment analysis of topics related to the 2017 UK general election.
pdf
bib
SYSTRAN Purely Neural MT Engines for WMT2017
Yongchao Deng
|
Jungi Kim
|
Guillaume Klein
|
Catherine Kobus
|
Natalia Segal
|
Christophe Servan
|
Bo Wang
|
Dakun Zhang
|
Josep Crego
|
Jean Senellart
Proceedings of the Second Conference on Machine Translation
2015
pdf
bib
WarwickDCS: From Phrase-Based to Target-Specific Sentiment Recognition
Richard Townsend
|
Adam Tsakalidis
|
Yiwei Zhou
|
Bo Wang
|
Maria Liakata
|
Arkaitz Zubiaga
|
Alexandra Cristea
|
Rob Procter
Proceedings of the 9th International Workshop on Semantic Evaluation (SemEval 2015)
2010
pdf
bib
All in Strings: a Powerful String-based Automatic MT Evaluation Metric with Multiple Granularities
Junguo Zhu
|
Muyun Yang
|
Bo Wang
|
Sheng Li
|
Tiejun Zhao
Coling 2010: Posters
2009
pdf
bib
A Statistical Machine Translation Model Based on a Synthetic Synchronous Grammar
Hongfei Jiang
|
Muyun Yang
|
Tiejun Zhao
|
Sheng Li
|
Bo Wang
Proceedings of the ACL-IJCNLP 2009 Conference Short Papers
pdf
bib
References Extension for the Automatic Evaluation of MT by Syntactic Hybridization
Bo Wang
|
Tiejun Zhao
|
Muyun Yang
|
Sheng Li
Proceedings of the Third Workshop on Syntax and Structure in Statistical Translation (SSST-3) at NAACL HLT 2009
2008
pdf
bib
Diagnostic Evaluation of Machine Translation Systems Using Automatically Constructed Linguistic Check-Points
Ming Zhou
|
Bo Wang
|
Shujie Liu
|
Mu Li
|
Dongdong Zhang
|
Tiejun Zhao
Proceedings of the 22nd International Conference on Computational Linguistics (Coling 2008)
pdf
bib
Bootstrapping Both Product Features and Opinion Words from Chinese Customer Reviews with Cross-Inducing
Bo Wang
|
Houfeng Wang
Proceedings of the Third International Joint Conference on Natural Language Processing: Volume-I