Fanghua Ye
2026
Too Good to be Bad: On the Failure of LLMs to Role-Play Villains
Zihao Yi | Qingxuan Jiang | Ruotian Ma | Xingyu Chen | Qu Yang | Mengru Wang | Fanghua Ye | Ying Shen | Zhaopeng Tu | Xiaolong Li | Liefeng Bo
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly tasked with creative generation, including the simulation of fictional characters. However, their ability to portray non-prosocial, antagonistic personas remains largely unexamined. We hypothesize that the safety alignment of modern LLMs creates a fundamental conflict with the task of authentically role-playing morally ambiguous or villainous characters. To investigate this, we introduce the Moral RolePlay benchmark, a new dataset featuring a four-level moral alignment scale and a balanced test set for rigorous evaluation. We task state-of-the-art LLMs with role-playing characters from moral paragons to pure villains. Our large-scale evaluation reveals a consistent, monotonic decline in role-playing fidelity as character morality decreases. We find that models struggle most with traits directly antithetical to safety principles, such as “Deceitful” and “Manipulative”, often substituting nuanced malevolence with superficial aggression. Furthermore, we demonstrate that general chatbot proficiency is a poor predictor of villain role-playing ability, with highly safety-aligned models performing particularly poorly. Our work provides the first systematic evidence of this critical limitation, highlighting a key tension between model safety and creative fidelity. Our benchmark and findings pave the way for developing more nuanced, context-aware alignment methods.
Context-Fidelity Boosting: Enhancing Faithful Generation through Watermark-Inspired Decoding
Weixu Zhang | Fanghua Ye | Qiang Gao | Jian Li | Haolun Wu | Yuxing Tian | Sijing Duan | Nan Du | Xiaolong Li
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) often produce content that contradicts or overlooks information provided in the input context, a phenomenon known as faithfulness hallucination. In this paper, we propose Context-Fidelity Boosting (CFB), a lightweight and general decoding-time framework that effectively reduces such hallucinations by boosting the generation probability of context-relevant tokens. Motivated by logit-shaping principles in watermarking techniques, CFB leverages token-level logit adjustments based on their presence or salience in the input context. Specifically, we develop three boosting strategies (static, context-aware, and token-aware) that progressively incorporate distributional divergence, attention scores, and semantic similarity. Notably, CFB requires no retraining or architectural changes, making it compatible with a wide range of LLMs. Experiments on summarization and question answering tasks across multiple open-source LLMs show that CFB consistently improves faithfulness metrics, with minimal generation overhead. Our implementation is fully open-sourced.
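The idea of boosting context tokens at decoding time can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the constant boost `delta`, and the rule "boost every vocabulary id that occurs in the context" are assumptions chosen to mirror only the simplest (static) variant described in the abstract.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def boost_context_logits(logits, context_token_ids, delta=2.0):
    """Add a constant boost (hypothetical `delta`) to every token id
    that appears in the input context; other logits are unchanged."""
    context = set(context_token_ids)
    return [x + delta if i in context else x for i, x in enumerate(logits)]

# Toy vocabulary of 4 tokens with equal logits; token 2 occurs in the context.
base_logits = [1.0, 1.0, 1.0, 1.0]
base_probs = softmax(base_logits)
boosted_probs = softmax(boost_context_logits(base_logits, [2]))
```

In this toy case the probability mass shifts toward token 2 while the model weights stay untouched, which is the sense in which such a scheme needs no retraining.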
Social Welfare Function Leaderboard: On the Emergence of LLM Agents as the Welfare Dictator
Zhengliang Shi | Ruotian Ma | Jen-tse Huang | Xinbei Ma | Xingyu Chen | Mengru Wang | Qu Yang | Yue Wang | Fanghua Ye | Ziyang Chen | Shanyi Wang | Cixing LI | Wenxuan Wang | Zhaopeng Tu | Xiaolong Li | Zhaochun Ren | Liefeng Bo
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) are increasingly entrusted with high-stakes decisions that affect human welfare. However, the principles and values that guide these models when distributing scarce societal resources remain largely unexamined. To address this, we introduce the Social Welfare Function (SWF) Benchmark, a dynamic simulation environment in which an LLM acts as a dictator, distributing tasks to heterogeneous recipients with different returns on investment (ROI). The benchmark is designed to create a dilemma between maximizing collective efficiency (i.e., overall ROI) and ensuring distributive fairness (measured by the Gini coefficient). We evaluate 20 state-of-the-art LLMs. Our findings reveal several key insights, including: (i) LLMs’ general ability, as measured by popular Arena leaderboards, is misaligned with their allocation skills; (ii) most LLMs exhibit a strong default utilitarian orientation, prioritizing overall productivity at the cost of greater inequality; and (iii) allocation behaviors are highly manipulable, easily perturbed by common persuasion strategies. These results highlight the risks of deploying current LLMs as societal decision-makers and underscore the need for specialized benchmarks and alignment for AI governance.
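The efficiency-versus-fairness dilemma can be made concrete with a small sketch. The Gini coefficient below uses its standard mean-absolute-difference definition; applying it to the allocated amounts, and scoring efficiency as allocation times ROI, are assumptions for illustration rather than the benchmark's documented scoring.

```python
def total_return(allocation, roi):
    """Collective efficiency: sum of allocated units times each recipient's ROI."""
    return sum(a * r for a, r in zip(allocation, roi))

def gini(values):
    """Gini coefficient: sum_i sum_j |x_i - x_j| / (2 * n * sum(x)).
    Returns 0.0 for perfect equality and approaches 1 as inequality grows."""
    n = len(values)
    total = sum(values)
    if n == 0 or total == 0:
        return 0.0
    diff_sum = sum(abs(x - y) for x in values for y in values)
    return diff_sum / (2 * n * total)

roi = [1.0, 2.0, 3.0]        # hypothetical recipients
utilitarian = [0, 0, 6]      # everything to the highest-ROI recipient
egalitarian = [2, 2, 2]      # equal split
```

With these toy numbers the utilitarian split yields a higher total return but a Gini coefficient of 2/3, while the equal split yields a lower return and a Gini coefficient of 0.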
Sentient Agent as a Judge: Evaluating Higher-Order Social Cognition in Large Language Models
Bang Zhang | Ruotian Ma | Qingxuan Jiang | Peisong Wang | Jiaqi Chen | Zheng Xie | Xingyu Chen | Yue Wang | Fanghua Ye | Jian Li | Yifan Yang | Zhaopeng Tu | Xiaolong Li
Findings of the Association for Computational Linguistics: ACL 2026
Assessing how well a large language model (LLM) understands humans, rather than merely text, remains an open challenge. To bridge the gap, we introduce Sentient Agent as a Judge (SAGE), an automated evaluation framework that measures an LLM’s higher-order social cognition. SAGE instantiates a “Sentient Agent” – an LLM-powered agent that simulates human-like emotional changes and inner thoughts to provide a more realistic evaluation of the tested model in multi-turn conversations. At every turn, the agent reasons about (i) how its emotion changes, (ii) how it feels, and (iii) how it should reply, yielding a numerical emotion trajectory and interpretable inner thoughts. Experiments on 100 supportive-dialogue scenarios show that the final Sentient emotion score correlates strongly with Barrett-Lennard Relationship Inventory (BLRI) ratings and utterance-level empathy metrics, validating psychological fidelity. Human evaluation further demonstrates 85.3% consistency between the agent’s emotional reasoning and human judgments. We also build a public Sentient Leaderboard covering 18 commercial and open-source models that uncovers substantial gaps (up to 4×) between frontier systems (GPT-4o-Latest, Gemini2.5-Pro) and earlier baselines, gaps not reflected in conventional leaderboards (e.g. Arena). SAGE thus provides a principled, scalable, and interpretable tool for tracking progress toward genuinely empathetic and socially adept language agents.
BatonVoice: An Operationalist Framework for Enhancing Controllable Speech Synthesis with Linguistic Intelligence from LLMs
Yue Wang | Ruotian Ma | Xingyu Chen | Zhengliang Shi | Morunliu Yang | Wanshun Chen | Huang Liu | Jiadi Yao | Xin He | Qu Yang | Qingxuan Jiang | Fanghua Ye | Juntao Li | Zhaopeng Tu | Xiaolong Li | Liefeng Bo | Min Zhang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The rise of Large Language Models (LLMs) is reshaping multimodal models, with speech synthesis being a prominent application. However, existing approaches often underutilize the linguistic intelligence of these models, typically failing to leverage their powerful instruction-following capabilities. This limitation hinders the model’s ability to follow text instructions for controllable Text-to-Speech (TTS). To address this, we propose a new paradigm inspired by operationalism that decouples instruction understanding from speech generation. We introduce BatonVoice, a framework where an LLM acts as a conductor, understanding user instructions and generating a textual plan – explicit vocal features (e.g., pitch, energy). A separate TTS model, the orchestra, then generates the speech from these features. To realize this component, we develop BatonTTS, a TTS model trained specifically for this task. Our experiments demonstrate that BatonVoice achieves strong performance in controllable and emotional speech synthesis, outperforming strong open- and closed-source baselines. Notably, our approach enables remarkable zero-shot cross-lingual generalization, accurately applying feature control abilities to languages unseen during post-training. This demonstrates that objectifying speech into textual vocal features can more effectively unlock the linguistic intelligence of LLMs.
2025
Salute the Classic: Revisiting Challenges of Machine Translation in the Age of Large Language Models
Jianhui Pang | Fanghua Ye | Derek Fai Wong | Dian Yu | Shuming Shi | Zhaopeng Tu | Longyue Wang
Transactions of the Association for Computational Linguistics, Volume 13
The evolution of Neural Machine Translation (NMT) has been significantly influenced by six core challenges (Koehn and Knowles, 2017) that have acted as benchmarks for progress in this field. This study revisits these challenges, offering insights into their ongoing relevance in the context of advanced Large Language Models (LLMs): domain mismatch, amount of parallel data, rare word prediction, translation of long sentences, attention model as word alignment, and sub-optimal beam search. Our empirical findings show that LLMs effectively reduce reliance on parallel data for major languages during pretraining and significantly improve translation of long sentences containing approximately 80 words, even translating documents up to 512 words. Despite these improvements, challenges in domain mismatch and rare word prediction persist. While NMT-specific challenges like word alignment and beam search may not apply to LLMs, we identify three new challenges in LLM-based translation: inference efficiency, translation of low-resource languages during pretraining, and human-aligned evaluation.
Conversational Education at Scale: A Multi-LLM Agent Workflow for Procedural Learning and Pedagogic Quality Assessment
Jiahuan Pei | Fanghua Ye | Xin Sun | Wentao Deng | Koen Hindriks | Junxiao Wang
Findings of the Association for Computational Linguistics: EMNLP 2025
Large language models (LLMs) have advanced virtual educators and learners, bridging NLP with AI4Education. Existing work often lacks scalability and fails to leverage diverse, large-scale course content, with limited frameworks for assessing pedagogic quality. To this end, we propose WikiHowAgent, a multi-agent workflow leveraging LLMs to simulate interactive teaching-learning conversations. It integrates teacher and learner agents, an interaction manager, and an evaluator to facilitate procedural learning and assess pedagogic quality. We introduce a dataset of 114,296 teacher-learner conversations grounded in 14,287 tutorials across 17 domains and 727 topics. Our evaluation protocol combines computational and rubric-based metrics with human judgment alignment. Results demonstrate the workflow’s effectiveness in diverse setups, offering insights into LLM capabilities across domains. Our datasets and implementations are fully open-sourced.
UNComp: Can Matrix Entropy Uncover Sparsity? — A Compressor Design from an Uncertainty-Aware Perspective
Jing Xiong | Jianghan Shen | Fanghua Ye | Chaofan Tao | Zhongwei Wan | Jianqiao Lu | Xun Wu | Chuanyang Zheng | Zhijiang Guo | Min Yang | Lingpeng Kong | Ngai Wong
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Deploying large language models (LLMs) for long-context inference remains challenging due to their substantial memory and computational demands. While techniques such as Key-Value (KV) cache compression are designed to reduce memory usage, they often neglect the structured sparsity inherent in the relationship between hidden states and their corresponding KV cache. In this work, we explore the role of uncertainty as a potential indicator of sparsity within LLMs. We propose UNComp, an uncertainty-aware framework that leverages truncated matrix entropy to identify areas of low information content, thereby revealing sparsity patterns that can be used for adaptive compression. Unlike traditional methods that apply uniform compression, UNComp dynamically adjusts its approach to compression, guided by uncertainty measures that reflect the importance of various model components. Our analysis shows that sparsity patterns, when derived from uncertainty estimates, can be exploited to reveal special long-range dependencies, such as retrieval heads and retrieval layers. This perspective not only enhances our understanding of how compression can be optimized but also provides new insights into the inherent sparsity of LLMs during long-context inference. By focusing on uncertainty to analyze the sparsity pattern in detail, UNComp reduces the KV cache size to 4.74% of the original, achieves a 6% prefill speedup, and improves throughput by 6.4× — not only delivering strong lossless compression performance, but also validating the effectiveness of the underlying theoretical tool. Our codes are submitted with the paper.
CogDual: Enhancing Dual Cognition of LLMs via Reinforcement Learning with Implicit Rule-Based Rewards
Cheng Liu | Yifei Lu | Fanghua Ye | Jian Li | Xingyu Chen | Feiliang Ren | Zhaopeng Tu | Xiaolong Li
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Role-Playing Language Agents (RPLAs) have emerged as a significant application direction for Large Language Models (LLMs). Existing approaches typically rely on prompt engineering or supervised fine-tuning to enable models to imitate character behaviors in specific scenarios, but often neglect the underlying cognitive mechanisms driving these behaviors. Inspired by cognitive psychology, we introduce CogDual, a novel RPLA adopting a cognize-then-respond reasoning paradigm. By jointly modeling external situational awareness and internal self-awareness, CogDual generates responses with improved character consistency and contextual alignment. To further optimize the performance, we employ reinforcement learning with two general-purpose reward schemes designed for open-domain text generation. Extensive experiments on the CoSER benchmark, as well as Cross-MR and LifeChoice, demonstrate that CogDual consistently outperforms existing baselines and generalizes effectively across diverse role-playing tasks.
CodeTool: Enhancing Programmatic Tool Invocation of LLMs via Process Supervision
Yifei Lu | Fanghua Ye | Jian Li | Qiang Gao | Cheng Liu | Haibo Luo | Nan Du | Xiaolong Li | Feiliang Ren
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tool invocation significantly enhances the capabilities of Large Language Models (LLMs), yet challenges persist, particularly in complex task scenarios. Current methods, such as instruction-enhanced reasoning and supervised fine-tuning, often result in unnecessarily long reasoning paths and face difficulties in verifying the correctness of intermediate steps. In this paper, we propose CodeTool, a novel framework for stepwise code generation that improves LLM tool invocation by leveraging the concise and easily verifiable nature of code. CodeTool incorporates two distinct process rewards: the On-the-spot Reward, which provides immediate feedback on the accuracy of each tool invocation, and the Latent Reward, which assesses the contribution of each step toward overall task completion. By maximizing the cumulative reward of the On-the-spot and Latent Rewards at each step, LLMs are guided to follow efficient and accurate reasoning paths. Extensive experiments on StableToolBench and RestBench-TMDB demonstrate the superiority of CodeTool over existing approaches.
Understanding Large Language Model Vulnerabilities to Social Bias Attacks
Jiaxu Zhao | Meng Fang | Fanghua Ye | Ke Xu | Qin Zhang | Joey Tianyi Zhou | Mykola Pechenizkiy
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) have become foundational in human-computer interaction, demonstrating remarkable linguistic capabilities across various tasks. However, there is a growing concern about their potential to perpetuate social biases present in their training data. In this paper, we comprehensively investigate the vulnerabilities of contemporary LLMs to various social bias attacks, including prefix injection, refusal suppression, and learned attack prompts. We evaluate popular models such as LLaMA-2, GPT-3.5, and GPT-4 across gender, racial, and religious bias types. Our findings reveal that models are generally more susceptible to gender bias attacks compared to racial or religious biases. We also explore novel aspects such as cross-bias and multiple-bias attacks, finding varying degrees of transferability across bias types. Additionally, our results show that larger models and pretrained base models often exhibit higher susceptibility to bias attacks. These insights contribute to the development of more inclusive and ethically responsible LLMs, emphasizing the importance of understanding and mitigating potential bias vulnerabilities. We offer recommendations for model developers and users to enhance the robustness of LLMs against social bias attacks.
2024
Anchor-based Large Language Models
Jianhui Pang | Fanghua Ye | Derek Wong | Xin He | Wanshun Chen | Longyue Wang
Findings of the Association for Computational Linguistics: ACL 2024
Large language models (LLMs) predominantly employ decoder-only transformer architectures, necessitating the retention of keys/values information for historical tokens to provide contextual information and avoid redundant computation. However, the substantial size and parameter volume of these LLMs require massive GPU memory. This memory demand increases with the length of the input text, leading to an urgent need for more efficient methods of information storage and processing. This study introduces Anchor-based LLMs (AnLLMs), which utilize an innovative anchor-based self-attention network (AnSAN) and also an anchor-based inference strategy. This approach enables LLMs to compress sequence information into an anchor token, reducing the keys/values cache and enhancing inference efficiency. Experiments on question-answering benchmarks reveal that AnLLMs maintain similar accuracy levels while achieving up to 99% keys/values cache reduction and up to 3.5 times faster inference. Despite a minor compromise in accuracy, the substantial enhancements of AnLLMs employing the AnSAN technique in resource utilization and computational efficiency underscore their potential for practical LLM applications.
Autonomous Workflow for Multimodal Fine-Grained Training Assistants Towards Mixed Reality
Jiahuan Pei | Irene Viola | Haochen Huang | Junxiao Wang | Moonisa Ahsan | Fanghua Ye | Jiang Yiming | Yao Sai | Di Wang | Zhumin Chen | Pengjie Ren | Pablo Cesar
Findings of the Association for Computational Linguistics: ACL 2024
Autonomous artificial intelligence (AI) agents have emerged as promising protocols for automatically understanding the language-based environment, particularly with the exponential development of large language models (LLMs). However, a fine-grained, comprehensive understanding of multimodal environments remains under-explored. This work designs an autonomous workflow tailored for integrating AI agents seamlessly into extended reality (XR) applications for fine-grained training. We present a demonstration of a multimodal fine-grained training assistant for LEGO brick assembly in a pilot XR environment. Specifically, we design a cerebral language agent that integrates LLM with memory, planning, and interaction with XR tools and a vision-language agent, enabling agents to decide their actions based on past experiences. Furthermore, we introduce LEGO-MRTA, a multimodal fine-grained assembly dialogue dataset synthesized automatically in the workflow served by a commercial LLM. This dataset comprises multimodal instruction manuals, conversations, XR responses, and vision question answering. Last, we present several prevailing open-resource LLMs as benchmarks, assessing their performance with and without fine-tuning on the proposed dataset. We anticipate that the broader impact of this workflow will advance the development of smarter assistants for seamless user interaction in XR environments, fostering research in both AI and HCI communities.
Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism
Anhao Zhao | Fanghua Ye | Jinlan Fu | Xiaoyu Shen
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) exhibit remarkable in-context learning (ICL) capabilities. However, the underlying working mechanism of ICL remains poorly understood. Recent research presents two conflicting views on ICL: One emphasizes the impact of similar examples in the demonstrations, stressing the need for label correctness and more shots. The other attributes it to LLMs’ inherent ability of task recognition, deeming label correctness and shot numbers of demonstrations as not crucial. In this work, we provide a Two-Dimensional Coordinate System that unifies both views into a systematic framework. The framework explains the behavior of ICL through two orthogonal variables: whether similar examples are presented in the demonstrations (perception) and whether LLMs can recognize the task (cognition). We propose the peak inverse rank metric to detect the task recognition ability of LLMs and study LLMs’ reactions to different definitions of similarity. Based on these, we conduct extensive experiments to elucidate how ICL functions across each quadrant on multiple representative classification tasks. Finally, we extend our analyses to generation tasks, showing that our coordinate system can also be used to interpret ICL for generation tasks effectively.
2023
Enhancing Conversational Search: Large Language Model-Aided Informative Query Rewriting
Fanghua Ye | Meng Fang | Shenghui Li | Emine Yilmaz
Findings of the Association for Computational Linguistics: EMNLP 2023
Query rewriting plays a vital role in enhancing conversational search by transforming context-dependent user queries into standalone forms. Existing approaches primarily leverage human-rewritten queries as labels to train query rewriting models. However, human rewrites may lack sufficient information for optimal retrieval performance. To overcome this limitation, we propose utilizing large language models (LLMs) as query rewriters, enabling the generation of informative query rewrites through well-designed instructions. We define four essential properties for well-formed rewrites and incorporate all of them into the instruction. In addition, we introduce the role of rewrite editors for LLMs when initial query rewrites are available, forming a “rewrite-then-edit” process. Furthermore, we propose distilling the rewriting capabilities of LLMs into smaller models to reduce rewriting latency. Our experimental evaluation on the QReCC dataset demonstrates that informative query rewrites can yield substantially improved retrieval performance compared to human rewrites, especially with sparse retrievers.
Turn-Level Active Learning for Dialogue State Tracking
Zihan Zhang | Meng Fang | Fanghua Ye | Ling Chen | Mohammad-Reza Namazi-Rad
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
Dialogue state tracking (DST) plays an important role in task-oriented dialogue systems. However, collecting a large amount of turn-by-turn annotated dialogue data is costly and inefficient. In this paper, we propose a novel turn-level active learning framework for DST to actively select turns in dialogues to annotate. Given the limited labelling budget, experimental results demonstrate the effectiveness of selective annotation of dialogue turns. Additionally, our approach can effectively achieve comparable DST performance to traditional training approaches with significantly less annotated data, which provides a more efficient way to annotate new dialogue data.
Modeling User Satisfaction Dynamics in Dialogue via Hawkes Process
Fanghua Ye | Zhiyuan Hu | Emine Yilmaz
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dialogue systems have received increasing attention, yet automatically evaluating their performance remains challenging. User satisfaction estimation (USE) has been proposed as an alternative. It assumes that the performance of a dialogue system can be measured by user satisfaction and uses an estimator to simulate users. The effectiveness of USE depends heavily on the estimator. Existing estimators independently predict user satisfaction at each turn and ignore satisfaction dynamics across turns within a dialogue. In order to fully simulate users, it is crucial to take satisfaction dynamics into account. To fill this gap, we propose a new estimator ASAP (sAtisfaction eStimation via HAwkes Process) that treats user satisfaction across turns as an event sequence and employs a Hawkes process to effectively model the dynamics in this sequence. Experimental results on four benchmark dialogue datasets demonstrate that ASAP can substantially outperform state-of-the-art baseline estimators.
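For readers unfamiliar with Hawkes processes, the sketch below shows the classical univariate intensity with an exponential kernel, in which each past event temporarily raises the rate of future events. The abstract does not give ASAP's parameterization, so the parameter names `mu`, `alpha`, and `beta`, and the use of turn indices as event times, are assumptions for illustration only.

```python
import math

def hawkes_intensity(t, event_times, mu=0.1, alpha=0.5, beta=1.0):
    """Classical Hawkes intensity at time t:
    mu + sum over past events t_i < t of alpha * exp(-beta * (t - t_i))."""
    return mu + sum(
        alpha * math.exp(-beta * (t - ti)) for ti in event_times if ti < t
    )

# Hypothetical dialogue: satisfaction-relevant events at turns 1 and 2.
events = [1.0, 2.0]
just_after = hawkes_intensity(2.1, events)
much_later = hawkes_intensity(10.0, events)
```

The influence of earlier events decays with time, so the intensity shortly after turn 2 is higher than many turns later, which is the kind of cross-turn dependency that per-turn independent estimators ignore.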
2022
MultiWOZ 2.4: A Multi-Domain Task-Oriented Dialogue Dataset with Essential Annotation Corrections to Improve State Tracking Evaluation
Fanghua Ye | Jarana Manotumruksa | Emine Yilmaz
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
The MultiWOZ 2.0 dataset has greatly stimulated the research of task-oriented dialogue systems. However, its state annotations contain substantial noise, which hinders a proper evaluation of model performance. To address this issue, massive efforts were devoted to correcting the annotations. Three improved versions (i.e., MultiWOZ 2.1-2.3) have then been released. Nonetheless, there are still plenty of incorrect and inconsistent annotations. This work introduces MultiWOZ 2.4, which refines the annotations in the validation set and test set of MultiWOZ 2.1. The annotations in the training set remain unchanged (same as MultiWOZ 2.1) to elicit robust and noise-resilient model training. We benchmark eight state-of-the-art dialogue state tracking models on MultiWOZ 2.4. All of them demonstrate much higher performance than on MultiWOZ 2.1.
ASSIST: Towards Label Noise-Robust Dialogue State Tracking
Fanghua Ye | Yue Feng | Emine Yilmaz
Findings of the Association for Computational Linguistics: ACL 2022
The MultiWOZ 2.0 dataset has greatly boosted the research on dialogue state tracking (DST). However, substantial noise has been discovered in its state annotations. Such noise brings about huge challenges for training DST models robustly. Although several refined versions, including MultiWOZ 2.1-2.4, have been published recently, there are still lots of noisy labels, especially in the training set. Besides, it is costly to rectify all the problematic annotations. In this paper, instead of improving the annotation quality further, we propose a general framework, named ASSIST (lAbel noiSe-robuSt dIalogue State Tracking), to train DST models robustly from noisy labels. ASSIST first generates pseudo labels for each sample in the training set by using an auxiliary model trained on a small clean dataset, then puts the generated pseudo labels and vanilla noisy labels together to train the primary model. We show the validity of ASSIST theoretically. Experimental results also demonstrate that ASSIST improves the joint goal accuracy of DST by up to 28.16% on MultiWOZ 2.0 and 8.41% on MultiWOZ 2.4, compared to using only the vanilla noisy labels.
MetaASSIST: Robust Dialogue State Tracking with Meta Learning
Fanghua Ye | Xi Wang | Jie Huang | Shenghui Li | Samuel Stern | Emine Yilmaz
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Existing dialogue datasets contain lots of noise in their state annotations. Such noise can hurt model training and ultimately lead to poor generalization performance. A general framework named ASSIST has recently been proposed to train robust dialogue state tracking (DST) models. It introduces an auxiliary model to generate pseudo labels for the noisy training set. These pseudo labels are combined with vanilla labels by a common fixed weighting parameter to train the primary DST model. Notwithstanding the improvements of ASSIST on DST, tuning the weighting parameter is challenging. Moreover, a single parameter shared by all slots and all instances may be suboptimal. To overcome these limitations, we propose a meta learning-based framework MetaASSIST to adaptively learn the weighting parameter. Specifically, we propose three schemes with varying degrees of flexibility, ranging from slot-wise to both slot-wise and instance-wise, to convert the weighting parameter into learnable functions. These functions are trained in a meta-learning manner by taking the validation set as meta data. Experimental results demonstrate that all three schemes can achieve competitive performance. Most impressively, we achieve a state-of-the-art joint goal accuracy of 80.10% on MultiWOZ 2.4.
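The fixed-weight combination of pseudo labels and vanilla labels that MetaASSIST sets out to replace can be sketched as a weighted sum of two cross-entropy terms. This is an illustrative reading of the abstract, not the authors' code: the function names, the single weight `alpha`, and the three-way toy slot are assumptions.

```python
import math

def cross_entropy(probs, target):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target) if t > 0)

def combined_loss(probs, noisy_target, pseudo_target, alpha=0.5):
    """Weighted sum of the losses against pseudo labels and vanilla (noisy)
    labels; `alpha` is the hypothetical fixed weighting parameter."""
    return (alpha * cross_entropy(probs, pseudo_target)
            + (1 - alpha) * cross_entropy(probs, noisy_target))

# Toy slot with three candidate values; the noisy label and pseudo label disagree.
probs = [0.7, 0.2, 0.1]
noisy = [0.0, 1.0, 0.0]
pseudo = [1.0, 0.0, 0.0]
loss = combined_loss(probs, noisy, pseudo, alpha=0.5)
```

A single `alpha` treats every slot and instance alike; replacing it with a per-slot or per-instance function is the kind of flexibility the abstract describes.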
Dynamic Schema Graph Fusion Network for Multi-Domain Dialogue State Tracking
Yue Feng | Aldo Lipani | Fanghua Ye | Qiang Zhang | Emine Yilmaz
Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Dialogue State Tracking (DST) aims to keep track of users’ intentions during the course of a conversation. In DST, modelling the relations among domains and slots is still an under-studied problem. Existing approaches that have considered such relations generally fall short in: (1) fusing prior slot-domain membership relations and dialogue-aware dynamic slot relations explicitly, and (2) generalizing to unseen domains. To address these issues, we propose a novel Dynamic Schema Graph Fusion Network (DSGFNet), which generates a dynamic schema graph to explicitly fuse the prior slot-domain membership relations and dialogue-aware dynamic slot relations. It also uses the schemata to facilitate knowledge transfer to new domains. DSGFNet consists of a dialogue utterance encoder, a schema graph encoder, a dialogue-aware schema graph evolving network, and a schema graph enhanced dialogue state decoder. Empirical results on benchmark datasets (i.e., SGD, MultiWOZ2.1, and MultiWOZ2.2) show that DSGFNet outperforms existing methods.
2020
Unsupervised Few-Bits Semantic Hashing with Implicit Topics Modeling
Fanghua Ye | Jarana Manotumruksa | Emine Yilmaz
Findings of the Association for Computational Linguistics: EMNLP 2020
Semantic hashing is a powerful paradigm for representing texts as compact binary hash codes. The explosion of short text data has spurred the demand for few-bits hashing. However, the performance of existing semantic hashing methods cannot be guaranteed when applied to few-bits hashing because of severe information loss. In this paper, we present a simple but effective unsupervised neural generative semantic hashing method with a focus on few-bits hashing. Our model is built upon a variational autoencoder and represents each hash bit as a Bernoulli variable, which allows the model to be end-to-end trainable. To address the issue of information loss, we introduce a set of auxiliary implicit topic vectors. With the aid of these topic vectors, the generated hash codes are not only low-dimensional representations of the original texts but also capture their implicit topics. We conduct comprehensive experiments on four datasets. The results demonstrate that our approach achieves significant improvements over state-of-the-art semantic hashing methods in few-bits hashing.
Co-authors
- Xiaolong Li 7
- Emine Yilmaz 7
- Zhaopeng Tu 6
- Xingyu Chen 5
- Jian Li 4
- Ruotian Ma 4
- Liefeng Bo 3
- Meng Fang 3
- Qingxuan Jiang 3
- Yue Wang 3
- Qu Yang 3
- Wanshun Chen 2
- Nan Du 2
- Yue Feng 2
- Qiang Gao 2
- Xin He 2
- Shenghui Li 2
- Cheng Liu 2
- Yifei Lu 2
- Jarana Manotumruksa 2
- Jianhui Pang 2
- Jiahuan Pei 2
- Feiliang Ren 2
- Zhengliang Shi 2
- Junxiao Wang 2
- Longyue Wang 2
- Mengru Wang 2
- Moonisa Ahsan 1
- Pablo Cesar 1
- Jiaqi Chen 1
- Ling Chen 1
- Zhumin Chen 1
- Ziyang Chen 1
- Wentao Deng 1
- Sijing Duan 1
- Jinlan Fu 1
- Zhijiang Guo 1
- Koen Hindriks 1
- Zhiyuan Hu 1
- Haochen Huang 1
- Jen-tse Huang 1
- Jie Huang 1
- Lingpeng Kong 1
- Cixing LI 1
- Juntao Li 1
- Aldo Lipani 1
- Huang Liu 1
- Jianqiao Lu 1
- Haibo Luo 1
- Xinbei Ma 1
- Mohammad-Reza Namazi-Rad 1
- Mykola Pechenizkiy 1
- Pengjie Ren 1
- Zhaochun Ren 1
- Yao Sai 1
- Jianghan Shen 1
- Xiaoyu Shen 1
- Ying Shen 1
- Shuming Shi 1
- Samuel Stern 1
- Xin Sun 1
- Chaofan Tao 1
- Yuxing Tian 1
- Irene Viola 1
- Zhongwei Wan 1
- Di Wang 1
- Peisong Wang 1
- Shanyi Wang 1
- Wenxuan Wang 1
- Xi Wang 1
- Derek F. Wong (黄辉) 1
- Derek Fai Wong 1
- Ngai Wong 1
- Haolun Wu 1
- Xun Wu 1
- Zheng Xie 1
- Jing Xiong 1
- Ke Xu 1
- Min Yang 1
- Morunliu Yang 1
- Yifan Yang 1
- Jiadi Yao 1
- Zihao Yi 1
- Jiang Yiming 1
- Dian Yu 1
- Bang Zhang 1
- Min Zhang 1
- Qiang Zhang 1
- Qin Zhang 1
- Weixu Zhang 1
- Zihan Zhang 1
- Anhao Zhao 1
- Jiaxu Zhao 1
- Chuanyang Zheng 1
- Joey Tianyi Zhou 1