Xue Liu
2026
RubricBench: Aligning Model-Generated Rubrics with Human Standards
Junyi Zhou | Qiyuan Zhang | Yufei Wang | Fuyuan Lyu | Yidong Ming | Can Xu | Qingfeng Sun | Kai Zheng | Peng Kang | Xue Liu | Chen Ma
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
As Large Language Model (LLM) alignment evolves from simple completions to complex, highly sophisticated generation, Reward Models are increasingly shifting toward rubric-guided evaluation to mitigate surface-level biases. However, the community lacks a unified benchmark to assess this evaluation paradigm, as existing benchmarks lack both the discriminative complexity and the ground-truth rubric annotations required for rigorous analysis. To bridge this gap, we introduce RubricBench, a curated benchmark with 1,147 pairwise comparisons specifically designed to assess the reliability of rubric-based evaluation. Our construction employs a multi-dimensional filtration pipeline to target hard samples featuring nuanced input complexity and misleading surface bias, augmenting each with expert-annotated, atomic rubrics derived strictly from instructions. Comprehensive experiments reveal a substantial capability gap between human-annotated and model-generated rubrics, indicating that even state-of-the-art models struggle to autonomously specify valid evaluation criteria, lagging considerably behind human-guided performance.
GAM: Hierarchical Graph-based Agentic Memory for LLM Agents
Zhaofen Wu | Hanrong Zhang | Fulin Lin | Wujiang Xu | Xinran Xu | Yankai Chen | Henry Peng Zou | Shaowen Chen | Weizhi Zhang | Xue Liu | Philip S. Yu | Hongwei Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
To sustain coherent long-term interactions, Large Language Model (LLM) agents must navigate the tension between acquiring new information and retaining prior knowledge. Current unified stream-based memory systems facilitate context updates but remain vulnerable to interference from transient noise. Conversely, discrete structured memory architectures provide robust knowledge retention but often struggle to adapt to fluid narrative evolution. To address this, we propose GAM, a hierarchical Graph-based Agentic Memory framework that explicitly decouples memory encoding from consolidation to effectively resolve the conflict between rapid context perception and stable knowledge retention. By isolating ongoing dialogue in an event progression graph and integrating it into a topic associative network only upon semantic shifts, our approach minimizes interference while preserving long-term consistency. Additionally, we introduce a Graph-guided, Multi-factor Retrieval strategy to enhance context precision. Experiments on LoCoMo and LongDialQA benchmarks indicate that our method consistently outperforms state-of-the-art baselines in both reasoning accuracy and computational efficiency.
Optimizing User Profiles via Contextual Bandits for Retrieval-Augmented LLM Personalization
Linfeng Du | Ye Yuan | Zichen Zhao | Fuyuan Lyu | Emiliano Penaloza | Xiuying Chen | Zipeng Sun | Jikun Kang | Laurent Charlin | Xue Liu | Haolun Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large language models (LLMs) excel at general-purpose tasks, yet adapting their responses to individual users remains challenging. Retrieval augmentation provides a lightweight alternative to fine-tuning by conditioning LLMs on user history records, and existing approaches typically select these records based on semantic relevance. We argue that relevance serves as an unreliable proxy for utility: a record may be semantically similar to a query yet fail to improve generation quality or even degrade it due to redundancy or conflicting information. To bridge this gap, we propose PURPLE, a contextual bandit framework that oPtimizes UseR Profiles for LLM pErsonalization. In contrast to a greedy selection of the most relevant records, PURPLE treats profile construction as an order-sensitive generation process and utilizes a Plackett-Luce ranking model to capture complex inter-record dependencies. By training with semantically rich feedback provided by the likelihood of the reference response, our method aligns retrieval directly with generation quality. Extensive experiments on nine personalization tasks demonstrate that PURPLE consistently outperforms strong heuristic and retrieval-augmented baselines in both effectiveness and efficiency, establishing a principled and scalable solution for optimizing user profiles.
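The Plackett-Luce model named in the abstract assigns a probability to an ordered selection by repeatedly applying a softmax over the items not yet chosen. The sketch below illustrates only that generic ranking model; the function names, the plain-float scores, and the sampling routine are assumptions for illustration and are not taken from PURPLE's implementation.

```python
import math
import random


def plackett_luce_log_prob(scores, ordering):
    """Log-probability of picking items in `ordering` under Plackett-Luce.

    `scores` are real-valued utilities (hypothetical, one per record);
    `ordering` lists item indices in the order they are selected.
    """
    remaining = list(range(len(scores)))
    log_prob = 0.0
    for idx in ordering:
        # Log-sum-exp over items still available, shifted for stability.
        m = max(scores[j] for j in remaining)
        log_z = m + math.log(sum(math.exp(scores[j] - m) for j in remaining))
        log_prob += scores[idx] - log_z
        remaining.remove(idx)
    return log_prob


def sample_ordering(scores, k, rng):
    """Sample an ordered selection of k items without replacement."""
    remaining = list(range(len(scores)))
    ordering = []
    for _ in range(min(k, len(remaining))):
        m = max(scores[j] for j in remaining)
        weights = [math.exp(scores[j] - m) for j in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        ordering.append(pick)
        remaining.remove(pick)
    return ordering
```

Because each pick renormalizes over what is left, the probability of a profile depends on the order in which records are chosen, which is the order sensitivity the abstract refers to.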
Awakening Dormant Experts: Counterfactual Routing to Mitigate MoE Hallucinations
Wentao Hu | Yanbo Zhai | Xiaohui Hu | Mingkuan Zhao | Shanhong yu | Xue Liu | Kaidong Yu | Shuangyong Song | Xuelong Li
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Sparse Mixture-of-Experts (MoE) models have achieved remarkable scalability, yet they remain vulnerable to hallucinations, particularly when processing long-tail knowledge. We identify that this fragility stems from static Top-k routing: routers tend to favor high-frequency patterns over rare factual associations. Consequently, "specialist experts" possessing critical long-tail knowledge are often assigned low gating scores and remain "dormant"—under-prioritized for specific tokens despite their proven causal importance on other inputs. To address this, we propose Counterfactual Routing (CoR), a training-free inference framework designed to awaken these dormant experts. CoR integrates layer-wise perturbation analysis with the Counterfactual Expert Impact (CEI) metric to dynamically shift computational resources from syntax-dominant to knowledge-intensive layers while maintaining a constant total activation count, effectively retrieving causally decisive experts via virtual ablation. Extensive experiments on TruthfulQA, FACTOR, and TriviaQA demonstrate that CoR improves factual accuracy by 3.1% on average without increasing the inference budget, establishing a superior Pareto frontier compared to static scaling strategies.
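To make the two ingredients in this abstract concrete, the sketch below shows (a) static Top-k routing, where only the k highest-gated experts are activated, and (b) one way to move activations between layers while keeping the total count fixed. The per-layer scores, the `k_min` floor, and the largest-remainder rounding are hypothetical stand-ins; this is not the CEI metric or the authors' allocation procedure.

```python
import math


def top_k_route(gate_logits, k):
    """Static Top-k routing: keep the k largest gates and renormalize them."""
    top = sorted(range(len(gate_logits)),
                 key=lambda i: gate_logits[i], reverse=True)[:k]
    m = max(gate_logits[i] for i in top)
    exps = [math.exp(gate_logits[i] - m) for i in top]
    z = sum(exps)
    return [(i, e / z) for i, e in zip(top, exps)]


def reallocate_budget(layer_scores, k_base, k_min=1):
    """Split len(layer_scores) * k_base activations across layers.

    Layers with higher (hypothetical) importance scores get more experts,
    every layer keeps at least k_min, and the total stays constant.
    """
    n = len(layer_scores)
    total = k_base * n
    spare = total - k_min * n
    s = sum(layer_scores)
    shares = [spare * (x / s) if s > 0 else spare / n for x in layer_scores]
    alloc = [k_min + math.floor(sh) for sh in shares]
    # Hand out the activations lost to rounding, largest fraction first.
    leftover = total - sum(alloc)
    by_frac = sorted(range(n),
                     key=lambda i: shares[i] - math.floor(shares[i]),
                     reverse=True)
    for i in by_frac[:leftover]:
        alloc[i] += 1
    return alloc
```

With `k_base=2` over three layers, the budget is six activations however the scores are distributed; only the split between layers changes.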
Preference Heads in Large Language Models: A Mechanistic Framework for Interpretable Personalization
Weixu Zhang | Ye Yuan | Changjiang Han | Yuxing Tian | Zipeng Sun | Linfeng Du | Jikun Kang | Hong Kang | Xue Liu | Haolun Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Large Language Models (LLMs) exhibit strong implicit personalization ability, yet most existing approaches treat this behavior as a black box, relying on prompt engineering or fine-tuning on user data. In this work, we adopt a mechanistic interpretability perspective and hypothesize the existence of a sparse set of Preference Heads, attention heads that encode user-specific stylistic and topical preferences and exert a causal influence on generation. We introduce Differential Preference Steering (DPS), a training-free framework that (1) identifies Preference Heads through causal masking analysis and (2) leverages them for controllable and interpretable personalization at inference time. DPS computes a Preference Contribution Score (PCS) for each attention head, directly measuring its causal impact on user-aligned outputs. During decoding, we contrast model predictions with and without Preference Heads, amplifying the difference between personalized and generic logits to selectively strengthen preference-aligned continuations. Experiments on widely used personalization benchmarks across multiple LLMs demonstrate consistent gains in personalization fidelity while preserving content coherence and low computational overhead. Beyond empirical improvements, DPS provides a mechanistic explanation of where and how personalization emerges within transformer architectures.
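The decoding step described here, contrasting predictions with and without Preference Heads, can be pictured as a contrastive adjustment of next-token logits. The sketch below assumes a simple linear form with a hypothetical strength parameter `alpha`; the abstract does not give the exact formula, so this illustrates the general idea rather than DPS itself.

```python
def steer_logits(personalized, generic, alpha=1.0):
    """Amplify the gap between personalized and generic next-token logits.

    `personalized`: logits from the full model (Preference Heads active).
    `generic`: logits with the Preference Heads masked out.
    `alpha`: hypothetical steering strength; 0 leaves the logits unchanged.
    """
    return [p + alpha * (p - g) for p, g in zip(personalized, generic)]


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=lambda i: values[i])


# Toy vocabulary of three tokens: masking the heads lowers token 1 the most,
# so steering raises token 1 relative to the others.
full_model = [2.0, 1.8, 0.1]
heads_masked = [2.0, 1.0, 0.1]
steered = steer_logits(full_model, heads_masked, alpha=1.0)
```

In this toy case the greedy choice moves from token 0 to token 1, because token 1 is the only one whose logit depends on the masked heads.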
Same Claim, Different Judgment: Benchmarking Scenario-Induced Bias in Multilingual Financial Misinformation Detection
Zhiwei Liu | Yupeng Cao | Yuechen Jiang | Mohsinul Kabir | Polydoros Giannouris | Chen Xu | Ziyang Xu | Tianlei Zhu | Md. Tariquzzaman | Triantafillos Papadopoulos | Yan Wang | Lingfei Qian | Xueqing Peng | Zhuohan Xie | Ye Yuan | Saeed Almheiri | Abdulrazzaq Alnajjar | Ming-Bin Chen | Harry Stuart | Paul Thompson | Prayag Tiwari | Alejandro Lopez-Lira | Xue Liu | Jimin Huang | Sophia Ananiadou
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have been widely applied across various domains of finance. Since their training data are largely derived from human-authored corpora, LLMs may inherit a range of human biases. Behavioral biases can lead to instability and uncertainty in decision-making, particularly when processing financial information. However, existing research on LLM bias has mainly focused on direct questioning or simplified, general-purpose settings, with limited consideration of complex real-world financial environments and high-risk, context-sensitive multilingual financial misinformation detection (MFMD) tasks. In this work, we propose MFMDScen, a comprehensive benchmark for evaluating behavioral biases of LLMs in MFMD across diverse economic scenarios. In collaboration with financial experts, we construct three types of complex financial scenarios: (i) role- and personality-based, (ii) role- and region-based, and (iii) role-based scenarios incorporating ethnicity and religious beliefs. We further develop a multilingual financial misinformation dataset covering English, Chinese, Greek, and Bengali. By integrating these scenarios with misinformation claims, MFMDScen enables a systematic evaluation of 22 mainstream LLMs. Our findings reveal that pronounced behavioral biases persist across both commercial and open-source models. This project is available at https://github.com/lzw108/FMD.
Sherry: Hardware-Efficient 1.25-Bit Ternary Quantization via Fine-grained Sparsification
Hong Huang | Decheng Wu | Qiangqiang Hu | Guanghua Yu | Jinhai Yang | Jianchen Zhu | Xue Liu | Dapeng Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The deployment of Large Language Models (LLMs) on resource-constrained edge devices is increasingly hindered by prohibitive memory and computational requirements. While ternary quantization offers a compelling solution by reducing weights to {-1, 0, +1}, current implementations suffer from a fundamental misalignment with commodity hardware. Most existing methods must choose between 2-bit aligned packing, which incurs significant bit wastage, or 1.67-bit irregular packing, which degrades inference speed. To resolve this tension, we propose Sherry, a hardware-efficient ternary quantization framework. Sherry introduces a 3:4 fine-grained sparsity that achieves a regularized 1.25-bit width by packing blocks of four weights into five bits, restoring power-of-two alignment. Furthermore, we identify a weight trapping issue in sparse ternary training, which leads to representational collapse. To address this, Sherry introduces Arenas, an annealing residual synapse mechanism that maintains representational diversity during training. Empirical evaluations on LLaMA-3.2 across five benchmarks demonstrate that Sherry matches state-of-the-art ternary performance while significantly reducing model size. Notably, on an Intel i7-14700HX CPU, our 1B model achieves zero accuracy loss compared to SOTA baselines while providing 25% bit savings and a 10% speedup. The code is available at https://github.com/Tencent/AngelSlim.
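The 1.25-bit figure follows from simple counting if "3:4" is read as exactly three nonzero weights in every block of four (an assumption that matches the arithmetic, not a detail confirmed by the abstract): there are 4 positions for the single zero and 2^3 sign patterns for the rest, giving 4 × 8 = 32 = 2^5 states, so five bits per four weights, or 5/4 = 1.25 bits per weight. The bit layout below (two bits for the zero position, three sign bits) is one hypothetical encoding, not necessarily the one Sherry uses.

```python
def pack_block(block):
    """Encode four ternary weights with exactly one zero as a 5-bit code."""
    if (len(block) != 4
            or any(w not in (-1, 0, 1) for w in block)
            or block.count(0) != 1):
        raise ValueError("expected four ternary weights with exactly one zero")
    zero_pos = block.index(0)        # 2 bits: which slot holds the zero
    signs = 0
    for w in block:
        if w != 0:                   # 3 bits: 1 for +1, 0 for -1
            signs = (signs << 1) | (1 if w > 0 else 0)
    return (zero_pos << 3) | signs


def unpack_block(code):
    """Decode a 5-bit code (0..31) back into four ternary weights."""
    zero_pos, signs = code >> 3, code & 0b111
    block, bit = [], 2
    for pos in range(4):
        if pos == zero_pos:
            block.append(0)
        else:
            block.append(1 if (signs >> bit) & 1 else -1)
            bit -= 1
    return block
```

Every code from 0 to 31 decodes to a distinct valid block, so no bit patterns are wasted. By comparison, packing three unconstrained ternary weights uses 27 of 32 patterns in five bits (5/3 ≈ 1.67 bits per weight), and 2-bit packing uses three of four patterns per weight.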
Beyond Length Scaling: Synergizing Breadth and Depth for Generative Reward Models
Qiyuan Zhang | Yufei Wang | Tianhe Wu | Can Xu | Qingfeng Sun | Kai Zheng | Xue Liu | Chen Ma
Findings of the Association for Computational Linguistics: ACL 2026
Recent advancements in Generative Reward Models (GRMs) have demonstrated that scaling the length of Chain-of-Thought (CoT) reasoning considerably enhances the reliability of evaluation. However, current works predominantly rely on unstructured length scaling, ignoring the divergent efficacy of different reasoning mechanisms: Breadth-CoT (multi-dimensional principle coverage) and Depth-CoT (substantive judgment soundness). To address this, we introduce Mix-GRM, a framework that reconfigures raw rationales into structured Breadth-CoT and Depth-CoT through a modular synthesis pipeline, subsequently employing Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR) to internalize and optimize these mechanisms. Comprehensive experiments demonstrate that Mix-GRM establishes a new state-of-the-art across five benchmarks, surpassing leading open-source RMs by an average of 8.2%. Our results reveal a clear divergence in reasoning: Breadth-CoT benefits subjective preference tasks, whereas Depth-CoT excels in objective correctness tasks. Consequently, misaligning the reasoning mechanism with the task directly degrades performance. Furthermore, we demonstrate that RLVR acts as a switching amplifier, inducing an emergent polarization where the model spontaneously allocates its reasoning style to match task demands.
2025
Warmup Generations: A Task-Agnostic Approach for Guiding Sequence-to-Sequence Learning with Unsupervised Initial State Generation
Senyu Li | Zipeng Sun | Jiayi Wang | Xue Liu | Pontus Stenetorp | Siva Reddy | David Ifeoluwa Adelani
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Traditional supervised fine-tuning (SFT) strategies for sequence-to-sequence tasks often train models to directly generate the target output. Recent work has shown that guiding models with intermediate steps (such as keywords, outlines, or reasoning chains) can significantly improve performance, coherence, and interpretability. However, these methods often depend on predefined intermediate formats and annotated data, limiting their scalability and generalizability. In this work, we introduce a task-agnostic framework that enables models to generate intermediate “warmup” sequences. These warmup sequences, serving as an initial state for subsequent generation, are optimized to enhance the probability of generating the target sequence without relying on external supervision or human-designed structures. Drawing inspiration from reinforcement learning principles, our method iteratively refines these intermediate steps to maximize their contribution to the final output, similar to reward-driven optimization in reinforcement learning from human feedback. Experimental results across tasks such as translation, summarization, and multi-choice question answering for logical reasoning show that our approach outperforms traditional SFT methods and offers a scalable and flexible solution for sequence-to-sequence tasks.
2024
Learning to Extract Structured Entities Using Language Models
Haolun Wu | Ye Yuan | Liana Mikaelyan | Alexander Meulemans | Xue Liu | James Hensman | Bhaskar Mitra
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Recent advances in machine learning have significantly impacted the field of information extraction, with Language Models (LMs) playing a pivotal role in extracting structured information from unstructured text. Prior works typically frame information extraction as a triplet-centric task and use classical metrics such as precision and recall for evaluation. We reformulate the task to be entity-centric, enabling the use of diverse metrics that can provide more insights from various perspectives. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric, designed to appropriately assess model performance. We then introduce a new Multistage Structured Entity Extraction (MuSEE) model that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages. Quantitative and human side-by-side evaluations confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction. Our source code is available at https://github.com/microsoft/Structured-Entity-Extraction.
Collaborative Performance Prediction for Large Language Models
Qiyuan Zhang | Fuyuan Lyu | Xue Liu | Chen Ma
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Comprehensively understanding and accurately predicting the performance of large language models across diverse downstream tasks has emerged as a pivotal challenge in NLP research. Pioneering work on scaling laws for downstream tasks demonstrated intrinsic similarities within model families and used these similarities for performance prediction. However, such work tends to overlook the similarities between model families and considers only the design factors listed in the original scaling law. To overcome these limitations, we introduce a novel framework, Collaborative Performance Prediction (CPP), which significantly enhances prediction accuracy by leveraging the historical performance of various models on downstream tasks and other design factors for both model and task. We also collect collaborative data from online platforms, containing both historical performance and additional design factors. With the support of this collaborative data, CPP not only surpasses traditional scaling laws in predicting the performance of scaled LLMs but also facilitates a detailed analysis of factor importance, an area previously overlooked.
2007
Co-authors
- Ye Yuan 4
- Fuyuan Lyu 3
- Chen Ma 3
- Zipeng Sun 3
- Haolun Wu 3
- Qiyuan Zhang 3
- Linfeng Du 2
- Jikun Kang 2
- Qingfeng Sun 2
- Yufei Wang 2
- Can Xu 2
- Kai Zheng 2
- David Ifeoluwa Adelani 1
- Saeed Almheiri 1
- Abdulrazzaq Alnajjar 1
- Sophia Ananiadou 1
- Yupeng Cao 1
- Laurent Charlin 1
- Yankai Chen 1
- Shaowen Chen 1
- Xiuying Chen 1
- Ming-Bin Chen 1
- Polydoros Giannouris 1
- Changjiang Han 1
- James Hensman 1
- Wentao Hu 1
- Xiaohui Hu 1
- Junling Hu 1
- Qiangqiang Hu 1
- Jimin Huang 1
- Hong Huang 1
- Yuechen Jiang 1
- Mohsinul Kabir 1
- Peng Kang 1
- Hong Kang 1
- Senyu Li 1
- Xuelong Li 1
- Fulin Lin 1
- Zhiwei Liu 1
- Alejandro Lopez-Lira 1
- Alexander Meulemans 1
- Liana Mikaelyan 1
- Yidong Ming 1
- Bhaskar Mitra 1
- Fabrizio Morbini 1
- Triantafillos Papadopoulos 1
- Emiliano Penaloza 1
- Xueqing Peng 1
- Lingfei Qian 1
- Siva Reddy 1
- Shuangyong Song (宋双永) 1
- Pontus Stenetorp 1
- Harry Stuart 1
- Md. Tariquzzaman 1
- Paul Thompson 1
- Yuxing Tian 1
- Prayag Tiwari 1
- Hongwei Wang 1
- Jiayi Wang 1
- Yan Wang 1
- Fuliang Weng 1
- Zhaofen Wu 1
- Decheng Wu 1
- Dapeng Wu 1
- Tianhe Wu 1
- Zhuohan Xie 1
- Wujiang Xu 1
- Xinran Xu 1
- Chen Xu 1
- Ziyang Xu 1
- Jinhai Yang 1
- Philip S. Yu 1
- Kaidong Yu 1
- Guanghua Yu 1
- Yanbo Zhai 1
- Hanrong Zhang 1
- Weizhi Zhang 1
- Weixu Zhang 1
- Zichen Zhao 1
- Mingkuan Zhao 1
- Junyi Zhou 1
- Tianlei Zhu 1
- Jianchen Zhu 1
- Henry Peng Zou 1
- Shanhong yu 1