Hao Liao
2026
PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models
Wenlong Shi | Jianxun Lian | Mingqi Wu | Haiming Qin | Mingyang Zhou | Xing Xie | Naipeng Chao | Hao Liao
Findings of the Association for Computational Linguistics: ACL 2026
Wenlong Shi | Jianxun Lian | Mingqi Wu | Haiming Qin | Mingyang Zhou | Xing Xie | Naipeng Chao | Hao Liao
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs’ role-playing capabilities, advancing the development of more authentic and socially adept AI agents. Our codes and long appendix are available at https://anonymous.4open.science/r/PersonaArena-B323/.
Knowing-but-Doing: Diagnosing and Defending Role-Play-Driven LLMs Jailbreaks via Moral Disengagement
Haiming Qin | Jianxun Lian | Qimin Zhong | Mingyang Zhou | Hao Liao | Naipeng Chao
Findings of the Association for Computational Linguistics: ACL 2026
Haiming Qin | Jianxun Lian | Qimin Zhong | Mingyang Zhou | Hao Liao | Naipeng Chao
Findings of the Association for Computational Linguistics: ACL 2026
Large Language Models (LLMs) are increasingly deployed in role-play scenarios, but their safety implications remain under-characterized. We present an explanatory framework grounded in Bandura’s Moral Disengagement theory and introduce a diagnostic benchmark (MD-Trace) for role-play jailbreaks. In our experiments, role-play improves safety behavior for benign personas while increasing unsafe compliance for malicious ones. We observe a Knowing-but-Doing failure in which models recognize safety risks in their thinking traces yet proceed to comply with harmful requests. Mechanism analysis suggests that Moral Justification is dominant, with Disregard of Consequences appearing as a secondary pattern. We compare multiple attack and defense methods and find that the diagnosis aligns with observed failure modes. Finally, we propose MD-Shield, an introspection-based defense that reduces attack success while maintaining Role Fidelity. The source code is publicly available at https://github.com/lavapapa/MoralJustify/.
Eliminating Out-of-Domain Recommendations in LLM-based Recommender Systems: A Unified View
Hao Liao | Jiwei Zhang | Jianxun Lian | Wensheng Lu | Mingqi Wu | Shuowangg | Yong Zhang | Yitian Huang | Mingyang Zhou | Rui Mao
Findings of the Association for Computational Linguistics: ACL 2026
Hao Liao | Jiwei Zhang | Jianxun Lian | Wensheng Lu | Mingqi Wu | Shuowangg | Yong Zhang | Yitian Huang | Mingyang Zhou | Rui Mao
Findings of the Association for Computational Linguistics: ACL 2026
Recommender systems based on Large Language Models (LLMs) are often plagued by hallucinations of out-of-domain (OOD) items. To address this, we propose RecLM, a unified framework that bridges the gap between retrieval and generation by instantiating three grounding paradigms under a single architecture: embedding-based retrieval, constrained generation over rewritten item titles, and discrete item-tokenizer generation. Using the same backbone LLM and prompts, we systematically compare these three views on public benchmarks. RecLM strictly eradicates OOD recommendations (OOD@10 = 0) across all variants, and the constrained generation variants RecLM-cgen and RecLM-token achieve overall state-of-the-art accuracy compared to both strong ID-based and LLM-based baselines. Our unified view provides a systematic basis for comparing three distinct paradigms to reduce item hallucinations, offering a practical framework to facilitate the application of LLMs to recommendation tasks. Source code is at https://github.com/microsoft/RecAI.
Let Retrievers Think Before Action: Thought-Augmented Embedding for Dense Retrieval
Ruiran Yan | Wen Xiong | Ze Liu | Chaozhuo Li | Hao Liao | Defu Lian | Zheng Liu
Findings of the Association for Computational Linguistics: ACL 2026
Ruiran Yan | Wen Xiong | Ze Liu | Chaozhuo Li | Hao Liao | Defu Lian | Zheng Liu
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) have demonstrated that explicitly performing step-by-step thinking before producing final outputs can substantially improve performance on complex tasks, as exemplified by recent reasoning-oriented models such as OpenAI O1 and DeepSeek R1. Inspired by these advancements, we propose the O1 Embedder, a novel approach aiming to endow retrieval models with similar capabilities to address challenges like multi-task retrieval, zero-shot retrieval, and tasks requiring intensive reasoning of complex relationships. The O1 Embedder generates preliminary thoughts for input queries before document retrieval. To realize this objective, we address two fundamental challenges in integrating thinking mechanisms into dense retrieval. First, retrieval tasks lack explicit supervision for intermediate thinking processes, making it difficult to define thoughts that are truly useful for retrieval. We address this challenge with a data synthesis framework following an “Exploration-Refinement” process, ensuring alignment with retrieval utility. Second, effectively integrating thought generation with representation learning requires a unified modeling framework that can jointly support generation and embedding within a single model. O1 Embedder addresses this challenge by jointly optimizing thought generation and dense retrieval in an end-to-end manner, enhancing retrieval accuracy while reducing complexity through a single deployable model. Extensive evaluations across diverse datasets demonstrate significant performance improvements, highlighting the effectiveness and generalization capability of O1 Embedder.
Toward Consistent World Models with Multi-Token Prediction and Latent Semantic Enhancement
Qimin Zhong | Hao Liao | Haiming Qin | Mingyang Zhou | Rui Mao | Wei Chen | Naipeng Chao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Qimin Zhong | Hao Liao | Haiming Qin | Mingyang Zhou | Rui Mao | Wei Chen | Naipeng Chao
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Whether Large Language Models (LLMs) develop coherent internal world models remains a core debate. While conventional Next-Token Prediction (NTP) focuses on one-step-ahead supervision, Multi-Token Prediction (MTP) has shown promise in learning more structured representations. In this work, we provide a theoretical perspective analyzing the gradient inductive bias of MTP, supported by empirical evidence, showing that MTP promotes the convergence toward internal belief states by inducing representational contractivity via gradient coupling. However, we reveal that standard MTP often suffers from structural hallucinations, where discrete token supervision encourages illegal shortcuts in latent space that violate environmental constraints. To address this, we propose a novel method **Latent Semantic Enhancement MTP (LSE-MTP)**, which anchors predictions to ground-truth hidden state trajectories. Experiments on synthetic graphs and real-world Manhattan Taxi Ride show that LSE-MTP effectively bridges the gap between discrete tokens and continuous state representations, enhancing representation alignment, reducing structural hallucinations, and improving robustness to perturbations.
2025
Hierarchical Reward Modeling for Fault Localization in Large Code Repositories
Jiwei Zhang | Jianxun Lian | Haiming Qin | Mingyang Zhou | KeZhong Lu | Rui Mao | Hao Liao
Findings of the Association for Computational Linguistics: EMNLP 2025
Jiwei Zhang | Jianxun Lian | Haiming Qin | Mingyang Zhou | KeZhong Lu | Rui Mao | Hao Liao
Findings of the Association for Computational Linguistics: EMNLP 2025
Large Language Models (LLMs) exhibit significant potential in complex software engineering tasks, however, their fault localization capabilities within repository are constrained by inherent limitations in max context length. Although Test-Time Scaling (TTS) can generate multiple candidate solutions, traditional selection strategies often fail to identify the optimal one. To solve this problem, we introduces Hierarchical Localization Reward Model (HiLoRM), which specifically designed to evaluate and select the most accurate fault localization candidates (at file, function, and line levels) from the multiple sampled outputs of LLMs, thereby enhancing localization accuracy. Furthermore, we constructed the HiFL-44k dataset, comprising approximately 44,000 fault localization instances, to train HiLoRM. Experimental results demonstrate that on the SWE-Bench-Lite dataset, HiLoRM improves the final line-level localization recall by 12% compared to a baseline model that does not use a reward model. Concurrently, HiLoRM exhibits a strong capability to evaluate predictions from larger LLMs (e.g., 32B parameters) and demonstrates transferability and generalization potential when applied to other fault localization methods. This work provides an effective methodology and an accessible model to significantly improve the accuracy and reliability of LLMs for repository-level fault localization. Our codes and datasets are available at https://github.com/SZU-ZJW/HiFL-Method.
R-CHAR: A Metacognition-Driven Framework for Role-Playing in Large Language Models
Haiming Qin | Jiwei Zhang | Wei Zhang | KeZhong Lu | Mingyang Zhou | Hao Liao | Rui Mao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Haiming Qin | Jiwei Zhang | Wei Zhang | KeZhong Lu | Mingyang Zhou | Hao Liao | Rui Mao
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Role-playing capabilities in large language models (LLMs) often lack cognitive consistency in complex scenarios that require deep understanding and coherent reasoning. While recent reasoning models excel in math and coding tasks, they show limited effectiveness in open-ended role-playing scenarios. We introduce R-CHAR (Role-Consistent Hierarchical Adaptive Reasoning), a metacognition-driven framework that enhances role-playing performance through guided thinking trajectories synthesis and adaptive evaluation. Our approach demonstrates that concise thinking processes can achieve superior performance efficiently compared to elaborate reasoning chains in role-playing social intelligence tasks, outperforming existing specialized models. Experimental results on the SocialBench benchmark show significant and stable performance improvements across varying scenario complexities, showing particular strength in long-context comprehension (from 34.64% to 68.59%) and group-level social interactions. Our work advances the development of cognitively consistent role-playing systems, bridging the gap between surface-level mimicry and authentic character simulation.
AIR-Bench: Automated Heterogeneous Information Retrieval Benchmark
Jianlyu Chen | Nan Wang | Chaofan Li | Bo Wang | Shitao Xiao | Han Xiao | Hao Liao | Defu Lian | Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Jianlyu Chen | Nan Wang | Chaofan Li | Bo Wang | Shitao Xiao | Han Xiao | Hao Liao | Defu Lian | Zheng Liu
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluation plays a crucial role in the advancement of information retrieval (IR) models. However, current benchmarks, which are based on predefined domains and human-labeled data, face limitations in addressing evaluation needs for emerging domains both cost-effectively and efficiently. To address this challenge, we propose the Automated Heterogeneous Information Retrieval Benchmark (AIR-Bench). AIR-Bench is distinguished by three key features: 1) Automated. The testing data in AIR-Bench is automatically generated by large language models (LLMs) without human intervention. 2) Heterogeneous. The testing data in AIR-Bench is generated with respect to diverse tasks, domains and languages. 3) Dynamic. The domains and languages covered by AIR-Bench are constantly augmented to provide an increasingly comprehensive evaluation benchmark for community developers. We develop a reliable and robust data generation pipeline to automatically create diverse and high-quality evaluation datasets based on real-world corpora. Our findings demonstrate that the generated testing data in AIR-Bench aligns well with human-labeled testing data, making AIR-Bench a dependable benchmark for evaluating IR models. The resources in AIR-Bench are publicly available at https://github.com/AIR-Bench/AIR-Bench.
Pretraining Context Compressor for Large Language Models with Embedding-Based Memory
Yuhong Dai | Jianxun Lian | Yitian Huang | Wei Zhang | Mingyang Zhou | Mingqi Wu | Xing Xie | Hao Liao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yuhong Dai | Jianxun Lian | Yitian Huang | Wei Zhang | Mingyang Zhou | Mingqi Wu | Xing Xie | Hao Liao
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Efficient processing of long contexts in large language models (LLMs) is essential for real-world applications like retrieval-augmented generation and in-context learning, especially in resource-constrained environments such as edge computing. This paper explores the embedding-based context compression to reduce inference costs while preserving the downstream LLM configurations. We propose a decoupled compressor-LLM framework, pretrained on text reconstruction and completion tasks, designed to effectively preserve essential contextual information within condensed embedding representations. Our extensive experiments investigate pretraining, model configurations, compression rates, efficiency across tasks, and adaptability to various LLMs. Results demonstrate that our approach outperforms competitive baselines in three domains and across eight datasets while being adaptable to different downstream LLMs. We find that thorough pretraining and carefully selected compression rates, such as 4x and 16x, enable a lightweight compressor to achieve a good balance between accuracy and speed. These findings underscore the potential of embedding-based compression to enhance LLM efficiency and motivate further research in this area.
2024
Aligning Large Language Models for Controllable Recommendations
Wensheng Lu | Jianxun Lian | Wei Zhang | Guanghua Li | Mingyang Zhou | Hao Liao | Xing Xie
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Wensheng Lu | Jianxun Lian | Wei Zhang | Guanghua Li | Mingyang Zhou | Hao Liao | Xing Xie
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Inspired by the exceptional general intelligence of Large Language Models (LLMs), researchers have begun to explore their application in pioneering the next generation of recommender systems — systems that are conversational, explainable, and controllable. However, existing literature primarily concentrates on integrating domain-specific knowledge into LLMs to enhance accuracy using a fixed task template, often overlooking the diversity of recommendation tasks and the ability of LLMs to follow recommendation-specific instructions. To address this gap, we first introduce a collection of supervised learning tasks, augmented with labels derived from a conventional recommender model, aimed at explicitly improving LLMs’ proficiency in adhering to recommendation-specific instructions. Next, we propose a reinforcement learning-based alignment procedure to enhance LLMs’ generalization ability. Extensive experiments on two real-world datasets demonstrate that our approach significantly improves the capability of LLMs to respond to instructions within recommender systems, reducing formatting errors while maintaining a high level of accuracy.
2023
Explainable Recommendation with Personalized Review Retrieval and Aspect Learning
Hao Cheng | Shuo Wang | Wensheng Lu | Wei Zhang | Mingyang Zhou | Kezhong Lu | Hao Liao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Hao Cheng | Shuo Wang | Wensheng Lu | Wei Zhang | Mingyang Zhou | Kezhong Lu | Hao Liao
Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Explainable recommendation is a technique that combines prediction and generation tasks to produce more persuasive results. Among these tasks, textual generation demands large amounts of data to achieve satisfactory accuracy. However, historical user reviews of items are often insufficient, making it challenging to ensure the precision of generated explanation text. To address this issue, we propose a novel model, ERRA (Explainable Recommendation by personalized Review retrieval and Aspect learning). With retrieval enhancement, ERRA can obtain additional information from the training sets. With this additional information, we can generate more accurate and informative explanations. Furthermore, to better capture users’ preferences, we incorporate an aspect enhancement component into our model. By selecting the top-n aspects that users are most concerned about for different items, we can model user representation with more relevant details, making the explanation more persuasive. To verify the effectiveness of our model, extensive experiments on three datasets show that our model outperforms state-of-the-art baselines (for example, 3.4% improvement in prediction and 15.8% improvement in explanation for TripAdvisor).
2022
A Joint Learning Framework for Restaurant Survival Prediction and Explanation
Xin Li | Xiaojie Zhang | Peng JiaHao | Rui Mao | Mingyang Zhou | Xing Xie | Hao Liao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
Xin Li | Xiaojie Zhang | Peng JiaHao | Rui Mao | Mingyang Zhou | Xing Xie | Hao Liao
Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing
The bloom of the Internet and the recent breakthroughs in deep learning techniques open a new door to AI for E-commence, with a trend of evolving from using a few financial factors such as liquidity and profitability to using more advanced AI techniques to process complex and multi-modal data. In this paper, we tackle the practical problem of restaurant survival prediction. We argue that traditional methods ignore two essential respects, which are very helpful for the task: 1) modeling customer reviews and 2) jointly considering status prediction and result explanation. Thus, we propose a novel joint learning framework for explainable restaurant survival prediction based on the multi-modal data of user-restaurant interactions and users’ textual reviews. Moreover, we design a graph neural network to capture the high-order interactions and design a co-attention mechanism to capture the most informative and meaningful signal from noisy textual reviews. Our results on two datasets show a significant and consistent improvement over the SOTA techniques (average 6.8% improvement in prediction and 45.3% improvement in explanation).
Search
Fix author
Co-authors
- Mingyang Zhou 10
- Jianxun Lian 6
- Rui Mao 5
- Haiming Qin 5
- Naipeng Chao 3
- Kezhong Lu 3
- Wensheng Lu 3
- Mingqi Wu 3
- Jiwei Zhang 3
- Yitian Huang 2
- Defu Lian 2
- Zheng Liu 2
- Xing Xie 2
- Wei Zhang 2
- Wei Zhang 2
- Qimin Zhong 2
- Jianlyu Chen 1
- Wei Chen 1
- Hao Cheng 1
- Yuhong Dai 1
- Peng JiaHao 1
- Chaofan Li 1
- Chaozhuo Li 1
- Guanghua Li 1
- Xin Li 1
- Ze Liu 1
- Wenlong Shi 1
- Shuowangg 1
- Bo Wang 1
- Nan Wang 1
- Shuo Wang 1
- Han Xiao 1
- Shitao Xiao 1
- Xing Xie 1
- Xing Xie 1
- Wen Xiong 1
- Ruiran Yan 1
- Xiaojie Zhang 1
- Yong Zhang 1