Xing Xie
2026
Can AI Revise Research Papers with Human Review Feedback? An Empirical Study and Benchmark
Zihan Luo | Hong Huang | Jianxun Lian | Yu Chang | Xing Xie | Hai Jin
Findings of the Association for Computational Linguistics: ACL 2026
The rise of Human-AI collaboration can effectively speed up the research process for experts and allow anyone with critical thinking skills to conduct innovative work. A key part of this collaboration is the AI’s ability to improve a paper with human feedback—updating both the text and experiments to meet high standards. To evaluate this skill, we introduce ReviseBench, an extensible benchmark built on real academic data that can be easily scaled via agent-driven automated data collection. It tests the skills of Large Language Models (LLMs) on paper interpretation, experimental implementation, and paper formulation, using authors’ camera-ready versions as natural human baselines. To facilitate a fine-grained assessment, we further propose ReviseArena, a platform supporting pair-wise comparisons between different AI-revised papers. Our initial evaluation results on ReviseBench reveal that even state-of-the-art foundation LLMs struggle significantly in this domain, achieving a win rate of less than 10% against human experts, and facing issues like incremental revision, unprofessional revision, and potential data fabrication. Our code and data are released publicly at: https://github.com/CGCL-codes/ReviseBench.
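The win rate quoted in the abstract above can be illustrated with a minimal sketch. The function name, the outcome labels, and the choice to count ties as non-wins are assumptions made for illustration; the abstract does not specify how ReviseArena aggregates pairwise judgments.

```python
from collections import Counter


def win_rate(outcomes):
    """Fraction of pairwise comparisons won by the AI-revised paper.

    outcomes: iterable of labels "ai", "human", or "tie" (hypothetical labels).
    Ties are counted as non-wins here, which is an assumption.
    """
    counts = Counter(outcomes)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no comparisons provided")
    return counts["ai"] / total


# Made-up judgments, not data from the paper.
judgments = ["human"] * 18 + ["ai"] * 1 + ["tie"] * 1
print(f"{win_rate(judgments):.2%}")
```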
PersonaArena: Dynamic Simulation for Evaluating and Enhancing Persona-Level Role-Playing in Large Language Models
Wenlong Shi | Jianxun Lian | Mingqi Wu | Haiming Qin | Mingyang Zhou | Xing Xie | Naipeng Chao | Hao Liao
Findings of the Association for Computational Linguistics: ACL 2026
Large language models (LLMs) increasingly serve as interactive social agents, yet their ability to maintain coherent and authentic persona-level role-playing remains limited, particularly in realistic social scenarios. Existing research predominantly focuses on character-level settings and relies on static evaluation formats, failing to capture the complexity of everyday social interactions. In this work, we present PersonaArena, a dynamic simulation framework for evaluating and improving persona-level role-playing in LLMs. PersonaArena leverages a large, filtered corpus of user-generated social content to construct a nuanced persona bank, and elicits multi-turn, context-rich interactions within simulated social environments. Our framework features a multi-agent debating judge for holistic and unbiased assessment. Through extensive experiments, we demonstrate that PersonaArena enables rigorous evaluation and enhancement of LLMs’ role-playing capabilities, advancing the development of more authentic and socially adept AI agents. Our code and extended appendix are available at https://anonymous.4open.science/r/PersonaArena-B323/.
Does Theory of Mind Improvement Really Benefit Human-AI Interactions? Empirical Findings from Interactive Evaluations
Nanxu Gong | Zixin Chen | Haotian Li | Zishu Zhao | Jianxun Lian | Huamin Qu | Yanjie Fu | Xing Xie
Findings of the Association for Computational Linguistics: ACL 2026
Improving the Theory of Mind (ToM) capability of Large Language Models (LLMs) is crucial for effective social interactions between these AI models and humans. However, existing benchmarks often measure ToM capability improvement through story-reading, multiple-choice questions from a third-person perspective, while ignoring the first-person, dynamic, and open-ended nature of human-AI (HAI) interactions. To directly examine how ToM improvement techniques benefit HAI interactions, we first propose a new paradigm of interactive ToM evaluation with both perspective and metric shifts. Next, following this paradigm, we conduct a systematic study of four representative ToM enhancement techniques using four real-world datasets and a user study, covering both goal-oriented tasks (e.g., coding, math) and experience-oriented tasks (e.g., counseling). Our findings reveal that improvements on static benchmarks do not always translate to better performance in dynamic HAI interactions. This paper offers critical insights into ToM evaluation, showing the necessity of interaction-based assessments in developing next-generation, socially aware LLMs for HAI symbiosis.
MMAC: A Multilingual, Multimodal Alignment Framework for Cultural Grounding Evaluation
Weihua Zheng | Zhengyuan Liu | Tanmoy Chakraborty | Weiwen Xu | Xiaoxue Gao | Bryan Chen Zhengyu Tan | Bowei Zou | Chang Liu | Yujia Hu | Xing Xie | Xiaoyuan Yi | Jing Yao | Chaojun Wang | Long Li | Rui Liu | Huiyao Liu | Koji Inoue | Ryuichi Sumida | Tatsuya Kawahara | Fan Xu | Lingyu Ye | Wei Tian | Dongjun Kim | Jimin Jung | Jaehyung Seo | Nadya Yuki Wangsajaya | Pham Minh Duc | Ojasva Saxena | Palash Nandi | Xiyan Tao | Wiwik Karlina | Tuan Luong | Keertana Arun Vasan | Roy Ka-Wei Lee | Nancy F. Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The global deployment of Large Language Models (LLMs) underscores the urgent need to evaluate their cultural alignment. However, assessing genuine "cultural awareness" across modalities (text, vision, speech) and languages remains a significant challenge. To comprehensively investigate this domain, we propose MMAC, a systematic framework that encompasses a tri-modally aligned cultural benchmark creation pipeline and a five-dimensional evaluation protocol to assess cross-country awareness disparities, evaluate cross-lingual and cross-modal consistency, and verify cultural knowledge generalization and grounding validity. Given the prevailing Western cultural bias in current models, we focus on 8 Asian countries as our dataset foundation to more acutely reveal potential cultural deficiencies in LLMs. Our dataset, MMAC-bench, features 27,000 human-curated questions across 10 languages. Crucially, it is the first dataset aligned at the input level across text, image, and speech, enabling direct cross-modal transfer tests. Each question consists of multiple-choice options accompanied by open-ended generated explanations, where 79% require multi-step reasoning grounded in cultural context, moving beyond simple memorization. We probe the causes of modal divergence, offering insights into fostering culturally robust MLLMs.
Measuring Human Contribution in AI-Assisted Content Generation
Yueqi Xie | Tao Qi | Jingwei Yi | Xiyuan Yang | Ryan Whalen | Junming Huang | Qian Ding | Yu Xie | Xing Xie | Fangzhao Wu
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
With the growing prevalence of generative AI, an increasing amount of content is no longer exclusively generated by humans but by generative AI models with human guidance. This shift presents notable challenges for the delineation of originality due to the varying degrees of human contribution in AI-assisted works. This study raises the research question of measuring human contribution in AI-assisted content generation and introduces a framework to address this question that is grounded in information theory. By calculating mutual information between human input and AI-assisted output relative to self-information of AI-assisted output, we quantify the proportional information contribution of humans in content generation. Our experimental results demonstrate that the proposed measure effectively discriminates between varying degrees of human contribution across multiple creative domains. To further enhance real-world applicability, we extend the framework to estimate the minimal necessary human contribution for any text without requiring human input and validate its effectiveness. We hope that this work lays a foundation for measuring human contributions in AI-assisted content generation in the era of generative AI.
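The ratio described in the abstract above (mutual information between human input and output, relative to the self-information of the output) can be sketched for a single input/output pair. This is one plausible pointwise reading, not the paper's confirmed estimator: the function name is hypothetical, and the log-probabilities are assumed to come from some language model scoring the output with and without the human input as context.

```python
def human_contribution(logp_output_given_input, logp_output):
    """Pointwise information ratio for one (human input, AI output) pair.

    logp_output_given_input: log p(y | x), output scored with the human input.
    logp_output:             log p(y), output scored without the human input.
    Both values are assumed inputs, e.g. summed token log-probs from a model.
    """
    self_information = -logp_output  # I(y) = -log p(y)
    if self_information <= 0:
        raise ValueError("output must have positive self-information")
    # Pointwise mutual information: log p(y | x) - log p(y)
    pointwise_mi = logp_output_given_input - logp_output
    return pointwise_mi / self_information


# Made-up numbers: the human input explains 60 of the output's 100 nats.
ratio = human_contribution(logp_output_given_input=-40.0, logp_output=-100.0)
print(round(ratio, 3))
```

Under this reading, a ratio near 1 means the human input almost fully determines the output, and a ratio near 0 means the output is about as likely without it.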
Influence-based Online Experience Selection for Effective RLHF
Yifan Gong | Jing Yao | Xiting Wang | Xunlong Wang | Xiaoyuan Yi | Xing Xie
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Reinforcement Learning from Human Feedback (RLHF) has emerged as a crucial technique for aligning large language models (LLMs) with human preferences. However, existing RLHF methods face key challenges, including poor sample efficiency, high computational overhead, and slow convergence. Recent studies highlight the importance of data selection in RL, but how to effectively select the most beneficial experiences for RL training remains an open problem. Existing data selection methods for RL rely on heuristic metrics, failing to establish an interpretable connection between data and optimization objectives. To address this problem, we propose InfOES (Influence-based Online Experience Selection), a novel data selection method for RLHF that dynamically estimates the influence of individual training samples on policy optimization. By incorporating data attribution into the policy gradient, InfOES can identify and filter out detrimental samples on the fly, ensuring effective convergence toward alignment objectives. Our approach is compatible with various RL algorithms (e.g., PPO, GRPO, REINFORCE++). Extensive experiments demonstrate that InfOES significantly enhances training effectiveness, achieving superior alignment performance with fewer optimization steps.
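The idea of filtering detrimental samples by their estimated influence can be illustrated with a generic first-order proxy: the dot product between a sample's gradient and a reference gradient for the objective. This is a swapped-in, TracIn-style simplification chosen for illustration, not the InfOES estimator, whose details are not given in the abstract. All names, the threshold, and the toy gradients are assumptions.

```python
def dot(u, v):
    """Inner product of two equal-length vectors given as lists."""
    return sum(a * b for a, b in zip(u, v))


def select_experiences(sample_grads, reference_grad, threshold=0.0):
    """Return indices of samples whose influence proxy exceeds the threshold.

    sample_grads:   per-sample policy gradients (hypothetical, flattened).
    reference_grad: gradient of the alignment objective on held-out data
                    (hypothetical). A positive dot product suggests a step on
                    that sample also moves the reference objective in the
                    desired direction, to first order.
    """
    return [
        idx
        for idx, grad in enumerate(sample_grads)
        if dot(grad, reference_grad) > threshold
    ]


# Toy gradients, not real training data.
sample_grads = [[1.0, 0.5], [-2.0, 0.1], [0.2, 0.3]]
reference_grad = [1.0, 1.0]
print(select_experiences(sample_grads, reference_grad))
```

In an online setting the selection would be recomputed each batch as the policy changes, which is why the toy function takes gradients rather than fixed scores.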
Can Persona-Prompted LLMs Emulate Subgroup Values? An Empirical Analysis of Generalisability and Fairness in Cultural Alignment
Bryan Chen Zhengyu Tan | Zhengyuan Liu | Xiaoyuan Yi | Jing Yao | Xing Xie | Nancy F. Chen | Roy Ka-Wei Lee
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Despite their global prevalence, many Large Language Models (LLMs) are aligned to a monolithic, often Western-centric set of values. This paper investigates the more challenging task of fine-grained value alignment: examining whether LLMs can emulate the distinct cultural values of demographic subgroups. Using Singapore as a case study and the World Values Survey (WVS), we examine the value landscape and show that even state-of-the-art models like GPT-4.1 achieve only 57.4% accuracy in predicting subgroup modal preferences. We construct a dataset of over 20,000 samples to train and evaluate a range of models. We demonstrate that simple fine-tuning on structured numerical preferences yields substantial gains, improving accuracy on unseen, out-of-distribution subgroups by an average of 17.4%. These gains partially transfer to open-ended generation. However, we find significant pre-existing performance biases, where models better emulate young, male, Chinese, and Christian personas. Furthermore, while fine-tuning improves average performance, it widens the disparity between subgroups when measured by distance-aware metrics. Our work offers insights into the limits and fairness implications of subgroup-level cultural alignment.
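Accuracy in predicting subgroup modal preferences, as mentioned in the abstract above, can be sketched as follows. The data layout, function names, and example values are hypothetical; the abstract does not describe how ties or missing responses are handled, so the toy data avoids ties.

```python
from collections import Counter


def modal_preference(responses):
    """Most frequent survey answer within one subgroup for one question."""
    return Counter(responses).most_common(1)[0][0]


def modal_accuracy(survey, predictions):
    """Share of (subgroup, question) pairs where the prediction is the mode.

    survey:      dict mapping (subgroup, question) -> list of human answers.
    predictions: dict mapping (subgroup, question) -> model's predicted answer.
    """
    hits = sum(
        1
        for key, responses in survey.items()
        if predictions[key] == modal_preference(responses)
    )
    return hits / len(survey)


# Made-up responses on an ordinal scale, not World Values Survey data.
survey = {
    ("age<30", "Q1"): [1, 1, 2, 3],
    ("age<30", "Q2"): [4, 4, 4, 2],
    ("age>=60", "Q1"): [3, 3, 1],
}
predictions = {
    ("age<30", "Q1"): 1,
    ("age<30", "Q2"): 2,
    ("age>=60", "Q1"): 3,
}
print(round(modal_accuracy(survey, predictions), 3))
```

Exact-match accuracy treats a near miss on an ordinal scale the same as a far miss, which is one reason the abstract also refers to distance-aware metrics.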
2025
MoVa: Towards Generalizable Classification of Human Morals and Values
Ziyu Chen | Junfei Sun | Chenxi Li | Tuan Dung Nguyen | Jing Yao | Xiaoyuan Yi | Xing Xie | Chenhao Tan | Lexing Xie
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Identifying human morals and values embedded in language is essential to empirical studies of communication. However, researchers often face substantial difficulty navigating the diversity of theoretical frameworks and data available for their analysis. Here, we contribute MoVa, a well-documented suite of resources for generalizable classification of human morals and values, consisting of (1) 16 labeled datasets and benchmarking results from four theoretically-grounded frameworks; (2) a lightweight LLM prompting strategy that outperforms fine-tuned models across multiple domains and frameworks; and (3) a new application that helps evaluate psychological surveys. In practice, we specifically recommend a classification strategy, all@once, that scores all related concepts simultaneously, resembling the well-known multi-label classifier chain. The data and methods in MoVa can facilitate many fine-grained interpretations of human and machine communication, with potential implications for the alignment of machine behavior.
Co-authors
- Jing Yao 4
- Xiaoyuan Yi 4
- Jianxun Lian 3
- Nancy Chen 2
- Roy Ka-Wei Lee 2
- Zhengyuan Liu 2
- Bryan Chen Zhengyu Tan 2
- Tanmoy Chakraborty 1
- Yu Chang 1
- Naipeng Chao 1
- Zixin Chen 1
- Ziyu Chen 1
- Qian Ding 1
- Pham Minh Duc 1
- Yanjie Fu 1
- Xiaoxue Gao 1
- Nanxu Gong 1
- Yifan Gong 1
- Yujia Hu 1
- Hong Huang 1
- Junming Huang 1
- Koji Inoue 1
- Hai Jin 1
- Jimin Jung 1
- Wiwik Karlina 1
- Tatsuya Kawahara 1
- Dongjun Kim 1
- Chenxi Li 1
- Haotian Li 1
- Long Li 1
- Hao Liao 1
- Chang Liu 1
- Huiyao Liu 1
- Rui Liu 1
- Zihan Luo 1
- Tuan Luong 1
- Palash Nandi 1
- Tuan Dung Nguyen 1
- Tao Qi 1
- Haiming Qin 1
- Huamin Qu 1
- Ojasva Saxena 1
- Jaehyung Seo 1
- Wenlong Shi 1
- Ryuichi Sumida 1
- Junfei Sun 1
- Chenhao Tan 1
- Xiyan Tao 1
- Wei Tian (田巍) 1
- Keertana Arun Vasan 1
- Chaojun Wang 1
- Xiting Wang 1
- Xunlong Wang 1
- Nadya Yuki Wangsajaya 1
- Ryan Whalen 1
- Fangzhao Wu 1
- Mingqi Wu 1
- Lexing Xie 1
- Yu Xie 1
- Yueqi Xie 1
- Fan Xu (徐凡) 1
- Weiwen Xu 1
- Xiyuan Yang 1
- Lingyu Ye 1
- Jingwei Yi 1
- Zishu Zhao 1
- Weihua Zheng 1
- Mingyang Zhou 1
- Bowei Zou (邹博伟) 1