Zishan Huang
2026
HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
Chengyu Du | Xintao Wang | Aili Chen | Weiyuan Li | Rui Xu | Junteng Liu | Zishan Huang | Rong Tian | Zijun Sun | Yuhao Li | Liheng Feng | Deming Ding | Pengyu Zhao | Yanghua Xiao
Findings of the Association for Computational Linguistics: ACL 2026
LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in applications such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind a character’s behavior remains a non-trivial challenge. Previous efforts toward cognitive simulation in LLM role-play have suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters’ first-person thinking processes from LLMs’ third-person reasoning. To address the two deficiencies above, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and we construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, with improvements of 30.26% on the CoSER benchmark and 14.97% on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing.