Deming Ding

2026

LLM role-playing, i.e., using large language models (LLMs) to simulate specific personas, has emerged as a key capability in various applications, such as companionship, content creation, and digital games. While current models effectively capture character tones and knowledge, simulating the inner thoughts behind their behaviors remains a non-trivial challenge. Towards cognitive simulation in LLM role-play, previous efforts have mainly suffered from two critical deficiencies: the lack of high-quality datasets with explicit reasoning traces and the absence of reliable reward signals aligned with human preferences. In this paper, we propose HER (Human Emulation Reasoning), a unified framework for cognitive-level persona simulation. HER introduces a dual-layer thinking mechanism that strictly distinguishes characters’ first-person thinking processes from LLMs’ third-person reasoning. To bridge the aforementioned gaps, we curate a reasoning-augmented role-playing dataset via a reverse engineering strategy for supervised learning, and construct human-aligned evaluation principles and preference-based reward models for role-play reinforcement learning. Leveraging these resources, we train HER models based on the Qwen3-32B backbone via a hybrid paradigm of supervised learning (SL) and reinforcement learning from human feedback (RLHF). Extensive experiments validate the effectiveness of our approach. Notably, our models significantly outperform the Qwen3-32B baseline, achieving a 30.26% on the CoSER benchmark and a 14.97% on the MiniMax Benchmark. Our datasets, evaluation principles, and trained models will be released to facilitate future research in cognitive-level LLM role-playing.

Modern coding scaffolds turn LLMs into capable software agents, but their ability to follow scaffold-specified instructions remains under-examined, especially when constraints are heterogeneous and persist across interactions. To fill this gap, we introduce OctoBench, which benchmarks scaffold-aware instruction following in repository-grounded agentic coding. OctoBench includes 34 environments and 217 tasks instantiated under three scaffold types, and is paired with 7,098 objective checklist items. To disentangle solving the task from following the rules, we provide an automated observation-and-scoring toolkit that captures full trajectories and performs fine-grained checks. Experiments on eight representative models reveal a systematic gap between task-solving and scaffold-aware compliance, underscoring the need for training and evaluation that explicitly targets heterogeneous instruction following. We will release the benchmark to support reproducible benchmarking and to accelerate the development of more scaffold-aware coding agents.