Mingyu Wang
2025
Guess What I am Thinking: A Benchmark for Inner Thought Reasoning of Role-Playing Language Agents
Rui Xu | Mingyu Wang | Xintao Wang | Dakuan Lu | Xiaoyu Tan | Wei Chu | Xu Yinghui
Findings of the Association for Computational Linguistics: EMNLP 2025
Recent advances in Large Language Model (LLM)-based Role-Playing Language Agents (RPLAs) have attracted broad attention in various applications. While chain-of-thought reasoning has proven important for LLMs in many tasks, the internal thinking processes of RPLAs remain unexplored. Understanding characters' inner thoughts is crucial for developing advanced RPLAs. In this paper, we introduce ROLETHINK, a novel benchmark constructed from literature for evaluating character thought generation. We propose the task of inner thought reasoning and construct 6,058 data entries from 76 books, comprising two sets: a gold set that compares generated thoughts with original character monologues, and a silver set that uses expert-synthesized character analyses as references. To address this challenge, we propose MIRROR, a chain-of-thought approach that generates character thoughts by retrieving memories, predicting character reactions, and synthesizing motivations. Through extensive experiments, we demonstrate the importance of inner thought reasoning for RPLAs and show that MIRROR consistently outperforms existing methods.
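As a rough illustration of the three-stage reasoning described in the abstract, the Python sketch below chains memory retrieval, reaction prediction, and motivation synthesis through a generic LLM callable. The function names, prompts, and the dummy LLM are hypothetical illustrations, not taken from the paper.

"""Minimal sketch of a MIRROR-style inner-thought pipeline (illustrative only)."""
from typing import Callable


def retrieve_memories(llm: Callable[[str], str], persona: str, scene: str) -> str:
    # Stage 1: ask the model which of the character's past experiences matter here.
    return llm(f"Character profile:\n{persona}\n\nScene:\n{scene}\n\n"
               "List the character's memories most relevant to this scene.")


def predict_reaction(llm: Callable[[str], str], persona: str, scene: str, memories: str) -> str:
    # Stage 2: predict the character's immediate emotional and behavioral reaction.
    return llm(f"Character profile:\n{persona}\n\nScene:\n{scene}\n\n"
               f"Relevant memories:\n{memories}\n\n"
               "Predict how the character reacts to this scene.")


def synthesize_motivation(llm: Callable[[str], str], persona: str, scene: str,
                          memories: str, reaction: str) -> str:
    # Stage 3: combine memories and reaction into a first-person inner thought.
    return llm(f"Character profile:\n{persona}\n\nScene:\n{scene}\n\n"
               f"Memories:\n{memories}\nReaction:\n{reaction}\n\n"
               "Write the character's first-person inner thought explaining their motivation.")


def mirror(llm: Callable[[str], str], persona: str, scene: str) -> str:
    # Chain the three prompting stages into one inner-thought generation call.
    memories = retrieve_memories(llm, persona, scene)
    reaction = predict_reaction(llm, persona, scene, memories)
    return synthesize_motivation(llm, persona, scene, memories, reaction)


if __name__ == "__main__":
    # A dummy LLM callable so the sketch runs without any API key.
    echo_llm = lambda prompt: f"[model output for: {prompt[:40]}...]"
    print(mirror(echo_llm, "Elizabeth Bennet, proud and witty", "Darcy's first proposal"))

In this sketch the generated thought would be scored against the gold monologues or silver character analyses mentioned in the abstract; the actual evaluation protocol is described in the paper.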
FactCG: Enhancing Fact Checkers with Graph-Based Multi-Hop Data
Deren Lei | Yaxi Li | Siyao Li | Mengya Hu | Rui Xu | Ken Archer | Mingyu Wang | Emily Ching | Alex Deng
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Prior research on training grounded factuality classification models to detect hallucinations in large language models (LLMs) has relied on public natural language inference (NLI) data and synthetic data. However, conventional NLI datasets are not well-suited for document-level reasoning, which is critical for detecting LLM hallucinations. Recent approaches to document-level synthetic data generation involve iteratively removing sentences from documents and annotating factuality using LLM-based prompts. While effective, this method is computationally expensive for long documents and limited by the LLM's capabilities. In this work, we analyze the differences between the existing synthetic training data used in state-of-the-art models and real LLM output claims. Based on our findings, we propose a novel approach for synthetic data generation, CG2C, that leverages multi-hop reasoning on context graphs extracted from documents. Our fact checker model, FactCG, demonstrates improved performance with more connected reasoning while using the same backbone models. Experiments show it even outperforms GPT-4o on the LLM-AggreFact benchmark despite its much smaller model size.
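The Python sketch below illustrates the general idea of deriving multi-hop claims from a document-level context graph. The triple format, the networkx graph, and the template-based claim assembly are illustrative assumptions, not the paper's actual CG2C pipeline, which extracts graphs and verbalizes claims with LLMs.

"""Minimal sketch of graph-based multi-hop claim generation (illustrative only)."""
import random

import networkx as nx


def build_context_graph(triples):
    """Build a directed graph from (subject, relation, object) triples taken from a document."""
    g = nx.DiGraph()
    for subj, rel, obj in triples:
        g.add_edge(subj, obj, relation=rel)
    return g


def sample_multi_hop_claim(g, hops=2, rng=random.Random(0)):
    """Walk up to `hops` edges and join the traversed relations into one multi-hop claim string."""
    node = rng.choice([n for n in g.nodes if g.out_degree(n) > 0])
    parts = []
    for _ in range(hops):
        successors = list(g.successors(node))
        if not successors:
            break
        nxt = rng.choice(successors)
        parts.append(f"{node} {g[node][nxt]['relation']} {nxt}")
        node = nxt
    return ", and ".join(parts)


if __name__ == "__main__":
    triples = [
        ("Marie Curie", "was born in", "Warsaw"),
        ("Warsaw", "is the capital of", "Poland"),
        ("Marie Curie", "won", "the Nobel Prize"),
    ]
    graph = build_context_graph(triples)
    print(sample_multi_hop_claim(graph, hops=2))  # a 2-hop claim supported by the document

A non-supported training example could then be obtained by corrupting an entity or relation along the sampled path before verbalization; this labeling step is an assumption for the sketch rather than a description of the paper's procedure.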