Ruirui Wang
2026
ToMELP: A Theory-of-Mind Benchmark for Route-Controlled Persuasion under the Elaboration Likelihood Model
Ruirui Wang | Haoran Zhang | Tian Lan | Zehua Duo | Jiang Li | Guanglai Gao | Xiangdong Su
Findings of the Association for Computational Linguistics: ACL 2026
Theory of Mind (ToM) is widely regarded as central to effective persuasion, yet existing evaluations often fail to capture the infer–apply loop that arises in real-world dialogue. We introduce Theory-of-Mind-Guided Elaboration-Likelihood Persuasion (ToMELP), a benchmark that jointly conditions on the audience persona p and the Elaboration Likelihood Model (ELM) route r ∈ {central, peripheral} within persuasive conversations. The benchmark tests whether large language models can perform ToM inference over multi-turn interactions and leverage these inferences for controllable persuasive generation. ToMELP provides a structured interface with evidence annotations, enabling automated evaluation of persuasive effectiveness, route alignment/deviation, evidence quality under the central route, and robustness to perturbations.
2025
McBE: A Multi-task Chinese Bias Evaluation Benchmark for Large Language Models
Tian Lan | Xiangdong Su | Xu Liu | Ruirui Wang | Ke Chang | Jiang Li | Guanglai Gao
Findings of the Association for Computational Linguistics: ACL 2025
As large language models (LLMs) are increasingly applied to various NLP tasks, their inherent biases are gradually being revealed. Measuring biases in LLMs is therefore crucial to mitigating their ethical risks. However, most existing bias evaluation datasets focus on English and North American culture, and their bias categories are not fully applicable to other cultures. Datasets grounded in the Chinese language and culture are scarce. More importantly, these datasets usually support only a single evaluation task and cannot evaluate bias in LLMs from multiple aspects. To address these issues, we present a Multi-task Chinese Bias Evaluation Benchmark (McBE) that includes 4,077 bias evaluation instances, covers 12 single bias categories and 82 subcategories, and introduces 5 evaluation tasks, providing extensive category coverage, content diversity, and comprehensive measurement. Additionally, we evaluate several popular LLMs from different series and with different parameter sizes. In general, all of these LLMs demonstrate varying degrees of bias. We conduct an in-depth analysis of the results, offering novel insights into bias in LLMs.