AnaToM: A Dataset Generation Framework for Evaluating Theory of Mind Reasoning Toward the Anatomy of Difficulty through Structurally Controlled Story Generation

Jundai Suzuki, Ryoma Ishigaki, Eisaku Maeda


Abstract
Evaluating Theory of Mind (ToM) in Large Language Models (LLMs) is an important area of research for understanding the social intelligence of AI. Recent ToM benchmarks have made significant strides in enhancing the complexity, comprehensiveness, and practicality of evaluation. However, while the focus has been on constructing “more difficult” or “more comprehensive” tasks, there has been insufficient systematic analysis of the structural factors that inherently determine the difficulty of ToM reasoning—that is, “what” makes reasoning difficult. To address this challenge, we propose a new dataset generation framework for ToM evaluation named AnaToM. To realize an “Anatomy of Difficulty” in ToM reasoning, AnaToM strictly controls structural parameters such as the number of entities and the timeline in a story. This parameter control enables the isolation and identification of factors affecting the ToM of LLMs, allowing for a more precise examination of their reasoning mechanisms. The proposed framework provides a systematic methodology for diagnosing the limits of LLM reasoning abilities and offers new guidelines for future benchmark design.
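The abstract names the number of entities and the timeline as examples of controlled structural parameters but gives no implementation details. The following is a minimal toy sketch of what structurally controlled story generation could look like. It is not the authors' framework: `StoryParams`, `generate_story`, the "ball in a box" scenario, and the rule that only agents present in the room update their beliefs are all assumptions made for illustration.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class StoryParams:
    """Hypothetical structural parameters (names are illustrative only)."""
    n_agents: int      # number of characters (entities) in the story
    n_containers: int  # number of opaque locations the object can be in
    n_events: int      # timeline length: events generated after the setup
    seed: int = 0      # fixes the random choices so a story is reproducible


def generate_story(p: StoryParams):
    """Return (sentences, true_location, beliefs) for one toy story."""
    if p.n_agents < 1 or p.n_containers < 2:
        raise ValueError("need at least 1 agent and 2 containers")
    rng = random.Random(p.seed)
    agents = [f"Agent{i}" for i in range(p.n_agents)]
    containers = [f"box{i}" for i in range(p.n_containers)]
    location = containers[0]
    present = set(agents)                    # everyone starts in the room
    beliefs = {a: location for a in agents}  # everyone witnesses the setup
    sentences = [f"The ball is in {location}. All agents are in the room."]
    for _ in range(p.n_events):
        actor = rng.choice(agents)
        if actor not in present:
            # Assumption: boxes are opaque, so entering reveals nothing.
            present.add(actor)
            sentences.append(f"{actor} enters the room.")
        elif rng.random() < 0.5 and len(present) > 1:
            present.remove(actor)
            sentences.append(f"{actor} leaves the room.")
        else:
            location = rng.choice([c for c in containers if c != location])
            sentences.append(f"{actor} moves the ball to {location}.")
            for witness in present:          # only witnesses update beliefs
                beliefs[witness] = location
    return sentences, location, beliefs


if __name__ == "__main__":
    # Vary one parameter while holding the others fixed.
    for n_agents in (2, 3, 4):
        params = StoryParams(n_agents=n_agents, n_containers=3, n_events=8)
        story, truth, beliefs = generate_story(params)
        n_false = sum(b != truth for b in beliefs.values())
        print(n_agents, len(story), n_false)
```

Because each parameter is an explicit field, one can be varied while the others stay fixed, which is the kind of isolation the abstract describes. Any real difficulty analysis would also need question templates and model evaluation, which this sketch omits.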
Anthology ID:
2025.findings-ijcnlp.14
Volume:
Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
Month:
December
Year:
2025
Address:
Mumbai, India
Editors:
Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
Venue:
Findings
Publisher:
The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
Pages:
244–257
URL:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.14/
Cite (ACL):
Jundai Suzuki, Ryoma Ishigaki, and Eisaku Maeda. 2025. AnaToM: A Dataset Generation Framework for Evaluating Theory of Mind Reasoning Toward the Anatomy of Difficulty through Structurally Controlled Story Generation. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 244–257, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
Cite (Informal):
AnaToM: A Dataset Generation Framework for Evaluating Theory of Mind Reasoning Toward the Anatomy of Difficulty through Structurally Controlled Story Generation (Suzuki et al., Findings 2025)
PDF:
https://preview.aclanthology.org/ingest-ijcnlp-aacl/2025.findings-ijcnlp.14.pdf