ETRQA: A Comprehensive Benchmark for Evaluating Event Temporal Reasoning Abilities of Large Language Models

Sigang Luo, Yinan Liu, Dongying Lin, Yingying Zhai, Bin Wang, Xiaochun Yang, Junpeng Liu


Abstract
Event temporal reasoning (ETR) aims to model and reason about the relationships between events and time, as well as between events, in the real world. Proficiency in ETR is a significant indicator that a large language model (LLM) truly understands the physical world. Existing question-answering datasets for evaluating ETR ability lack a systematic taxonomy and pay limited attention to compound questions. In this paper, we propose a unified taxonomy for event temporal questions and, based on this taxonomy, construct a comprehensive benchmark, ETRQA, to evaluate the ETR abilities of LLMs. ETRQA not only inherits and expands the evaluation content of existing datasets but also contains multiple categories of compound questions. We evaluate two leading LLM series, Llama and Qwen, on ETRQA across various settings. Our experimental results indicate that large-scale LLMs exhibit some ETR ability, yet they perform poorly on specific types of reasoning tasks, including reasoning involving time spans, reasoning for compound questions, and reasoning with fine temporal granularity. We hope ETRQA will benefit the temporal reasoning research community in future studies.
Anthology ID:
2025.findings-acl.1198
Volume:
Findings of the Association for Computational Linguistics: ACL 2025
Month:
July
Year:
2025
Address:
Vienna, Austria
Editors:
Wanxiang Che, Joyce Nabende, Ekaterina Shutova, Mohammad Taher Pilehvar
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
23321–23339
URL:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1198/
Cite (ACL):
Sigang Luo, Yinan Liu, Dongying Lin, Yingying Zhai, Bin Wang, Xiaochun Yang, and Junpeng Liu. 2025. ETRQA: A Comprehensive Benchmark for Evaluating Event Temporal Reasoning Abilities of Large Language Models. In Findings of the Association for Computational Linguistics: ACL 2025, pages 23321–23339, Vienna, Austria. Association for Computational Linguistics.
Cite (Informal):
ETRQA: A Comprehensive Benchmark for Evaluating Event Temporal Reasoning Abilities of Large Language Models (Luo et al., Findings 2025)
PDF:
https://preview.aclanthology.org/display_plenaries/2025.findings-acl.1198.pdf