@inproceedings{abbood-etal-2025-time,
title = "Time to Revisit Exact Match",
author = "Abbood, Auss and
Meng, Zaiqiao and
Collier, Nigel",
editor = "Christodoulopoulos, Christos and
Chakraborty, Tanmoy and
Rose, Carolyn and
Peng, Violet",
booktitle = "Findings of the Association for Computational Linguistics: EMNLP 2025",
month = nov,
year = "2025",
address = "Suzhou, China",
publisher = "Association for Computational Linguistics",
url = "https://preview.aclanthology.org/ingest-luhme/2025.findings-emnlp.637/",
doi = "10.18653/v1/2025.findings-emnlp.637",
pages = "11903--11926",
ISBN = "979-8-89176-335-7",
abstract = "Temporal question answering is an established method for evaluating temporal reasoning in large language models. Expected answers are often numeric (e.g., dates or durations), yet model responses are evaluated like regular text with exact match (EM), unable to distinguish small from large errors. In this investigative work, we frame temporal question answering as a numerical estimation task to assess the shortcomings of EM. We introduce \textit{TempAnswerQA}, a benchmark distilled from Test of Time and TempTabQA, where all questions require a numerical, temporal answer, allowing us to evaluate models beyond EM. We use the forecasting metrics symmetric mean absolute percentage error (sMAPE) and mean absolute scaled error (MASE). With sMAPE, we find that error size and EM are decoupled. Models with low EM still have low sMAPE (both 20{\%}), and some models have high sMAPE despite high EM. Scaling errors by the deviation of the ground truth data with MASE reshuffles model rankings compared to EM, revealing gaps in models' understanding of temporal domain knowledge, especially when trained with synthetic data. Lastly, the models' most frequent error is to deviate by only $\pm1$ from the ground truth. sMAPE and MASE, unlike EM, adequately weight these errors. Our findings underscore the need for specialised metrics for temporal QA tasks. Our code and data are available on https://github.com/aauss/temporal-answer-qa."
}