Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning

Gagan Bhatia, Maxime Peyrard, Wei Zhao


Abstract
Modern BPE tokenisers often split calendar dates into meaningless fragments, e.g., “20250312” → “202”, “503”, “12”, inflating token counts and obscuring the inherent structure needed for robust temporal reasoning. In this work, we (1) introduce a simple yet interpretable metric, termed date fragmentation ratio, that measures how faithfully a tokeniser preserves multi-digit date components; (2) release DateAugBench, a suite of 6500 examples spanning three temporal reasoning tasks: context-based date resolution, format-invariance puzzles, and date arithmetic across historical, contemporary, and future time periods; and (3) through layer-wise probing and causal attention-hop analyses, uncover an emergent date-abstraction mechanism whereby large language models stitch together the fragments of month, day, and year components for temporal reasoning. Our experiments show that excessive fragmentation correlates with accuracy drops of up to 10 points on uncommon dates, such as historical and futuristic ones. Further, we find that the larger the model, the faster the emergent date abstraction heals date fragments. Lastly, we observe a reasoning path that LLMs follow to assemble date fragments, which typically differs from the human interpretation order (year → month → day).
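To make the idea of fragmentation concrete, the following is a minimal illustrative sketch, not the paper's definition of the date fragmentation ratio (the abstract does not give the formula). It assumes a compact YYYYMMDD date and counts a component (year, month, day) as preserved only if some token covers exactly that component. The function names `token_spans` and `fragmentation_score` are hypothetical, and the token lists are supplied by hand rather than produced by a real tokeniser.

```python
from typing import List, Tuple


def token_spans(tokens: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each token in the joined string."""
    spans = []
    start = 0
    for tok in tokens:
        spans.append((start, start + len(tok)))
        start += len(tok)
    return spans


def fragmentation_score(date: str, tokens: List[str]) -> float:
    """Fraction of YYYYMMDD components NOT preserved as a single token.

    Assumption (ours, for illustration): a component counts as preserved
    only when one token spans exactly that component.
    """
    if len(date) != 8 or not date.isdigit():
        raise ValueError("expected a compact YYYYMMDD date string")
    if "".join(tokens) != date:
        raise ValueError("tokens must concatenate back to the date string")

    # Character spans of year, month, and day in a YYYYMMDD string.
    components = [(0, 4), (4, 6), (6, 8)]
    spans = set(token_spans(tokens))
    preserved = sum(1 for comp in components if comp in spans)
    return 1.0 - preserved / len(components)


# The split quoted in the abstract: only the day "12" survives intact.
abstract_example = fragmentation_score("20250312", ["202", "503", "12"])

# A split aligned with year/month/day boundaries: nothing is fragmented.
aligned_example = fragmentation_score("20250312", ["2025", "03", "12"])
```

Under this simplified scoring, the split from the abstract breaks two of the three components, while the boundary-aligned split breaks none. A real measurement would take the token list from the tokeniser under study and would need to handle other date formats and separators.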
Anthology ID:
2025.emnlp-main.159
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
3201–3219
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.159/
Cite (ACL):
Gagan Bhatia, Maxime Peyrard, and Wei Zhao. 2025. Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 3201–3219, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Date Fragments: A Hidden Bottleneck of Tokenization for Temporal Reasoning (Bhatia et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.159.pdf
Checklist:
 2025.emnlp-main.159.checklist.pdf