Josef Kuchař
2025
Pre-trained Language Models Learn Remarkably Accurate Representations of Numbers
Marek Kadlčík | Michal Štefánik | Timothee Mickus | Josef Kuchař | Michal Spiegel
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Pretrained language models (LMs) are prone to arithmetic errors. Prior work showed limited success in probing numeric values from models’ representations, suggesting that these errors stem from the inherent unreliability of distributionally learned embeddings in representing exact quantities. However, we observe that previous probing methods are inadequate for the emergent structure of learned number embeddings, which exhibit sinusoidal patterns. In response, we propose a novel probing technique that decodes numeric values from input embeddings with near-perfect accuracy across a range of open-source LMs. This demonstrates that, after pre-training alone, LMs represent numbers with remarkable precision. Finally, we find that the precision of the embeddings, as measured by our probe’s accuracy, explains a large portion of LMs’ errors in elementary arithmetic, and we show that aligning the embeddings with the pattern discovered by our probe can mitigate these errors.
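To make the idea concrete, below is a minimal sketch of a sinusoidal probe in Python. The choice of periods, the ridge regression, and the nearest-feature decoding are illustrative assumptions, not the paper's exact method; it assumes each number is a single token with a known embedding.

```python
# Minimal sketch of a sinusoidal probe (assumptions: periods, ridge
# regression, nearest-feature decoding; not the paper's exact method).
import numpy as np
from sklearn.linear_model import Ridge

PERIODS = (10, 100, 1000)  # assumed periods for the sinusoidal features

def sinusoidal_features(numbers, periods=PERIODS):
    """One sin and one cos column per period, evaluated at each number."""
    n = np.asarray(numbers, dtype=float)
    return np.column_stack(
        [f(2 * np.pi * n / T) for T in periods for f in (np.sin, np.cos)]
    )

def fit_probe(embeddings, numbers):
    """Fit a linear map from number-token embeddings to sinusoidal targets."""
    return Ridge(alpha=1.0).fit(embeddings, sinusoidal_features(numbers))

def decode(probe, embedding, candidates):
    """Return the candidate number whose sinusoidal features best match
    the probe's prediction for this embedding."""
    pred = probe.predict(embedding[None, :])[0]
    feats = sinusoidal_features(candidates)
    return candidates[int(np.argmin(np.linalg.norm(feats - pred, axis=1)))]
```

The decoding accuracy of such a probe on held-out number tokens can then serve as a proxy for how precisely the embeddings encode numeric value.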
Can Out-of-Distribution Evaluations Uncover Reliance on Prediction Shortcuts? A Case Study in Question Answering
Michal Štefánik | Timothee Mickus | Michal Spiegel | Marek Kadlčík | Josef Kuchař
Findings of the Association for Computational Linguistics: EMNLP 2025
A large body of recent work assesses models’ generalization capabilities through the lens of performance on out-of-distribution (OOD) datasets. Despite their practicality, such evaluations build on a strong assumption: that OOD evaluations can capture and reflect possible failures in real-world deployment. In this work, we challenge this assumption and confront the results of OOD evaluations with a set of specific failure modes documented in existing question-answering (QA) models, referred to as reliance on spurious features or prediction shortcuts. We find that the different datasets used for OOD evaluation in QA provide estimates of models’ robustness to shortcuts of vastly differing quality, some largely under-performing even a simple in-distribution evaluation. We partially attribute this to the observation that spurious shortcuts are shared across ID and OOD datasets, but we also find cases where a dataset’s quality for training and for evaluation are largely disconnected. Our work underscores the limitations of commonly used OOD-based evaluations of generalization, and provides methodology and recommendations for evaluating generalization within and beyond QA more robustly.
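As a concrete illustration of measuring shortcut reliance, the sketch below compares a model's accuracy on examples where a candidate shortcut heuristic agrees with the gold answer against examples where it disagrees. The function names and the gap-based metric are illustrative assumptions, not the paper's protocol.

```python
# Illustrative sketch, not the paper's protocol: estimate a QA model's
# reliance on a candidate shortcut via the accuracy gap between examples
# where the shortcut heuristic matches the gold answer and examples where
# it does not. `predict` and `heuristic` are assumed callables mapping an
# example dict (with an "answer" key) to a candidate answer.
def shortcut_reliance(examples, predict, heuristic):
    aligned = [ex for ex in examples if heuristic(ex) == ex["answer"]]
    anti = [ex for ex in examples if heuristic(ex) != ex["answer"]]

    def accuracy(subset):
        if not subset:
            return float("nan")
        return sum(predict(ex) == ex["answer"] for ex in subset) / len(subset)

    # A large positive gap suggests the model leans on the heuristic
    # rather than solving the underlying task.
    return accuracy(aligned) - accuracy(anti)
```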