Jonas Wallat
2026
When Facts Change: Temporal Knowledge Conflict Resolution in LLMs
Jonas Wallat | Wolfgang Nejdl | Sandipan Sikdar
Findings of the Association for Computational Linguistics: ACL 2026
Retrieval-augmented generation (RAG) systems require large language models (LLMs) to reconcile discrepancies between their parametric memory (knowledge encoded during training) and contextual inputs provided at inference. When these sources conflict, models often exhibit unstable reasoning and inconsistent factual behavior. We investigate how LLMs resolve such conflicts when the discrepancy arises from temporal misalignment (facts that have changed since the model’s knowledge cutoff) and whether mutability, the changeability of facts, can serve as a mediating signal in this process. To do so, we provide WIKIRECENTCHANGES, a temporally grounded benchmark with stable and recently updated facts derived from Wikidata. Our results show that while models spontaneously produce temporal reasoning for facts that actually changed (but almost never for stable ones), this differentiation rarely propagates to their final predictions. Explicitly prompting them to consider mutability increases references to temporal change but does not improve factual accuracy, revealing a disconnect between verbalized reasoning and prediction behavior. We further show that the failure point is scale-dependent: smaller models rarely detect the underlying conflict, while larger models detect it but fail to act on their mutability judgments.
2025
A Study into Investigating Temporal Robustness of LLMs
Jonas Wallat | Abdelrahman Abdallah | Adam Jatowt | Avishek Anand
Findings of the Association for Computational Linguistics: ACL 2025
Large Language Models (LLMs) encapsulate a surprising amount of factual world knowledge. However, their performance on temporal questions and historical knowledge is limited because they often cannot understand temporal scope and orientation or neglect the temporal aspect altogether. In this study, we aim to measure precisely how robust LLMs are for question answering based on their ability to process temporal information and perform tasks requiring temporal reasoning and temporal factual knowledge. Specifically, we design eight time-sensitive robustness tests for factual information to check the sensitivity of six popular LLMs in the zero-shot setting. Overall, we find LLMs lacking temporal robustness, especially to temporal reformulations and the use of different granularities of temporal references. We show how a selection of these eight tests can be used automatically to judge a model’s temporal robustness for user questions on the fly. Finally, we apply the findings of this study to improve the temporal QA performance by up to 55%.
2020
BERTnesia: Investigating the capture and forgetting of knowledge in BERT
Jonas Wallat | Jaspreet Singh | Avishek Anand
Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP
Probing complex language models has recently revealed several insights into linguistic and semantic patterns found in the learned representations. In this paper, we probe BERT specifically to understand and measure the relational knowledge it captures. We utilize knowledge base completion tasks to probe every layer of pre-trained as well as fine-tuned BERT (ranking, question answering, NER). Our findings show that knowledge is not just contained in BERT’s final layers. Intermediate layers contribute a significant amount (17-60%) to the total knowledge found. Probing intermediate layers also reveals how different types of knowledge emerge at varying rates. When BERT is fine-tuned, relational knowledge is forgotten, but the extent of forgetting depends on the fine-tuning objective and not on the size of the dataset. We found that ranking models forget the least and retain more knowledge in their final layer.