How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models

Minsung Kim; Dong-Kyum Kim; Jea Kwon; Nakyeong Yang; Kyomin Jung; Meeyoung Cha

Abstract

Large language models leverage both parametric knowledge acquired during pretraining and in-context knowledge provided at inference time. Crucially, when these sources conflict, models arbitrate based on their internal confidence, preferring parametric knowledge for high-confidence facts while deferring to context for less familiar ones. However, the training conditions that give rise to these fundamental behaviors remain unclear. Here we conduct controlled experiments using synthetic corpora to identify the specific data properties that shape knowledge utilization. Our results reveal a counterintuitive finding: the robust, balanced use of both knowledge sources is an emergent property that requires the co-occurrence of three factors typically considered detrimental, namely (i) intra-document repetition, (ii) a moderate degree of intra-document inconsistency, and (iii) a skewed knowledge distribution. We further show that these dynamics arise in real-world language model pretraining and analyze how post-training procedures reshape arbitration strategies. Together, our findings provide empirical guidance for designing training data that supports the reliable integration of parametric and in-context knowledge in language models.

Anthology ID: 2026.acl-long.1064
Volume: Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: ACL
Publisher: Association for Computational Linguistics
Pages: 23242–23257
URL: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1064/
Cite (ACL): Minsung Kim, Dong-Kyum Kim, Jea Kwon, Nakyeong Yang, Kyomin Jung, and Meeyoung Cha. 2026. How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 23242–23257, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): How Training Data Shapes the Use of Parametric and In-Context Knowledge in Language Models (Kim et al., ACL 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.acl-long.1064.pdf
Checklist: 2026.acl-long.1064.checklist.pdf
