Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data

Christopher Adrian Kusuma; Muhammad Reza Qorib; Hwee Tou Ng

Abstract

Large language models (LLMs) are highly capable of answering questions, but they are often unaware of their own knowledge boundary, i.e., of what they know and what they do not know. As a result, they can generate factually incorrect responses on topics about which they lack sufficient knowledge, a behavior commonly known as hallucination. Rather than hallucinating, a language model should be honest and respond with "I don’t know" when it does not have enough knowledge about a topic. Many methods have been proposed to improve LLM honesty, but their evaluations lack robustness, as they do not take into account the knowledge that the LLM has ingested during its pretraining. In this paper, we propose a more robust evaluation benchmark dataset for LLM honesty by utilizing Pythia, a truly open LLM with publicly available pretraining data. In addition, we propose a novel method for harnessing the pretraining data to build a more honest LLM.

Anthology ID: 2026.findings-acl.1935
Volume: Findings of the Association for Computational Linguistics: ACL 2026
Month: July
Year: 2026
Address: San Diego, California, United States
Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue: Findings
Publisher: Association for Computational Linguistics
Pages: 38858–38871
URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1935/
Cite (ACL): Christopher Adrian Kusuma, Muhammad Reza Qorib, and Hwee Tou Ng. 2026. Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data. In Findings of the Association for Computational Linguistics: ACL 2026, pages 38858–38871, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal): Parametric Knowledge is Not All You Need: Toward Honest Large Language Models via Retrieval of Pretraining Data (Kusuma et al., Findings 2026)
PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1935.pdf
Checklist: 2026.findings-acl.1935.checklist.pdf
