Tingjian Ge


2025

Token Knowledge: A New Perspective For Knowledge in Large Language Models
Jieyong Wang | Chunyao Song | Tingjian Ge
Findings of the Association for Computational Linguistics: EMNLP 2025

In the current era of flourishing large language models (LLMs), hallucination remains a serious issue hindering their broader adoption and reliability. Predicting the presence (or absence) of specific knowledge in an LLM could help avoid hallucinations. However, LLMs generate text token by token, whereas knowledge is typically stored and evaluated in the form of triples, which makes it difficult to accurately assess an LLM's knowledge boundary. We approach this problem from a novel perspective and, for the first time, introduce the concept of token knowledge in large language models. Building on this, we propose a method for constructing token knowledge datasets and train probes on the intermediate states produced during inference. This allows us to predict whether a specific token will appear in the LLM's generated sequence without generating a single token. Our approach unlocks the model's latent potential, raising its accuracy in assessing token knowledge from about 60% to over 90%, and it generalizes well out of distribution after training on just a few dozen prompts. Finally, we apply KEGT to enhance a state-of-the-art knowledge boundary detection method, achieving improved performance while reducing computation time by over 90%. Furthermore, KEGT can prevent hallucinations in certain cases by providing guidance in the token-level knowledge semantic space. Our code is available at https://github.com/CC-2000/KEGT.
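
The abstract describes training probes on intermediate inference states to predict whether a token will appear in the model's output. The following is a minimal, hypothetical sketch of that general idea, not the paper's KEGT implementation: it assumes GPT-2 via Hugging Face transformers, an arbitrarily chosen probe layer, and a simple feature construction (the prompt's final hidden state concatenated with the target token's input embedding) feeding a linear probe.

```python
# Illustrative sketch only (assumptions: GPT-2, layer 6, linear probe).
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
LAYER = 6  # assumed intermediate layer to probe

def features(prompt: str, target: str) -> torch.Tensor:
    """Hidden state at the last prompt position, concatenated with the
    target token's input embedding (one possible feature construction)."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
        hidden = out.hidden_states[LAYER][0, -1]
        tgt_id = tok(target, add_special_tokens=False)["input_ids"][0]
        emb = model.get_input_embeddings().weight[tgt_id]
    return torch.cat([hidden, emb])

def label(prompt: str, target: str, max_new_tokens: int = 16) -> float:
    """1.0 if the target string occurs in the greedy continuation, else 0.0."""
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        gen = model.generate(**ids, max_new_tokens=max_new_tokens,
                             do_sample=False, pad_token_id=tok.eos_token_id)
    text = tok.decode(gen[0, ids["input_ids"].shape[1]:])
    return float(target in text)

# A handful of (prompt, target-token) pairs stands in for a probe training set.
pairs = [("The capital of France is", " Paris"),
         ("The capital of France is", " Berlin"),
         ("The capital of Japan is", " Tokyo"),
         ("Water consists of hydrogen and", " oxygen")]

X = torch.stack([features(p, t) for p, t in pairs])
y = torch.tensor([label(p, t) for p, t in pairs])

probe = torch.nn.Linear(X.shape[1], 1)  # simple linear probe on hidden states
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = torch.nn.BCEWithLogitsLoss()
for _ in range(200):
    opt.zero_grad()
    loss = loss_fn(probe(X).squeeze(-1), y)
    loss.backward()
    opt.step()

# The trained probe scores a new (prompt, token) pair from hidden states alone,
# i.e. without generating any tokens.
score = torch.sigmoid(probe(features("The capital of Italy is", " Rome"))).item()
print(f"predicted probability the token appears: {score:.2f}")
```

This sketch only illustrates the probing setup; the paper's dataset construction, feature choice, and probe architecture may differ.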