Proceedings of the 3rd Workshop on Towards Knowledgeable Foundation Models (KnowFM)
Yuji Zhang | Canyu Chen | Sha Li | Mor Geva | Chi Han | Xiaozhi Wang | Shangbin Feng | Silin Gao | Isabelle Augenstein | Mohit Bansal | Manling Li | Heng Ji
Temporal Information Retrieval via Time-Specifier Model Merging
SeungYoon Han | Taeho Hwang | Sukmin Cho | Soyeong Jeong | Hoyun Song | Huije Lee | Jong C. Park
The rapid expansion of digital information and knowledge across structured and unstructured sources has heightened the importance of Information Retrieval (IR). While dense retrieval methods have substantially improved semantic matching for general queries, they consistently underperform on queries with explicit temporal constraints, often those containing numerical expressions and time specifiers such as “in 2015.” Existing approaches to Temporal Information Retrieval (TIR) improve temporal reasoning but often suffer from catastrophic forgetting, leading to reduced performance on non-temporal queries. To address this, we propose Time-Specifier Model Merging (TSM), a novel method that enhances temporal retrieval while preserving accuracy on non-temporal queries. TSM trains specialized retrievers for individual time specifiers and merges them into a unified model, enabling precise handling of temporal constraints without compromising non-temporal retrieval. Extensive experiments on both temporal and non-temporal datasets demonstrate that TSM significantly improves performance on temporally constrained queries while maintaining strong results on non-temporal queries, consistently outperforming other training methods. Our code is available at https://github.com/seungyoonee/TSM.
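Per the abstract, TSM's core operation is merging specifier-specialized retrievers in weight space. As a minimal, hypothetical sketch of parameter averaging in PyTorch (the paper's actual merging scheme and weighting may differ; all file names and specifier labels are illustrative):

```python
import torch

def merge_retrievers(state_dicts, weights=None):
    """Average the parameters of several specialist retrievers into one
    merged model (uniform weights by default)."""
    if weights is None:
        weights = [1.0 / len(state_dicts)] * len(state_dicts)
    merged = {}
    for key in state_dicts[0]:
        merged[key] = sum(w * sd[key] for w, sd in zip(weights, state_dicts))
    return merged

# Hypothetical usage: one specialist per time specifier.
# specialists = [torch.load(f"retriever_{s}.pt") for s in ("in", "before", "after")]
# model.load_state_dict(merge_retrievers(specialists))
```

Uniform averaging is the simplest instance; a tuned convex combination of the specialists is a natural variant.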
EdTec-ItemGen: Enhancing Retrieval-Augmented Item Generation Through Key Point Extraction
Alonso Palomino | David Buschhüter | Roland Roller | Niels Pinkwart | Benjamin Paassen
A major bottleneck in exam construction involves designing test items (i.e., questions) that accurately reflect key content from domain-aligned curricular materials. For instance, during formative assessments in vocational education and training (VET), exam designers must generate updated test items that assess student learning progress while covering the full breadth of topics in the curriculum. Large language models (LLMs) can partially support this process, but effective use requires careful prompting and task-specific understanding. We propose a new key point extraction method for retrieval-augmented item generation that enhances the process of generating test items with LLMs. We exhaustively evaluated our method using a TREC-RAG approach, finding that prompting LLMs with extracted key content rather than full curricular text passages significantly improves item quality, raising key-information coverage by 8%. Building on these findings, we release EdTec-ItemGen, a retrieval-augmented demo tool that supports item generation in education.
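The abstract's central claim is that prompting with extracted key points beats prompting with full passages. A minimal two-stage sketch of that idea, with hypothetical prompts and a generic `llm` callable (not the paper's actual prompts or pipeline):

```python
# Hypothetical two-stage pipeline: extract key points from a retrieved
# curriculum passage, then prompt the LLM with those key points only.

KEYPOINT_PROMPT = (
    "List the key points of the following curriculum passage as short bullets:\n\n{passage}"
)
ITEM_PROMPT = (
    "Write one multiple-choice test item that assesses these key points:\n\n{key_points}"
)

def generate_item(llm, passage: str) -> str:
    """llm is any callable mapping a prompt string to a completion string."""
    key_points = llm(KEYPOINT_PROMPT.format(passage=passage))
    return llm(ITEM_PROMPT.format(key_points=key_points))
```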
Teaching Large Language Models to Express Knowledge Boundary from Their Own Signals
Lida Chen | Zujie Liang | Xintao Wang | Jiaqing Liang | Yanghua Xiao | Feng Wei | Jinglei Chen | Zhenghong Hao | Bing Han | Wei Wang
Large language models (LLMs) have achieved great success, but their occasional content fabrication, or hallucination, limits their practical application. Hallucination arises because LLMs struggle to admit ignorance, owing to inadequate training on knowledge boundaries. We identify this as a key limitation: LLMs cannot accurately express their knowledge boundary, i.e., answer the questions they know while admitting ignorance of those they do not. In this paper, we aim to teach LLMs to recognize and express their knowledge boundary, reducing the hallucinations caused by fabricating answers to questions they cannot answer. We propose CoKE, which first probes an LLM’s knowledge boundary via its internal confidence on a set of questions, and then leverages the probing results to elicit explicit expression of that boundary. Extensive experiments show that CoKE helps LLMs express their knowledge boundaries, answering known questions while declining unknown ones, and significantly improves both in-domain and out-of-domain performance.
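CoKE's probing step relies on an internal-confidence signal. One common proxy is the average token log-probability of the generated answer; a minimal sketch with a hypothetical abstention threshold (the paper's probing and elicitation details may differ):

```python
def sequence_confidence(token_logprobs):
    """Average token log-probability as a crude internal-confidence signal."""
    return sum(token_logprobs) / len(token_logprobs)

def answer_or_abstain(answer, token_logprobs, threshold=-1.5):
    """Answer only when confidence clears the threshold; otherwise admit ignorance.
    The threshold value here is illustrative and would be calibrated in practice."""
    if sequence_confidence(token_logprobs) >= threshold:
        return answer
    return "I don't know."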
Knowledge-Grounded Detection of Cryptocurrency Scams with Retrieval-Augmented LMs
Zichao Li
This paper presents a knowledge-grounded framework for cryptocurrency scam detection using retrieval-augmented language models. We address three key limitations of existing approaches: static knowledge bases, unreliable LM outputs, and fixed classification thresholds. Our method combines (1) temporally-weighted retrieval from scam databases, (2) confidence-aware fusion of parametric and external knowledge, and (3) adaptive threshold optimization via gradient ascent. Experiments on CryptoScams and Twitter Financial Scams datasets demonstrate state-of-the-art performance, with 22% higher recall at equivalent precision compared to fixed thresholds, 4.3× lower hallucination rates than pure LMs, and 89% temporal performance retention on emerging scam types. The system achieves real-time operation (45ms/query) while maintaining interpretability through evidence grounding. Ablation studies confirm each component’s necessity, with confidence fusion proving most critical (12.1% performance drop when removed). These advances enable more robust monitoring of evolving cryptocurrency threats while addressing fundamental challenges in knowledgeable foundation models.
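Of the three components, temporally-weighted retrieval is the most self-contained. A minimal sketch of one plausible form, exponential decay of similarity with document age, using a hypothetical half-life parameter (the paper's actual weighting function is not specified here):

```python
import time

def temporal_score(similarity, doc_timestamp, now=None, half_life_days=90.0):
    """Decay a retrieval similarity exponentially with document age, so that
    recent scam reports outrank stale ones of equal semantic similarity."""
    if now is None:
        now = time.time()
    age_days = max(0.0, (now - doc_timestamp) / 86400.0)
    return similarity * 0.5 ** (age_days / half_life_days)
```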
Stress-Testing Multimodal Foundation Models for Crystallographic Reasoning
Can Polat | Hasan Kurban | Erchin Serpedin | Mustafa Kurban
Evaluating foundation models for crystallographic reasoning requires benchmarks that isolate generalization behavior while enforcing physical constraints. This work introduces xCrysAlloys, a multiscale, multicrystal dataset with two physically grounded evaluation protocols to stress-test multimodal generative models. The Spatial-Exclusion benchmark withholds all supercells of a given radius from a diverse dataset, enabling controlled assessments of spatial interpolation and extrapolation. The Compositional-Exclusion benchmark omits all samples of a specific chemical composition, probing generalization across stoichiometries. Nine vision–language foundation models are prompted with crystallographic images and textual context to generate structural annotations. Responses are evaluated via (i) relative errors in lattice parameters and density, (ii) a physics-consistency index penalizing volumetric violations, and (iii) a hallucination score capturing geometric outliers and invalid space-group predictions. These benchmarks establish a reproducible, physically informed framework for assessing generalization, consistency, and reliability in large-scale multimodal models. Dataset and implementation are available at https://github.com/KurbanIntelligenceLab/StressTestingMMFMinCR.
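Both protocols are leave-out splits over a structural or chemical attribute. A minimal sketch, assuming samples are dicts with hypothetical `radius` and `composition` keys (the dataset's actual schema may differ):

```python
def spatial_exclusion_split(samples, held_out_radius):
    """Withhold every supercell of a given radius; train on the rest."""
    test = [s for s in samples if s["radius"] == held_out_radius]
    train = [s for s in samples if s["radius"] != held_out_radius]
    return train, test

def compositional_exclusion_split(samples, held_out_composition):
    """Withhold every sample of a given stoichiometry."""
    test = [s for s in samples if s["composition"] == held_out_composition]
    train = [s for s in samples if s["composition"] != held_out_composition]
    return train, test
```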
MLAN: Language-Based Instruction Tuning Preserves and Transfers Knowledge in Multimodal Language Models
Jianhong Tu | Zhuohao Ni | Nicholas Crispino | Zihao Yu | Michael Bendersky | Beliz Gunel | Ruoxi Jia | Xin Liu | Lingjuan Lyu | Dawn Song | Chenguang Wang
We present a novel visual instruction tuning strategy that improves the zero-shot task generalization of multimodal large language models by building a firm text-only knowledge base. Existing work lacks sufficient experimentation on the importance of each modality in the instruction tuning stage, often using a majority of vision-language data while keeping text-only data limited and fixing the mixture of modalities. By incorporating diverse text-only data in the visual instruction tuning stage, we vary the amount of vision-language data in controlled experiments to investigate the importance of each modality in visual instruction tuning. Our comprehensive evaluation shows that the text-heavy instruction tuning approach performs on par with traditional vision-heavy mixtures on both modalities across 12 general datasets while using as little as half the total training tokens. We find that simply adding sufficiently diverse text-only data enables transfer of instruction-following ability and domain knowledge across modalities while being more efficient than the vision-language approach.
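The experimental variable here is the text-to-vision ratio of the instruction-tuning mixture. A minimal sketch of ratio-controlled sampling, with hypothetical data pools and a text-heavy default ratio chosen purely for illustration (the paper's exact mixtures differ):

```python
import random

def mix_instruction_data(text_only, vision_language, text_ratio=0.75,
                         total=None, seed=0):
    """Build a text-heavy instruction-tuning mixture by sampling two data
    pools at a fixed ratio."""
    rng = random.Random(seed)
    if total is None:
        total = len(text_only) + len(vision_language)
    n_text = int(total * text_ratio)
    n_vl = total - n_text
    return (rng.sample(text_only, min(n_text, len(text_only)))
            + rng.sample(vision_language, min(n_vl, len(vision_language))))
```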
ToolReAGt: Tool Retrieval for LLM-based Complex Task Solution via Retrieval Augmented Generation
Norbert Braunschweiler | Rama Doddipatla | Tudor-catalin Zorila
Artificial intelligence agents, when deployed to solve complex problems, need to first decompose the task into smaller, manageable sub-tasks and then associate tools with any sub-task that requires one. If the set of tools to choose from is large, a retrieval system is usually employed to narrow down the tool choices before the LLM proceeds with associating tools to the sub-tasks. This paper focuses on the retrieval problem of identifying the set of relevant tools to solve a complex task, given a large pool of tools to choose from, using retrieval-augmented generation (RAG); we refer to our approach as ToolReAGt. It employs ReAct prompting to perform the retrieval iteratively, first identifying whether a tool is required and then associating one or more tools with each sub-task. This deviates from conventional RAG, where an n-best list of tools is identified from the complex task directly. Experiments are presented on the UltraTool benchmark corpus, with 1000 complex tasks and over 2000 tools to select from. A conventional RAG system is established as a baseline and compared to ToolReAGt, which achieves an 8.9% improvement in recall@5 retrieval accuracy.
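A minimal sketch of the iterative, ReAct-style retrieval loop described above, with hypothetical `llm` and `retriever` callables (the prompts and the substring-based selection logic are illustrative, not the paper's):

```python
def tool_retrieval_react(llm, retriever, subtasks, k=5):
    """For each sub-task, first ask whether a tool is needed at all, and only
    then retrieve top-k candidates and let the LLM pick among them."""
    plan = {}
    for subtask in subtasks:
        needs_tool = llm("Does this sub-task require an external tool? "
                         f"Answer yes or no.\n\nSub-task: {subtask}")
        if not needs_tool.strip().lower().startswith("yes"):
            plan[subtask] = []
            continue
        candidates = retriever(subtask, k=k)  # top-k tool descriptions
        choice = llm("Pick the tool(s) needed for the sub-task from this "
                     f"list, or say 'none'.\n\nSub-task: {subtask}\n\nTools:\n"
                     + "\n".join(candidates))
        # Crude selection: keep candidates the LLM named verbatim.
        plan[subtask] = [c for c in candidates if c in choice]
    return plan
```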
Can LLMs Recognize Their Own Analogical Hallucinations? Evaluating Uncertainty Estimation for Analogical Reasoning
Zheng Chen | Zhaoxin Feng | Jianfei Ma | Jiexi Xu | Bo Li
Large language models (LLMs) often demonstrate strong performance by leveraging implicit knowledge acquired during pretraining. Analogical reasoning, which solves new problems by referencing similar known examples, offers a structured way to utilize this knowledge, but can also lead to subtle factual errors and hallucinations. In this work, we investigate whether LLMs can recognize the reliability of their own analogical outputs using black-box uncertainty estimation (UE). We evaluate six UE metrics across two reasoning-intensive tasks: mathematical problem solving (GSM8K) and code generation (Codeforces). Our results show that Kernel Language Entropy (KLE) and Lexical Similarity (LexSim) are the most robust indicators of correctness. Moreover, while analogical prompting increases model confidence over direct prompting, most uncertainty arises during the analogy transfer step. These findings highlight the limitations of analogical knowledge transfer in LLMs and demonstrate the potential of UE methods for detecting hallucinated reasoning in black-box settings.
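Lexical Similarity is typically computed as the mean pairwise overlap among several sampled answers. A minimal sketch using Jaccard overlap as the pairwise measure (the paper may use ROUGE or another similarity function):

```python
from itertools import combinations

def lexical_similarity(samples):
    """Mean pairwise Jaccard overlap between sampled answers; low overlap
    (high uncertainty) flags potentially hallucinated reasoning.
    Assumes at least two samples."""
    def jaccard(a, b):
        sa, sb = set(a.split()), set(b.split())
        return len(sa & sb) / len(sa | sb) if sa | sb else 1.0
    pairs = list(combinations(samples, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```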
Meetalk: Retrieval-Augmented and Adaptively Personalized Meeting Summarization with Knowledge Learning from User Corrections
Zheng Chen | Jiang Futian | Yue Deng | Changyang He | Bo Li
We present Meetalk, a retrieval-augmented and knowledge-adaptive system for generating personalized meeting minutes. Although large language models (LLMs) excel at summarization, their output often lacks faithfulness and does not reflect user-specific structure and style. Meetalk addresses these issues by integrating ASR-based transcription with LLM generation guided by user-derived knowledge. Specifically, Meetalk maintains and updates three structured databases, Table of Contents, Chapter Allocation, and Writing Style, based on user-uploaded samples and editing feedback. These serve as a dynamic memory that is retrieved during generation to ground the model’s outputs. To further enhance reliability, Meetalk introduces hallucination-aware uncertainty markers that highlight low-confidence segments for user review. In a user study across five real-world meeting scenarios, Meetalk significantly outperforms a strong baseline (iFLYTEK ASR + ChatGPT-4o) in completeness, contextual relevance, and user trust. Our findings underscore the importance of knowledge grounding and feedback-driven adaptation in building trustworthy, personalized LLM systems for high-stakes summarization tasks.
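The hallucination-aware markers amount to flagging low-confidence segments for review. A minimal sketch with a hypothetical marker syntax and confidence threshold (Meetalk's actual markers and scoring are not specified here):

```python
def mark_uncertain_segments(segments, confidences, threshold=0.6):
    """Wrap low-confidence summary segments in markers for user review."""
    out = []
    for seg, conf in zip(segments, confidences):
        out.append(seg if conf >= threshold
                   else f"[UNCERTAIN] {seg} [/UNCERTAIN]")
    return "\n".join(out)
```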
Theorem-of-Thought: A Multi-Agent Framework for Abductive, Deductive, and Inductive Reasoning in Language Models
Samir Abdaljalil | Hasan Kurban | Khalid Qaraqe | Erchin Serpedin
Large language models (LLMs) have shown strong performance across natural language reasoning tasks, yet their reasoning processes remain brittle and difficult to interpret. Prompting techniques like Chain-of-Thought (CoT) enhance reliability by eliciting intermediate reasoning steps or aggregating multiple outputs. However, they lack mechanisms for enforcing logical structure and assessing internal coherence. We introduce Theorem-of-Thought (ToTh), a novel framework that models reasoning as collaboration among three parallel agents, each simulating a distinct mode of inference: abductive, deductive, and inductive. Each agent produces a reasoning trace, which is structured into a formal reasoning graph. To evaluate consistency, we apply Bayesian belief propagation guided by natural language inference (NLI), assigning confidence scores to each step. The most coherent graph is selected to derive the final answer. Experiments on symbolic (WebOfLies) and numerical (MultiArith) reasoning benchmarks show that ToTh consistently outperforms CoT, Self-Consistency, and CoT-Decoding across multiple LLMs, while producing interpretable and logically grounded reasoning chains. Our findings suggest a promising direction for building more robust and cognitively inspired LLM reasoning. The implementation is available at https://github.com/KurbanIntelligenceLab/theorem-of-thought.
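ToTh scores each agent's trace by propagating belief through NLI entailment judgments. A minimal sketch that simplifies the reasoning graph to a linear chain, with a hypothetical `nli_entail_prob` callable returning the probability that one step entails the next:

```python
def chain_coherence(steps, nli_entail_prob, prior=1.0):
    """Propagate a belief score along a reasoning trace: each step's
    confidence is the previous belief times the NLI entailment probability
    that the step follows from its predecessor."""
    belief = prior
    scores = []
    for prev, step in zip(steps, steps[1:]):
        belief *= nli_entail_prob(prev, step)
        scores.append(belief)
    return belief, scores

# Select the trace (abductive / deductive / inductive) with the highest
# final belief to derive the answer:
# best = max(traces, key=lambda t: chain_coherence(t, nli_entail_prob)[0])
```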
Reasoning or Memorization? Investigating LLMs’ Capability in Restoring Chinese Internet Homophones
Jianfei Ma | Zhaoxin Feng | Huacheng Song | Emmanuele Chersoni | Zheng Chen
Chinese homophones, prevalent in Internet culture, bring rich linguistic twists that are challenging for language models. While native speakers disambiguate them through phonological reasoning and contextual understanding, it remains untested how well LLMs perform on this task, and whether LLMs achieve it through similar reasoning processes or merely through memorization of homophone-original word pairs during training. In this paper, we present HomoP-CN, the first Chinese Internet homophones dataset with systematic perturbations for evaluating LLMs’ homophone restoration capabilities. Using this benchmark, we investigated the influence of semantic, phonological, and graphemic features on LLMs’ restoration accuracy, measured each model’s reliance on memorization during restoration via consistency ratios under controlled perturbations, and assessed the effectiveness of various prompting strategies, including contextual cues, pinyin augmentation, few-shot learning, and chain-of-thought approaches.
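The memorization probe is a consistency ratio under controlled perturbations. A minimal sketch, assuming a hypothetical `restore` callable that maps an input text to the model's restored original word (the paper's exact ratio definition may differ):

```python
def consistency_ratio(restore, original_input, perturbed_inputs):
    """Fraction of perturbed inputs on which the model's restoration matches
    its restoration of the unperturbed input; high values suggest recall of
    memorized homophone pairs rather than phonological reasoning."""
    reference = restore(original_input)
    matches = sum(restore(p) == reference for p in perturbed_inputs)
    return matches / len(perturbed_inputs)
```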
Superfluous Instruction: Vulnerabilities Stemming from Task-Specific Superficial Expressions in Instruction Templates
Toma Suzuki | Yusuke Sakai | Justin Vasselli | Hidetaka Kamigaito | Taro Watanabe
Large language models (LLMs) achieve high performance through instruction-tuning, which involves learning various tasks using instruction templates. However, these templates often contain task-specific expressions, which are words that frequently appear in certain contexts but do not always convey the actual meaning of that context, even if they seem closely related to the target task. Biases inherent in such instruction templates may be learned by LLMs during training, potentially degrading performance when the models encounter superficial expressions. In this study, we propose a method that incorporates additional instructions to FLAN templates, without altering the base instruction to produce “superfluous instructions”. This allows us to investigate the vulnerabilities of LLMs caused by overfitting to task-specific expressions embedded in instruction templates. The experimental results revealed that the inclusion of superficial words strongly related to each task in the instruction text can alter the output, regardless of the intended meaning.