Workshop on Towards Knowledgeable Foundation Models (2026)
Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM 2026)
Canyu Chen | Yuji Zhang | Zoey Sha Li | Zihan Wang | Qineng Wang | Jinyan Su | Priyanka Kargupta | Sara Vera Marjanović | Jeff Z. Pan | Mohit Bansal | Isabelle Augenstein | Jiawei Han | Heng Ji | Manling Li
Annotation Frameworks Shape Model Knowledge: Safety Alignment in Large Language Models
Wajdi Zaghouani
Large language models (LLMs) are commonly described as acquiring knowledge through large-scale pretraining on textual corpora. This view underestimates the epistemic consequences of post-training safety mechanisms. Modern LLMs undergo extensive safety alignment via curated datasets, human annotations, and reinforcement learning from human feedback (RLHF), processes that do not merely constrain outputs but actively reshape how propositional and procedural knowledge is accessed and expressed. We propose a conceptual framework in which safety alignment functions as a systematic form of knowledge editing at scale. Annotation frameworks used to construct safety datasets act as normative ontologies that partition language into categories of acceptable and unacceptable content, and alignment training propagates these distinctions into model behaviour. We introduce the Safety Knowledge Pipeline (SKP), a four-stage framework describing how pretraining knowledge is progressively filtered, reframed, and constrained through annotation and alignment mechanisms. We identify three mechanisms of knowledge modification (suppression, reframing, and substitution), each with distinct diagnostic signals, and we operationalise them in a cross-lingual evaluation protocol. Throughout, we distinguish carefully between behavioural claims that follow from prior empirical literature and representational claims that remain open hypotheses. Case studies spanning harmful instruction queries, hate speech annotation in Arabic dialects, and culturally variable discourse illustrate the framework. We further discuss how treating annotator disagreement as a training signal rather than noise can mitigate the culturally hegemonic effects of current alignment pipelines.
Can factual errors in language models be repaired by editing a single hidden activation at inference time? We compare blind edits, which are not told the correct answer, with oracle edits that receive answer-specific information. On Pythia-6.9B, with corruption replicated on Pythia-1B and GPT-2 XL, we find a strong break/fix asymmetry: single-layer perturbations easily corrupt correct factual recall, flipping 74-100% of initially correct answers, but blind repair is much harder. On EntityConfusion, twelve blind non-gradient interventions from four families fail to repair stable hallucinations in the strict single-layer setting; relaxed multi-layer or multi-head variants improve net accuracy by only +3 percentage points. Blind gradient optimization repairs more errors, but often breaks already-correct answers. In contrast, oracle edits given the correct answer repair many more hallucinations, fixing 68% at the default layer and up to 82% at a better layer. These results suggest that the main barrier is not whether factual recall can be steered, but whether a blind method can identify the right target-specific direction. TriviaQA is a boundary case: blind confidence maximization outperforms the single-token oracle, but the comparison is complicated because evaluation accepts multiple aliases.
What Does Alignment Cost? The Structural Brittleness of Chain-of-Thought Reasoning
Joanna Hao | Shanduojiao Jiang | Sai Asish Nakka
While Chain-of-Thought (CoT) prompting enables Large Language Models to explicitly justify their predictions, the extent to which these textual rationales faithfully reflect internal computation remains unclear. We investigate the circuit-level impact of alignment by performing a strict within-family comparison of the 1B-parameter Llama 3 architecture (Base vs. Instruct). Executing dynamic circuit discovery and dual-direction resample ablation on unconstrained CoT traces across synthetic mathematical primitives and a GSM8K proxy, we find that foundation models possess highly redundant, self-repairing computational networks; completely corrupting their primary reasoning circuits yields a minimal performance drop (2.92%) due to the dynamic compensation of backup heads (the Hydra Effect). In contrast, the instruction-tuned model exhibits reduced structural redundancy, suffering more than double the degradation (6.79%) under identical perturbation. We formalize our observation as an "Alignment Tax on Redundancy": optimizing for human-preference compliance repurposes dormant backup circuits, centralizing mathematical routing and rendering the aligned model’s reasoning pathways significantly more vulnerable to internal perturbation.
bLLeQA: Benchmarking LLMs for Grounded Legal Question-Answering in French and Dutch
Nikolay Banar | Ehsan Lotfi | Jens Van Nooten | Marija Kliocaite | Walter Daelemans
Retrieval-augmented generation (RAG) systems can play an important role in making law more accessible. However, large and reliable resources for training and benchmarking such systems remain scarce, especially for under-resourced languages like Dutch. To address this gap, and building on previous work (Louis et al., 2024), we introduce bLLeQA, a bilingual parallel question-answering dataset grounded in Belgian legal resources, both in French and Dutch. The dataset contains aligned questions, answers, and supporting articles in both languages, enabling evaluation of both retrieval and end-to-end RAG pipelines. Using bLLeQA, we benchmark the full RAG pipeline in a zero-shot setting, covering retrieval, citation extraction, refusal behavior, and generation quality. Our experiments show that open-weight models are competitive with proprietary models in retrieval and citation extraction, but lag behind in generation quality in the RAG pipeline. Across all models, refusal capability remains weak, meaning that models do not reliably detect when the provided supporting sources are incomplete. In addition, the end-to-end RAG setup still yields a substantial share of flawed responses, reaching 20% even in the best-case scenario.
VLA-Forget: Vision-Language-Action Unlearning for Embodied Foundation Models
Ravi Ranjan | Agoritsa Polyzou
Vision-language-action (VLA) models are emerging as embodied foundation models for robotic manipulation, but their deployment introduces a new unlearning challenge: removing unsafe, spurious, or privacy-sensitive behaviors without degrading perception, language grounding, and action control. In OpenVLA-style policies, behavior is produced through a fused visual encoder, a cross-modal projector, and a language backbone that predicts tokenized robot actions, so undesirable knowledge can be distributed across perception, alignment, and reasoning/action layers rather than confined to a single module. Consequently, partial unlearning applied only to the vision stack or only to the language backbone is often insufficient, while conventional unlearning baselines designed for standalone vision or language models may leave residual forgetting or incur unnecessary utility loss in embodied settings. We propose VLA-Forget, a hybrid unlearning framework that combines ratio-aware selective editing for perception and cross-modal specificity with layer-selective reasoning/action unlearning for utility-preserving forgetting. VLA-Forget jointly optimizes three objectives: targeted forgetting, perceptual preservation, and reasoning retention, through staged updates over the visual encoder, projector, and upper action-generating transformer blocks. Across forget-set behavior probes and retain-task evaluations, VLA-Forget improves forgetting efficacy by 10%, preserves perceptual specificity by 22%, retains reasoning and task success by 9%, and reduces post-quantization recovery by 55% relative to strong unlearning baselines.
Overcoming the Impedance Mismatch: A Theoretical Roadmap for Fusing Foundation Models and Knowledge Graphs
Sahil Rajesh Dhayalkar
Modern artificial intelligence remains fundamentally divided between the continuous, probabilistic spaces of Foundation Models and the discrete, deterministic structures of Knowledge Graphs. While Retrieval-Augmented Generation (RAG) attempts to connect them by serializing graph data into text, we argue this lexical bridging is merely a superficial patch. In this paper, we formalize the underlying structural and geometric friction as the Impedance Mismatch. By categorizing current neuro-symbolic integration strategies into a three-tiered hierarchy, we demonstrate that neither surface-level prompt injection nor continuous representation alignment can preserve the strict logical motifs required for reliable multi-hop reasoning. We define the specific mathematical limits, such as the Lexical Bottleneck and Topological Collapse, that show current architectures will eventually hallucinate or conflate semantic nodes. To achieve true semantic fusion, we propose a rigorous theoretical roadmap. We advocate for natively internalizing discrete symbolic structures through Structured Residual Streams, utilizing Vector Symbolic Architectures for latent sub-graph injection, and performing model updates via Orthogonal Subspace Editing. This actionable framework paves the way for models that seamlessly fuse the precision of symbolic logic with the expressivity of parametric memory.
LLM-MemCluster: Empowering Large Language Models with Dynamic Memory for Text Clustering
Yuanjie Zhu | Liangwei Yang | Ke Xu | Weizhi Zhang | Zihe Song | Jindong Wang | Philip S. Yu
Large Language Models (LLMs) are reshaping unsupervised learning by offering an unprecedented ability to perform text clustering based on their deep semantic understanding. However, their direct application is fundamentally limited by a lack of stateful memory for iterative refinement and the difficulty of managing cluster granularity. As a result, existing methods often rely on complex pipelines with external modules, sacrificing a truly end-to-end approach. We introduce LLM-MemCluster, a novel framework that reconceptualizes clustering as a fully LLM-native task. It leverages a Dynamic Memory to instill state awareness and a Dual-Prompt Strategy to enable the model to reason about and determine the number of clusters. Evaluated on several benchmark datasets, our tuning-free framework significantly and consistently outperforms strong baselines. LLM-MemCluster presents an effective, interpretable, and truly end-to-end paradigm for LLM-based text clustering.
Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked documents as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix W via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.
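The test-time adaptation described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a plain (unweighted) hinge margin loss, initialises W to the identity, and omits the confidence weighting and the cross-query momentum buffer. The function names and the parameters `k`, `steps`, `lr`, and `margin` are hypothetical.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def bilinear_score(q, W, d):
    # s(q, d) = q^T W d
    return sum(q[i] * dot(W[i], d) for i in range(len(q)))


def adapt_and_rerank(q, docs, k=1, steps=5, lr=0.1, margin=1.0):
    """Adapt W on pseudo-labels from the first-stage ranking, then rerank.

    Assumption: a plain hinge loss max(0, margin - s(q, d+) + s(q, d-)).
    """
    n = len(q)
    # Start from the identity so the initial score equals the dense dot product.
    W = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    first_stage = sorted(range(len(docs)), key=lambda i: -dot(q, docs[i]))
    pos, neg = first_stage[:k], first_stage[-k:]  # pseudo-positives / pseudo-negatives
    for _ in range(steps):
        for p in pos:
            for m in neg:
                loss = margin - bilinear_score(q, W, docs[p]) + bilinear_score(q, W, docs[m])
                if loss > 0:
                    # Gradient of the hinge loss w.r.t. W is q (d- - d+)^T; step against it.
                    for i in range(n):
                        for j in range(n):
                            W[i][j] += lr * q[i] * (docs[p][j] - docs[m][j])
    reranked = sorted(range(len(docs)), key=lambda i: -bilinear_score(q, W, docs[i]))
    return reranked, W
```

Because W starts at the identity, the first pass reproduces the dense ranking; each update only widens the score gap for the pair it was computed on.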
Multimodal Generative Engine Optimization: Rank Manipulation for Vision–Language Model Rankers
Yixuan Du | Chenxiao Yu | Haoyan Xu | Ziyi Wang | Yue Zhao | Xiyang Hu
Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM’s ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model’s internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model’s internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems.
Beyond Retrieval: Bi-Temporal State Arbitration for Longitudinal Healthcare Agents
Jianing Zhao | Xiaoquan Zhi | Xinqiang Yu
Longitudinal healthcare agents require persistent state tracking under temporal uncertainty. In domains like chronic disease management, patient states (medications, symptoms, and vital signs) evolve continuously over months. Existing memory architectures for Large Language Models (LLMs) are inherently retrieval-centric: they treat memory as a static repository of past interactions, failing to resolve conflicting or superseded information when queried for the current patient state. We propose a shift to state-centric memory. Our framework introduces (1) a bi-temporal state representation that decouples event time from ingestion time and tracks temporal validity windows, (2) an incremental state arbitration mechanism using four operators (SUPPORT, REFINE, SUPERSEDE, and BRANCH-CONFLICT) to handle evolving medical facts without destructive overwriting, and (3) a confidence-thresholded evidence escalation layer for robust, efficient memory access. Evaluated on a longitudinal diabetes management suite as a representative biomedical state tracking task, our method achieves a Unique-F1 of 0.85 and a Conflict-F1 of 0.98, substantially improving upon long-context LLMs (0.38 / 0.89) and standard vector memory (0.30 / 0.60). These results demonstrate that agentic AI in longitudinal biomedical settings requires continuous, evidence-grounded arbitration rather than simple retrieval.
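A minimal sketch of a bi-temporal record with non-destructive superseding is given below. It is one plausible reading of the abstract, not the authors' design: only a SUPERSEDE-like operation is shown, and the names `Fact`, `StateStore`, `supersede`, and `current`, as well as the integer timestamps, are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Fact:
    key: str
    value: str
    event_time: int                 # when the fact became true for the patient
    ingestion_time: int             # when the system learned about it
    valid_to: Optional[int] = None  # end of validity window; None means still open


class StateStore:
    def __init__(self) -> None:
        self.facts: List[Fact] = []

    def supersede(self, new: Fact) -> None:
        # Close the validity window of older open facts instead of deleting them.
        for f in self.facts:
            if f.key == new.key and f.valid_to is None and f.event_time <= new.event_time:
                f.valid_to = new.event_time
        self.facts.append(new)

    def current(self, key: str, as_of: int) -> Optional[str]:
        # A fact is live if as_of falls inside [event_time, valid_to).
        live = [
            f for f in self.facts
            if f.key == key
            and f.event_time <= as_of
            and (f.valid_to is None or as_of < f.valid_to)
        ]
        if not live:
            return None
        return max(live, key=lambda f: f.event_time).value
```

Keeping the closed record means a query for an earlier point in event time still returns the value that held then, even though a newer value was ingested later.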
RSCE: Training-Free Residual Stream Encoding for Persistent Context Amortization
Adam Kamel | Eric Xu
A central question in the knowledge lifecycle of language models is how externally injected signals interact with parametric memory accumulated during pretraining. We address this through Residual Stream Context Encoding (RSCE), a training-free method that encodes a context document ctx into a single vector C ∈ ℝ^{d_M} via mean-pooling residual stream activations at a calibrated intermediate layer, then injects C as an additive shift at query time. This replaces O(|T(ctx)|) attention prefill with an O(1) operation and reveals a previously undescribed dual-pathway interference effect: vector injection alone suppresses parametric recall below the question-only baseline across four of five tested architectures. This finding, absent in behavioral activation steering, provides mechanistic evidence that LLMs maintain separate contextual-retrieval and parametric-recall pathways that compete when externally injected signals are semantically rich but token-precision deficient. A dual-channel design pairing C with a compact explicit fact block F resolves this tension. We evaluate five decoder-only architectures (7B–70B) on multi-document QA (LongBench, n=108) and six on cross-file code completion (RepoBench-C), comparing against LongLLMLingua and EHPC. At extreme compression (∼99% token reduction), RSCE Vec+F is competitive with EHPC on smaller architectures (LLaMA-8B F1 0.333 vs. EHPC 0.334; DeepSeek-14B both 0.214) while both substantially outperform LongLLMLingua. RSCE is the only method achieving 81% compression at 100% operational reliability on code.
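The encode-then-inject step can be sketched on toy data as follows. Plain Python lists stand in for residual stream activations, layer calibration is not modelled, and the scaling factor `alpha` is a hypothetical parameter that the abstract does not mention.

```python
def mean_pool(activations):
    """Collapse per-token activation vectors (one list per token) into one vector C."""
    n_tokens = len(activations)
    dim = len(activations[0])
    return [sum(tok[j] for tok in activations) / n_tokens for j in range(dim)]


def inject(hidden, context_vec, alpha=1.0):
    """Add the pooled context vector to a hidden state as an additive shift."""
    return [h + alpha * c for h, c in zip(hidden, context_vec)]
```

The shift has a fixed size regardless of how many context tokens were pooled, which is the sense in which the query-time cost no longer depends on the context length.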
Tricking Open-World Object Recognition Models: Uncertainty in Out-of-Distribution Detection
Wout Teillers | Matias Valdenegro-Toro
Object recognition models are well studied on benchmark datasets, typically focusing on performance in retrieving objects that exist in images. However, in real-life scenarios there is no prior knowledge of an object’s existence, and current research fails to assess model performance in these situations. This research aims to shed light on this problem by testing three Open-World models, YOLO-World, Grounding DINO, and GPT-4o, on the LVIS, Open Images, and JUS datasets. We design an experiment where models are confronted with impossible prompts by instructing them to retrieve non-existing objects. This allows us to observe the models’ uncertainty performance. Overall, GPT-4o performed poorest with regard to object recognition and uncertainty estimation, and proved to be highly overconfident. In contrast, YOLO-World and Grounding DINO are slightly underconfident, but their uncertainty calibration is superior to that of GPT-4o. However, all three models occasionally assign high-confidence predictions to non-existing objects, showing that the uncertainty estimation of these models can still be improved when they are confronted with impossible prompts.
Knowledge Localization and Editability in Small Language Models: A Multi-Stage Experimental Study
Pranamya Nilesh Deshpande | Aiswarya Konavoor | Sreedath Panat
The internal mechanisms by which transformer-based language models encode and retrieve factual knowledge remain poorly understood, particularly for small language models (SLMs) operating in the 2–3 billion parameter range. This paper presents a systematic, multi-stage empirical investigation into knowledge localization, compression effects, and knowledge editability across four SLMs—Gemma-2B, Llama-3.2-3B-Instruct, Qwen-2.5-3B-Instruct, and Phi-2—with Meta-Llama-3-8B serving as a large-model baseline. Stage 1 employs causal tracing with activation patching on the CounterFact dataset (~450–500 validated facts per model) to identify the layer or layers most causally responsible for factual recall. Stage 2 compares knowledge density, layer concentration, and redundancy between the 2–3B models and the 8B baseline to quantify the structural effects of model compression on knowledge storage. Stage 3 applies the Rank-One Model Editing (ROME) algorithm at the causally identified layers to assess whether localized knowledge can be reliably overwritten. Our results demonstrate that (i) factual knowledge in SLMs concentrates in upper-to-final transformer layers, with Llama-3B exhibiting extreme concentration in layer 28; (ii) compressed models store knowledge more densely per parameter but with substantially lower redundancy (Llama-3B: 0.047 vs. Llama-8B: 0.468); and (iii) editing success correlates strongly with architectural concentration rather than model size, with Llama-3B achieving 85.7% editing success versus 33% for Gemma-2B. These findings carry direct implications for interpretability, model editing, and the design of future small language model architectures.
One Retrieval to Cover Them All: Co-occurrence-Aware Knowledge Base Reorganization for Session-Level RAG
Shivam Ratnakar | Yixuan Zhu | Cecilia Cheng | Chaya Vijayakumar
RAG systems retrieve documents optimized for answering *one query at a time*. Yet enterprise users arrive with *sessions*, that is, coherent episodes of related questions that span semantically distant parts of the knowledge base. We show that a single retrieval call over a standard knowledge base covers only 41% of a user’s session-level information need. To close this gap, we reorganize the KB offline using co-occurrence-aware clustering and expand retrieval candidates through cluster neighborhoods at query time. On WixQA (6,221 enterprise support articles), our method raises single-query session coverage to 58% (+17% absolute; 95% CI: [14.1, 20.4]), reduces retrieval calls to 70% coverage by 34%, and compresses the KB to 20% of its original size, all consistently across four embedding models and six functional domains. We argue that session-level coverage, not single-query recall, should be the primary metric for enterprise RAG evaluation.
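Session-level coverage and cluster-neighborhood expansion can be illustrated with a small sketch. It assumes coverage is the fraction of a session's needed documents present in the retrieved set, and that each document belongs to exactly one offline cluster; the function names and the `cluster_of` mapping are hypothetical, and the co-occurrence-aware clustering itself is not shown.

```python
def session_coverage(retrieved, needed):
    """Fraction of the session's needed documents found in the retrieved set."""
    needed = set(needed)
    if not needed:
        return 1.0
    return len(needed & set(retrieved)) / len(needed)


def expand_with_clusters(retrieved, cluster_of):
    """Add every document that shares a cluster with a retrieved document."""
    hit_clusters = {cluster_of[d] for d in retrieved if d in cluster_of}
    expanded = list(retrieved)
    for doc, cluster in cluster_of.items():
        if cluster in hit_clusters and doc not in expanded:
            expanded.append(doc)
    return expanded
```

For example, with documents "a" and "b" in one cluster, a single retrieval that returns only "a" also brings in "b" after expansion, raising coverage for a session that needs both.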