Workshop on Trustworthy Natural Language Processing (2026)

Volumes

Proceedings of the 6th Workshop on Trustworthy NLP (TrustNLP 2026) 42 papers

pdf bib abs

Evaluating Cross-Lingual Behavior and Consistency of Multimodal Large Language Models
Hao Wang | Pinzhi Huang | Daisuke Kawahara

The rapid evolution of multimodal large language models (MLLMs) has significantly enhanced their real-world applications. However, achieving consistent performance across languages, especially when integrating cultural knowledge, remains a significant challenge. To better assess this issue, we introduce two new benchmarks: KnowRecall and VisRecall, which evaluate cross-lingual consistency in MLLMs. KnowRecall is a visual question answering benchmark designed to measure factual knowledge consistency in 15 languages, focusing on cultural and historical questions about global landmarks. VisRecall assesses visual memory consistency by asking models to describe landmark appearances in 9 languages without access to images. Experimental results reveal that state-of-the-art MLLMs, including proprietary ones, still struggle to achieve cross-lingual consistency. This underscores the need for more robust approaches that produce truly multilingual and culturally aware models.

pdf bib abs

Quantization methods are widely used to accelerate inference and streamline the deployment of large language models (LLMs). Although quantization’s effects on various LLM capabilities have been extensively studied, one critical area remains underexplored: factual knowledge recall (FKR), the process by which LLMs access stored knowledge. To this end, we conduct comprehensive experiments using three common quantization techniques at distinct bit widths, in conjunction with interpretability-driven analyses on two tasks, knowledge memorization and latent multi-hop reasoning. We show that quantization typically results in information loss within LLMs, consequently diminishing their capacity for FKR. This effect is particularly amplified in smaller models within the same architectural families. However, models quantized at reduced bit precision do not consistently exhibit inferior performance, and occasionally quantization may even enhance model FKR. We find that BitSandBytes demonstrates the highest preservation of the original full-precision model’s FKR. Despite variability across models and methods, quantization causes modest performance degradation and remains an effective compression strategy.

pdf bib

Uncertainty-Aware Proxy Attribute Reasoning for Reliable Media Bias Detection
Chin-Po Chen | Jeng-Lin Li | Ming-Ching Chang

pdf bib abs

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis
Zvi Topol

Large language models (LLMs) are increasingly deployed in a wide range of applications, yet remain vulnerable to adversarial jailbreak attacks that circumvent their safety guardrails. Existing evaluation frameworks typically report binary success/failure metrics, failing to capture the temporal dynamics of how attacks succeed under persistent adversarial pressure. This preliminary work proposes a novel evaluation framework that applies survival analysis techniques to characterize LLM jailbreak vulnerability. Our approach models the “time-to-jailbreak” as a survival outcome, enabling estimation of hazard functions, survival curves, and risk factors associated with successful attacks. We evaluate three LLMs against a subset of prompts from the HarmBench dataset spanning three attack categories. Our analysis reveals that models exhibit distinct vulnerability profiles: while one model demonstrates rapid degradation under iterative attacks, the two other models show consistent moderate vulnerability. Our framework provides actionable insights for model and LLM application developers and establishes survival analysis as a rigorous methodology for LLM safety evaluation.
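As a rough illustration of the "time-to-jailbreak" framing in the abstract above, the sketch below computes a survival curve in plain Python. The choice of the Kaplan-Meier estimator, the function name `kaplan_meier`, and the toy data are all assumptions made for illustration; the abstract does not state which estimators or data layout the paper uses.

```python
from collections import Counter


def kaplan_meier(durations, events):
    """Return [(t, S(t))] for every round t with at least one jailbreak.

    durations: round at which each prompt was jailbroken or observation stopped
    events:    True if the prompt was jailbroken at that round,
               False if it was censored (never jailbroken while observed)
    """
    if len(durations) != len(events):
        raise ValueError("durations and events must have the same length")
    event_counts = Counter(t for t, e in zip(durations, events) if e)
    exit_counts = Counter(durations)  # jailbroken or censored at round t
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(exit_counts):
        d = event_counts.get(t, 0)
        if d:
            # Product-limit step: multiply by the fraction that survives round t.
            survival *= 1.0 - d / at_risk
            curve.append((t, survival))
        at_risk -= exit_counts[t]
    return curve


# Hypothetical data: six prompts, three of them censored.
durations = [1, 2, 2, 3, 5, 5]
events = [True, True, False, True, False, False]
curve = kaplan_meier(durations, events)
for t, s in curve:
    print(f"round {t}: estimated P(not yet jailbroken) = {s:.3f}")
```

Censored prompts still count toward the at-risk set until they leave observation, which is the main thing a binary success rate cannot express.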

pdf bib abs

ClaimCLAIRE: A Trust-Aware Multi-Component Fact-Checking Agent for Open-World Claims
Xinman Liu | Mayank Sharma

Verifying complex real-world claims against diverse and potentially unreliable open-web sources requires balancing evidence comprehensiveness with rigorous source reliability. Current automated fact-checking approaches often fail to address this holistically, losing contextual dependencies and applying trust signals monolithically at the document level. We introduce ClaimCLAIRE, a multi-component fact-checking agent that integrates four key innovations: (1) iterative component-aware decomposition with exhaustiveness validation, (2) holistic evidence gathering using a ReAct agent that maintains cross-component semantic awareness, (3) trust-modulated retrieval that weights evidence by source credibility to mitigate the influence of misinformation, and (4) adaptive gap-filling to address recall bottlenecks in under-supported sub-claims. Evaluated on the AVeriTeC benchmark, ClaimCLAIRE achieves 84.27% accuracy and a macro-F1 of 0.806. Our systematic ablations demonstrate that while decomposition alone can degrade performance, its integration with trust-aware retrieval and adaptive gap-filling yields a pipeline where component-level verdicts, source trust ratings, and deterministic AND-logic synthesis together support transparent, accountable fact verification.

pdf bib abs

ChatbotManip: a Dataset to Facilitate Evaluation and Oversight of Manipulative Chatbot Behaviour
Jack Luigi Henry Contro | Simrat Deol | Martim Brandao | Yulan He

This paper introduces ChatbotManip, a novel dataset for studying manipulation in chatbots. It contains generated conversations between a chatbot and a simulated user, where the chatbot is explicitly asked to showcase manipulation tactics, persuade the user towards some goal, or simply be helpful. We consider a diverse set of chatbot manipulation contexts, from consumer and personal advice to citizen advice and controversial proposition argumentation. Each conversation is annotated by human annotators for both general manipulation and specific manipulation tactics. Our research reveals three key findings. First, Large Language Models (LLMs) can be manipulative when explicitly instructed, with annotators identifying manipulation in approximately 84% of such conversations. Second, even when only instructed to be "persuasive" without explicit manipulation prompts, LLMs frequently default to controversial manipulative strategies, particularly Gaslighting and Fear Enhancement. Third, zero-shot larger models such as Gemini 2.5 Pro have the best performance in detecting manipulation (of the models tested), with more work required to fine-tune smaller open-source models for real-world on-device oversight. Our work provides important insights for AI safety research and highlights the need to address manipulation risks as LLMs are increasingly deployed in consumer-facing applications.

pdf bib abs

Controllable Pareto Trade-off between Fairness and Accuracy
Yongkang Du | Jieyu Zhao | Yijun Yang | Tianyi Zhou

The fairness-accuracy trade-off is a key challenge in NLP tasks. Current work focuses on finding a single optimal solution to balance the two objectives, which is limited considering the diverse solutions on the Pareto front. This work intends to provide controllable trade-offs according to the user’s preference between the two objectives, which is defined as a reference vector. To achieve this goal, we apply multi-objective optimization (MOO), which can find solutions from various regions of the Pareto front. However, it is challenging to precisely control the trade-off due to the stochasticity of the training process and the high-dimensional gradient vectors. Thus, we propose Controllable Pareto Trade-off (CPT) that can effectively train models to perform different trade-offs according to users’ preferences. CPT 1) stabilizes the fairness update with a moving average of stochastic gradients to determine the update direction, and 2) prunes the gradients by only keeping the gradients of the critical parameters. We evaluate CPT on hate speech detection and occupation classification tasks. Experiments show that CPT can achieve a higher-quality set of solutions on the Pareto front than the baseline methods. It also exhibits better controllability and can precisely follow the human-defined reference vectors.
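The two stabilization steps named in the abstract above (a moving average of stochastic gradients, then pruning to critical parameters) can be pictured with the toy sketch below. It is only an illustration under stated assumptions: the exponential form of the moving average, the use of gradient magnitude to decide which parameters are "critical", and the names `ema_update` and `prune_top_k` are all hypothetical and not taken from the paper.

```python
def ema_update(avg, grad, beta=0.9):
    """Exponential moving average of gradients (assumed form of the moving average)."""
    return [beta * a + (1.0 - beta) * g for a, g in zip(avg, grad)]


def prune_top_k(grad, k):
    """Keep the k largest-magnitude entries and zero the rest.

    Using magnitude as the criticality criterion is an assumption of this sketch.
    """
    order = sorted(range(len(grad)), key=lambda i: abs(grad[i]), reverse=True)
    keep = set(order[:k])
    return [g if i in keep else 0.0 for i, g in enumerate(grad)]


# Hypothetical noisy fairness gradients over four parameters.
noisy_grads = [
    [1.0, -0.2, 0.1, 4.0],
    [1.2, 0.3, -0.1, -4.0],
    [0.8, -0.1, 0.0, 4.0],
]

avg = [0.0, 0.0, 0.0, 0.0]
for grad in noisy_grads:
    avg = ema_update(avg, grad, beta=0.5)

direction = prune_top_k(avg, k=2)
print(direction)
```

Averaging damps the sign flips in the last coordinate, and pruning leaves an update direction that touches only two of the four parameters.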

pdf bib abs

As the influence of LLMs expands, it is imperative to gain insight into their decisions. One way to do that is to develop probes that detect the presence or absence of a broad set of high-level abstract concepts within the embeddings computed in an LLM - which is what we might say a model is "thinking" about. Such probes should be low-cost and easily applicable to any LLM, so that monitoring for many concepts is possible during normal operation. In this paper, we take the first steps towards developing the capability of creating many such probes by defining and executing examples of the key tasks needed: first, the careful delineation of a high-level abstract concept through the creation of a dataset with the concept both present and then absent. Then, the training and testing of a set of linear probes to detect the concept on any layer of an LLM, including an exploration of the complexity of the probe needed. Finally, we show that such probes can track concepts across larger contexts. This is done with four separate concepts and three different LLMs. When this process is scaled to many more concepts, it will create the ability to monitor new models.

pdf bib abs

Hair-Trigger Alignment: Black-Box Evaluation Cannot Guarantee Post-Update Alignment
Yavuz Faruk Bakman | Duygu Nur Yaldiz | Salman Avestimehr | Sai Praneeth Karimireddy

Large Language Models (LLMs) are rarely static and are frequently updated in practice. A growing body of alignment research has shown that models initially deemed “aligned” can exhibit misaligned behavior after fine-tuning, such as forgetting jailbreak safety features or re-surfacing knowledge that was intended to be forgotten. These works typically assume that the initial model is aligned based on static black-box evaluation, i.e., the absence of undesired responses to a fixed set of queries. In contrast, we formalize model alignment in both the static and post-update settings and uncover a fundamental limitation of black-box evaluation. We theoretically show that, due to overparameterization, static alignment provides no guarantee of post-update alignment for any update dataset. Moreover, we prove that static black-box probing cannot distinguish a model that is genuinely post-update robust from one that conceals an arbitrary amount of adversarial behavior, which can be activated by even a single benign gradient update. We further validate these findings empirically in LLMs across three core alignment domains: privacy, jailbreak safety, and behavioral honesty. We demonstrate the existence of LLMs that pass all standard black-box alignment tests, yet become severely misaligned after a single benign update. Finally, we show that the capacity to hide such latent adversarial behavior increases with model scale, confirming our theoretical prediction that post-update misalignment grows with the number of parameters. Together, our results highlight the inadequacy of static evaluation protocols and emphasize the urgent need for post-update–robust alignment evaluation.

pdf bib abs

People often rely on large language models (LLMs) in situations where they are ill-suited. This miscalibration is understandable: seeing LLMs compose poetry and answer complex questions can lead users to assume, incorrectly, that they will also handle simple tasks, such as basic arithmetic, without error. Prior work has attempted to address this issue by clustering instance embeddings to identify regions where an LLM is likely to fail, then automatically describing the patterns within those regions. These inferred “failure patterns” are taught to users to reduce overreliance. Yet, this approach has not been fully successful. In this paper, we investigate why. We first examine whether the negative results stem from an absence of meaningful failure patterns. Using two datasets, we group instances by their meta-labels and evaluate LLM performance within each group. We then define criteria to identify groups that are both sufficiently large and exhibit high error rates. This process reveals multiple meta-label groups that meet these criteria, indicating that actionable failure patterns do, in fact, exist. Next, we test whether prompting- and embedding-based methods can reliably surface these known failure patterns. This step is critical: if such patterns cannot be surfaced automatically, they cannot be communicated to users. We observe mixed performance across methods, which may explain the limited success of prior approaches. Finally, we revisit how teaching effectiveness is measured. We propose evaluating whether users can apply learned failure patterns to anticipate when an LLM is likely to err. A user study shows that instruction based on this metric yields measurable improvements, unlike standard human–AI team accuracy metrics. Overall, our findings suggest that teaching failure patterns can be an effective way to mitigate overreliance, but its success depends on improved automated methods for discovering these patterns and on evaluation metrics like ours.

pdf bib abs

Linear Probes Detect Task Format, Not Reasoning Mode in Language Model Hidden States
Subramanyam Sahoo | Vinija Jain | Aman Chadha | Divya Chaudhary

Linear probing of large language model (LLM) hidden states is widely used to claim that models learn distinct representations for different reasoning types. We test this by probing Qwen3-14B on three benchmarks spanning the classical trichotomy: LogiQA 2.0 (deductive), ARC-Challenge (inductive), and 𝛼NLI (abductive). At layer 32 of 40, linear probes achieve 100% cross-validated accuracy with well-separated geometry (intrinsic dimensionalities: 20.6, 28.5, 33.6; convex hull contamination ≤1.5%). However, this separation is entirely driven by format confounds. Residualizing source identity, option count, and response length reduces accuracy to chance. Trace-anchor similarity indicates largely shared reasoning across tasks (42.5% agreement vs. 33.3% chance), and causal steering with random controls (n=20) shows no functional link between geometry and reasoning mode (p=0.286). Thus, high probe accuracy reflects task format rather than computational structure, motivating routine format deconfounding in mechanistic interpretability.

pdf bib abs

KoLegalQA: A Korean Legal QA Dataset for Trustworthy and Explanation-Grounded Legal AI
Yongtae Lee | Surin Lee | Sumin Kim | S M Wahidur Rahman | Heung-No Lee

Legal QA systems may benefit from training data that is expert-verified and associated with statutory provisions, as fluent generation alone cannot guarantee legally relevant and citation-supported outputs. However, existing Korean legal datasets provide limited support for legal QA and statute-associated response generation. To address this gap, we introduce KoLegalQA, a large-scale Korean legal question–answer corpus designed for research on legal QA and explanation-oriented legal response generation in real-world consultation scenarios. The dataset comprises 19k consultations collected from government-operated services, with all responses originally authored or verified by licensed legal professionals. Unlike prior resources, KoLegalQA provides explicit statutory references and clause-level summaries, enabling research on citation-associated and explanation-oriented legal response generation. We benchmark six Korean-capable LLMs using both automated evaluation (G-Eval) and human assessment across multiple criteria, including legal correctness, reasoning quality, and citation relevance. Experimental results show that fine-tuning on KoLegalQA generally improves legal reasoning validity and statute-associated response generation across most evaluated models. We present this resource as a practical benchmark dataset for Korean legal NLP research. Dataset splits, preprocessing scripts, and evaluation code will be publicly released to support reproducible research.

pdf bib abs

Authorization-First Retrieval: Enforcing Least Privilege in Multi-Agent RAG Systems
Rohith Namboothiri

Retrieval-augmented generation systems serving multiple users under role-based access control face a trustworthiness gap: semantic retrieval operates on embedding similarity rather than authorization predicates and can introduce unauthorized content into a model’s context window before any filter intervenes. We formalize this as a pipeline ordering problem and introduce Authorization-First Retrieval (AFR), an architectural invariant requiring that authorization constrain the retrieval candidate set before any learned component consumes retrieved content. We reduce authorization correctness to the classical noninterference property and prove AFR is necessary whenever the processing model violates noninterference—a condition our experiments confirm empirically. Evaluation on a controlled corpus of 247 chunks across 232 documents with 431 base queries spanning 12 enterprise roles and 9 domains (584 total queries including negation exploitation and parametric probes) shows that retrieve-then-filter pipelines expose unauthorized context in 86.1% of queries, while AFR eliminates structural leaks by construction. Cross-model experiments with Gemini 2.0 Flash and GPT-4o-mini reveal that structural exposure is an architectural property independent of the underlying model, whereas behavioral defenses fail at model-dependent rates, producing answer leakage of 41.3% and 29.5% respectively under retrieve-then-filter. A negation exploitation study demonstrates consistent disclosure vulnerabilities across framing types, while a metadata-tag freshness ablation shows that conditional authorization mechanisms degrade under realistic policy staleness. Stress tests across retrieval depths and chunking granularities confirm AFR’s robustness. Our results demonstrate that behavioral guardrails and metadata tagging cannot reliably enforce least privilege in RAG pipelines, while authorization-first architectures provide a verifiable and model-independent security guarantee.
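The pipeline-ordering point in the abstract above can be sketched in a few lines: filter the candidate set by authorization first, then rank by similarity. This is a minimal sketch under stated assumptions, not the paper's implementation; the `Chunk` type, the role names, the two-dimensional embeddings, and the function names are all hypothetical.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Chunk:
    text: str
    embedding: tuple          # toy 2-d embedding, hypothetical
    allowed_roles: frozenset  # roles permitted to read this chunk


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank(chunks, query_embedding, k):
    """Return the k chunks most similar to the query."""
    return sorted(chunks, key=lambda c: cosine(c.embedding, query_embedding), reverse=True)[:k]


def authorization_first(chunks, query_embedding, role, k):
    """Constrain the candidate set by role BEFORE any similarity ranking."""
    permitted = [c for c in chunks if role in c.allowed_roles]
    return rank(permitted, query_embedding, k)


def retrieve_then_filter(chunks, query_embedding, role, k):
    """Rank over everything, then filter; returns (visible, exposed) chunks."""
    retrieved = rank(chunks, query_embedding, k)
    visible = [c for c in retrieved if role in c.allowed_roles]
    exposed = [c for c in retrieved if role not in c.allowed_roles]
    return visible, exposed


corpus = [
    Chunk("Salary bands for 2026", (1.0, 0.0), frozenset({"hr"})),
    Chunk("Public holiday calendar", (0.8, 0.6), frozenset({"hr", "engineer"})),
    Chunk("Build server runbook", (0.0, 1.0), frozenset({"engineer"})),
]
query = (1.0, 0.1)

afr_results = authorization_first(corpus, query, role="engineer", k=2)
visible, exposed = retrieve_then_filter(corpus, query, role="engineer", k=2)
print([c.text for c in afr_results])
print([c.text for c in exposed])
```

In the retrieve-then-filter variant the unauthorized chunk has already been retrieved before the filter runs, and it also takes up one of the k slots, so the authorized user sees fewer results.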

pdf bib abs

PII Jailbreaking in LLMs via Activation Steering Reveals Personal Information Leakage
Krishna Kanth Nakka | Xue Jiang | Dmitrii Usynin | Xuebing Zhou

This paper investigates privacy jailbreaking in large language models (LLMs) via steering, examining whether targeted manipulation of internal activations can circumvent the alignment mechanisms and alter model behaviour on privacy-sensitive queries, such as those concerning sexual orientation of public figures. Our approach begins by identifying attention heads predictive of refusal behaviour for a given private attribute, using lightweight linear probes trained on labels provided by a privacy evaluator. We then apply steering to a carefully selected subset of these heads, guided by the probe outputs, to induce positive responses from the model. Empirical results demonstrate that these steered responses frequently reveal the target attribute, as well as additional personal information about the data subject, including life events, relationships, and biographical details. Evaluations across three LLMs show that steering achieves disclosure rates of at least 80% with several responses containing real personal information. This controlled study highlights a concrete privacy risk: personal information memorised during pre-training can be extracted through targeted activation-level interventions, without reliance on computationally intensive adversarial prompting techniques.

pdf bib abs

Coercion Suppression Increases Preference Hallucinations via a Deceptive Bypass in K-Level Negotiation Agents
Jihye Kim

K-Level reasoning—recursive modeling of opponent beliefs—improves LLM negotiation utility but frequently elicits coercive and toxic behaviors that undermine real-world deployability. We propose an Observer–Planner–Actor architecture with a Modular Appraisal Gate that (i) dynamically estimates the opponent’s cognitive level and (ii) filters hostile drafts via an LLM-as-a-judge. In randomized interventions on the CaSiNo dataset, our gated agent eliminates toxicity (0%) and reduces coercion from 35% to 6% compared to a strong static-K baseline, albeit with an alignment tax in utility. However, the gate does not reduce preference hallucinations—strategic misrepresentation of the agent’s own priorities. K-Level reasoning incidentally suppresses this behavior (from 35% in a vanilla baseline to 22%), but gating coercion releases the suppression, returning hallucination to vanilla-baseline levels (33–37%). We term this pattern a deceptive bypass: output-level filters address the form of hostility but leave surface-compliant manipulation channels intact, demonstrating that they alone are insufficient to align utility-driven strategic agents.

pdf bib abs

Purdah and Patriarchy: Evaluating and Mitigating South Asian Biases in Open-Ended Multilingual LLM Generations
Mamnuya Rinki | Chahat Raj | Anjishnu Mukherjee | Ziwei Zhu

Evaluations of Large Language Models (LLMs) often overlook intersectional and culturally specific biases, particularly in underrepresented multilingual regions like South Asia. This work addresses these gaps by conducting a multilingual and intersectional analysis of LLM outputs across 10 Indo-Aryan and Dravidian languages, identifying how cultural stigmas influenced by purdah and patriarchy are reinforced in generative tasks. We construct a culturally grounded bias lexicon capturing previously unexplored intersectional dimensions including gender, religion, marital status, and number of children. We use our lexicon to quantify intersectional bias and the effectiveness of self-debiasing in open-ended generations (e.g., storytelling, hobbies, and to-do lists), where bias manifests subtly and remains largely unexamined in multilingual contexts. Finally, we evaluate two self-debiasing strategies (simple and complex prompts) to measure their effectiveness in reducing culturally specific bias in Indo-Aryan and Dravidian languages. Our approach offers a nuanced lens into cultural bias by introducing a novel bias lexicon and evaluation framework that extends beyond Eurocentric or small-scale multilingual settings.

pdf bib abs

Ghost Context: Measuring Cross-Context Interference in Long-Context Language Models
Rohith Namboothiri

Long-context language models assemble prompts from heterogeneous sources, and deployed systems implicitly trust the model to use the correct span of context. We show that this assumption is often violated: irrelevant spans can silently shape outputs, producing errors that are neither fabrication nor omission but misattributed grounding, that is, claims supported by the wrong part of the input context. Unlike intrinsic hallucination (contradicting the source) or extrinsic hallucination (introducing unsupported claims), misattributed grounding uses real evidence from an incorrect span, making it invisible to standard source-blind faithfulness metrics. We formalize this phenomenon as Ghost Context and introduce a causal mask-and-rerun attribution protocol to measure it. Across a 272-case corpus spanning multiple interference scenarios, we evaluate three widely used models and report two complementary signals: strict Ghost Context Rate (GCR), which captures verifiable factual misattribution, and open-ended influence, which captures broader contextual shaping effects. Under realistic contextual conflict, strict GCR spikes substantially: temporal contradictions trigger misattributed grounding in 38.3% of cases. Across all scenarios, open-ended distractor influence occurs in 20.4% of evaluations. Importantly, Ghost Context is not only detectable but also remediable. Masking the single highest-attributed distractor span resolves 95.5% of detected errors (Fix@1) with 2.4% collateral damage and zero false positives on negative controls. We also introduce Contextual Invariance Rate (CIR) as a system-level robustness metric measuring invariance to irrelevant context. Our findings show that contextual conflict, which is common in retrieval-augmented generation and agent systems, can systematically degrade reliability, but also reveal that Ghost Context errors are causally localizable and cheaply correctable. We release the evaluation corpus, detection pipeline, and experimental results to support further research on trustworthy long-context language model evaluation.

pdf bib abs

Understanding the Effects of Safety Unalignment on Reasoning- and Instruction-Tuned Large Language Models
John Timothy Halloran

Alignment has become a critical step towards enabling large language model (LLM) safety guardrails which ensure models provide helpful and harmless responses, while refusing malicious and harmful requests. However, two separate lines of recent work, unalignment via fine-tuning (i.e., jailbreak-tuning, JT) and weight orthogonalization (WO), have shown that LLM guardrails may be circumvented, such that LLMs obey harmful requests which they would normally refuse. Despite the safety implications of such unalignment procedures, a comprehensive analysis directly contrasting these methods is currently lacking, as is a study of these methods’ impact on malicious LLM capabilities and reasoning models. Using both JT and WO, we study the impact of unaligning six popular LLMs (three reasoning LLMs of various sizes and their instruction-tuned analogues) across harmful safety tasks. Compared to JT, we show that WO produces models which are more effective at adversarially attacking LLMs; in particular, WO reasoning LLMs excel at such adversarial attacks. Interestingly, while increasing adversarial attack efficacy, we show that WO does not drastically increase hallucination rates. This is in stark contrast to JT, which may more than double the hallucination rate of both reasoning and instruction-tuned models alike. Finally, we show that off-the-shelf supervised fine-tuning effectively limits the adversarial attack abilities enabled by WO, without drastically increasing hallucination rates.

pdf bib abs

Task-oriented dialogue systems—handling transactions, reservations, and service requests—require predictable behavior, yet the moderately-sized LLMs needed for practical latency are prone to hallucination and format errors that cascade into incorrect actions (e.g., a hotel booked for the wrong date). We propose ReacTOD, a bounded neuro-symbolic architecture that reformulates NLU as discrete tool calls within a self-correcting ReAct loop governed by deterministic validation. A bounded ReAct loop enables iterative self-correction, improving accuracy by up to 9.3 percentage points over single-pass inference on MultiWOZ. A symbolic validator enforces action compliance, schema conformance, and coreference consistency on every dialogue state update, achieving a 93.1% self-correction rate on intercepted errors and producing structured execution traces. Incremental state prediction and on-demand history retrieval keep prompts compact, empirically improving instruction adherence in parameter-constrained models. On MultiWOZ 2.1, ReacTOD achieves a new zero-shot state-of-the-art: gpt-oss-20B reaches 52.71% joint goal accuracy, surpassing the previous best by 14 percentage points, while Qwen3-8B achieves 47.34% with only 8B parameters. On the Schema-Guided Dialogue (SGD) benchmark, ReacTOD with Claude-Opus-4.6 achieves 80.68% JGA under fully end-to-end evaluation with predicted domains, and Qwen3-32B reaches 64.09%—demonstrating cross-benchmark generalization without task-specific training data.

pdf bib abs

Geometric Deviation as an Unsupervised Pre-Generation Reliability Signal: Probing LLM Representations for Answerability
Yucheng Du

A reliable language model should be able to signal, prior to generation, when a query falls outside its knowledge. We investigate whether representation geometry can provide such a pre-generation signal by measuring the deviation of hidden states from an answerable reference set, requiring no labeled failure data and no access to model outputs. Across three instruction-tuned models (Llama 3.1-8B, Qwen 2.5-7B, and Mistral-7B-Instruct) and three prompt forms (Math, Fact, Code), we find that geometry primarily encodes task form. Within mathematical prompts, unanswerable inputs consistently deviate from the answerable centroid, yielding strong separation (ROC-AUC 0.78-0.84). This single-pass pre-generation signal outperforms a simple refusal baseline and compares favorably to self-consistency. It also captures cases where models do not explicitly refuse. In contrast, no reliable geometric signal emerges for factual prompts, indicating that the effect is form-conditional rather than universal. Code prompts show large effect sizes with higher variance, suggesting partial generalization beyond mathematical form. A layer-wise analysis reveals that the signal arises in early layers and gradually attenuates toward the output. These results suggest that answerability-related geometry is established before the final stages of generation. Together, these findings indicate that geometric deviation can serve as a lightweight pre-generation signal that is reliable in structured domains with formal answerability constraints, with clear boundaries on where it generalizes.
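The idea of scoring a query by how far its hidden state sits from an answerable reference set can be sketched as follows. This is a toy illustration only: the Euclidean distance to the centroid, the max-over-reference threshold, and the two-dimensional vectors are assumptions of this sketch, since the abstract does not specify the exact deviation measure or decision rule.

```python
import math


def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def deviation(vector, center):
    """Euclidean distance from the reference centroid (assumed deviation measure)."""
    return math.sqrt(sum((v - c) ** 2 for v, c in zip(vector, center)))


# Hypothetical hidden states of prompts known to be answerable.
reference = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]
center = centroid(reference)

# Assumed decision rule: flag anything farther out than every reference point.
threshold = max(deviation(v, center) for v in reference)

near = deviation([0.6, 0.4], center)  # query that resembles the reference set
far = deviation([3.0, 3.0], center)   # query that deviates strongly

print(near <= threshold, far > threshold)
```

No labeled failures are needed to build the reference centroid, which matches the unsupervised setting the abstract describes; only the threshold rule here is invented.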

pdf bib abs

Sparse autoencoders (SAEs) enable feature-level mechanistic interpretability and activation steering in large language models (LLMs), but SAE-based language control remains unreliable in multilingual settings: most SAEs are trained on English-only data, and steering layers are chosen heuristically. We address these limitations by advancing a principled, mechanistic account of multilingual language steering with SAEs. First, we show that training SAEs on multilingual data consistently strengthens cross-lingual representations and yields more reliable, quality-preserving language control across layers and model families. Second, we introduce an a priori steering layer-selection rule based on the intersection of multilingual alignment and language separability, which predicts effective intervention depths without exhaustive layerwise search. We evaluate our approach on LLaMA-3.1-8B and Gemma-2-9B across machine translation and cross-lingual summarization (CrossSumm), using SpBLEU, ROUGE-L, COMET, and LaSE. Our results show that multilingual SAEs combined with intersection-selected layers stabilize the trade-off between language identification accuracy and generation quality, providing a principled, predictive, representation-level account of multilingual SAE steering.

pdf bib abs

Deactivating Refusal Triggers: Understanding and Mitigating Overrefusal in Safety Alignment
Zhiyu Xue | Zimo Qi | Guangliang Liu | Bocheng Chen | Ramtin Pedarsani

Safety alignment aims to ensure that large language models (LLMs) refuse harmful requests by post-training on harmful queries paired with refusal answers. Although safety alignment is widely adopted in industry, the overrefusal problem, where aligned LLMs also reject benign queries after safety alignment post-training, remains insufficiently studied. Such an issue degrades the usability of safety alignment in real-world applications. In this paper, we examine how overrefusal arises under safety alignment, and propose a mitigation strategy inspired by our findings. We define refusal triggers as linguistic cues in the training data that elicit refusal responses. Safety alignment encourages LLMs to associate refusal triggers within a training sample with refusal responses, leading aligned LLMs to refuse harmful queries. However, the refusal triggers include not only harmful linguistic cues but also non-harmful cues, therefore causing overrefusal to benign queries. Building on this mechanistic analysis, we propose a method that explicitly considers refusal triggers in the safety alignment fine-tuning. Empirical results demonstrate that our approach achieves a more favorable trade-off between defense against jailbreak attacks and responsiveness to benign queries, outperforming prior methods. Warning: this paper contains harmful and biased sentences.

A Systematic Taxonomy of Failure Modes in Retrieval-Augmented Generation Systems
Anupama Garani

Retrieval-Augmented Generation (RAG) systems fail in diverse, poorly characterized ways that single-stage evaluation metrics cannot detect. We present a systematic taxonomy of 33 failure modes across 7 pipeline stages — ingestion, representation, retrieval, generation, evaluation, deployment, and agentic orchestration — constructed through a structured literature review of 48 sources spanning peer-reviewed publications and high-impact preprints. For each mode, we provide a formal definition, observable manifestation, and three-level evidence grading (Strong/Moderate/Limited). Our analysis reveals a critical asymmetry in research attention: retrieval and generation failures are comparatively well-studied, while representation, evaluation, and agentic orchestration failures remain under-investigated despite frequent occurrence in production. We identify 12 failure modes with no dedicated peer-reviewed empirical evidence — all 8 agentic modes among them — constituting an evidence desert in the fastest-growing RAG deployment paradigm. Compared to prior work enumerating 7 failure points (Barnett et al., 2024) or 16 error types within partial pipeline runs (Cresswell et al., 2025), our taxonomy uniquely spans the full pipeline, including agentic orchestration with explicit evidence-level grading.

Improving the Faithfulness of LLM-based Abstractive Summarization with Span-level Unlikelihood Training
Sicong Huang | Qianqi Yan | Shengze Wang | Ian Lane

Abstractive summarization using large language models (LLMs) has become an essential tool for condensing information. Despite their ability to generate fluent summaries, these models often produce texts that are unfaithful to the original documents, manifested through hallucinations of specific words, phrases, or concepts. Current approaches to mitigating unfaithfulness typically involve post-processing corrections or contrastive learning from synthetically generated negative samples, which do not fully address the spectrum of errors that can arise in LLM-generated summaries. In this paper, we introduce a novel approach to fine-tune LLMs specifically to reduce the occurrence of unfaithful spans of text in generated summaries. We first annotate span-level hallucinations in LLM-generated summaries using automatic labeling with GPT-4. We then fine-tune the LLM using both summaries with no hallucinations and spans of hallucinated text to improve the faithfulness of the model. This paper introduces a dataset labeled to distinguish between faithful and unfaithful content and compares the performance of three techniques: gradient ascent, unlikelihood training, and task vector negation. Our experimental results show that unlikelihood training can effectively use span-level annotations to enhance summary faithfulness. It reduces the share of summaries with hallucinations from 31% to 13% on the CNN summarization dataset (a relative reduction of 58%) and from 33% to 20% on the SAMSum dataset (a relative reduction of 39%).
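As a rough illustration of how span-level annotations could enter an unlikelihood objective, the sketch below applies the usual likelihood term to tokens outside annotated spans and the standard unlikelihood term, -log(1 - p), to tokens inside them. The function name, the `alpha` weight, and the per-token averaging are assumptions made for this example; the abstract does not specify the authors' exact formulation.

```python
import math


def span_unlikelihood_loss(token_probs, hallucinated_mask, alpha=1.0):
    """Average per-token loss for one summary (illustrative sketch only).

    token_probs: probability the model assigns to each observed token.
    hallucinated_mask: True where the token lies inside an annotated
        unfaithful span, False elsewhere.
    alpha: hypothetical weight on the unlikelihood term.
    """
    if not token_probs or len(token_probs) != len(hallucinated_mask):
        raise ValueError("inputs must be non-empty and of equal length")
    eps = 1e-12  # guards against log(0)
    total = 0.0
    for p, is_hallucinated in zip(token_probs, hallucinated_mask):
        if is_hallucinated:
            # Unlikelihood term: penalize probability mass on the token.
            total += -alpha * math.log(max(1.0 - p, eps))
        else:
            # Standard likelihood term: reward probability mass on the token.
            total += -math.log(max(p, eps))
    return total / len(token_probs)
```

Under this sketch, the loss grows as the model assigns more probability to a token marked as hallucinated, while faithful tokens are trained as usual.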

Context Misleads LLMs: The Role of Context Filtering in Maintaining Safe Alignment of LLMs
Jinhwa Kim | Ian Harris

While Large Language Models (LLMs) have shown significant advancements in performance, various jailbreak attacks have posed growing safety and ethical risks. Malicious users often exploit adversarial context to deceive LLMs, prompting them to generate responses to harmful queries. In this study, we propose a new defense mechanism called Context Filtering, an input pre-processing method designed to filter out untrustworthy and unreliable context while identifying the primary prompts containing the real user intent to uncover concealed malicious intent. Given that enhancing the safety of LLMs often compromises their helpfulness, potentially affecting the experience of benign users, our method aims to improve the safety of the LLMs while preserving their original performance. We evaluate the effectiveness of our model in defending against jailbreak attacks through comparative analysis, comparing our approach with state-of-the-art defense mechanisms against six different attacks and assessing the helpfulness of LLMs under these defenses. Our model demonstrates its ability to reduce the Attack Success Rates of jailbreak attacks by up to 92% while maintaining the original LLMs’ performance, achieving state-of-the-art Safety and Helpfulness balance. Notably, Context Filtering is a plug-and-play method that can be applied to all LLMs, including both white-box and black-box models, to enhance their safety without requiring any fine-tuning of the models themselves.

Lexical Familiarity Predicts Processing Depth for Nonliteral Language in Large Language Models
Lang-Ching Yeh | Yu-Chieh Wang | Shu-Kai Hsieh

This paper investigates how large language models internally process nonliteral language. Analyzing five categories spanning slang, metaphor, and idioms across all 48 layers of Gemma-3-12B-IT with Gemma Scope 2 sparse autoencoders, we find a lexical familiarity gradient: processing depth depends on available prior lexical knowledge, not figurative type. Idioms diverge at L1 as entrenched units; expressions built from familiar words (metaphors, semantic-shift and constructional slang) converge at L7–9; neologisms peak at L41, activating 3× more unique features. Paraphrase residual analysis confirms strong signals only at the gradient endpoints, yielding a three-tier hierarchy of entrenched retrieval, known-word reanalysis, and novel-word construction. Crucially, this peak-layer structure replicates in base models (Gemma-PT, Qwen-Base), demonstrating that the gradient is a robust property of pretrained representations rather than an alignment artifact. We additionally identify an activation density confound in SAE feature counts that produces spurious cross-condition convergence. Overall, processing depth is better predicted by lexical familiarity than by figurative type, with implications for robustness to non-standard language and for SAE-based interpretability.

Did You Forget What I Asked? Prospective Memory Failures in Large Language Models
Avni Mittal

Large language models often fail to satisfy formatting instructions when they must simultaneously perform demanding tasks. We study this behavior through a prospective memory-inspired lens from cognitive psychology, using a controlled paradigm that combines verifiable formatting constraints with benchmark tasks of increasing complexity. Across three model families and over 8,000 prompts, compliance drops by 2–21% under concurrent task load. Vulnerability is highly type-dependent: terminal constraints (requiring action at the response boundary) degrade most, with drops up to 50%, while avoidance constraints remain comparatively robust. A salience-enhanced format (explicit instruction framing plus a trailing reminder) recovers much of the lost compliance, restoring performance to 90–100% in many settings. Interference is bidirectional: formatting constraints can also reduce task accuracy, with one model’s GSM8K accuracy dropping from 93% to 27%. In additional stacking experiments, joint compliance declines sharply as constraints accumulate. All results use deterministic programmatic checkers, with no LLM-as-judge component, on publicly available datasets.
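To make the idea of a deterministic programmatic checker concrete, here is a minimal sketch with one terminal constraint (the response must end with a marker) and one avoidance constraint (a word must not appear). The function names, the `[END]` marker, and the banned word are hypothetical and are not taken from the paper.

```python
import re


def ends_with_marker(response, marker):
    """Terminal constraint: the response must end with `marker`
    (trailing whitespace is ignored)."""
    return response.rstrip().endswith(marker)


def avoids_word(response, banned):
    """Avoidance constraint: `banned` must not occur as a whole word
    (case-insensitive)."""
    pattern = r"\b" + re.escape(banned) + r"\b"
    return re.search(pattern, response, flags=re.IGNORECASE) is None


def joint_compliance(response, checks):
    """Stacked constraints: compliant only if every check passes."""
    return all(check(response) for check in checks)
```

A stacked evaluation could then call `joint_compliance(text, [lambda r: ends_with_marker(r, "[END]"), lambda r: avoids_word(r, "very")])`, which fails as soon as any single constraint is violated.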

Don’t Want Your LLM to Recommend Nuclear Strike? Try Asking It in Japanese
Rian Touchent

Large language models are increasingly used in strategic and advisory contexts, yet their safety alignment is typically evaluated in English only. We test nine models from six providers and ask whether the language of a prompt can change a model’s decision in a high-stakes scenario. We use single-turn game-theoretic vignettes in which a model advises a nuclear-armed nation on whether to strike a defenseless opponent. The prompt is intentionally amoral and strategically identical across languages. We find that Japanese prompts reduce launch rates in the Claude model family: Claude Sonnet 4.6 drops from 40% to 0% in scenarios where the strike is unnecessary and from 93% to 17% in contested scenarios, with minimal effect when the strike is strategically rational. The effect extends to Gemini Pro 3.1 (53% to 13%). A cross-language experiment isolates the mechanism: when instructed to reason in Japanese in an English prompt, launch rates drop from 93% to 37%. It is the language the model is asked to reason in, not the language of the input, that drives the effect. When reasoning in Japanese, models spontaneously generate moral vocabulary ("moral cost", "millions of lives") that is entirely absent from the prompt. Five other models show no language effect, but they launch in nearly every condition regardless of language. The effect requires a model that already hesitates in English. These results show that LLM safety behavior is language-dependent, and that evaluating in English alone can miss both risks and safeguards encoded in other languages.

Toward Dialect-Aware Safety Evaluation for Arabic Large Language Models
Wajdi Zaghouani

Large language models (LLMs) are increasingly deployed with safety alignment mechanisms designed to prevent harmful outputs including hate speech, harassment, and unsafe instructions. However, existing safety evaluation frameworks remain heavily centered on English and standardized language varieties, creating a critical gap for languages characterized by extensive dialectal variation. Arabic provides a particularly important case: everyday communication across the Arab world occurs predominantly in regional dialects rather than Modern Standard Arabic (MSA), yet these dialects are systematically underrepresented in alignment training corpora and safety benchmarks. In this paper we introduce the Dialect Safety Gap, defined as systematic variation in LLM safety behavior across dialects of the same language. We argue that this phenomenon arises from the interaction between alignment training procedures and linguistic variation: safety alignment implicitly encodes normative patterns present in training datasets, and when dialectal forms diverge from those patterns, safety behavior degrades through lexical, morphological, and pragmatic mechanisms. We propose a formal framework grounded in algorithmic fairness that links dialect variation to alignment pipeline design, introduce both a binary DSG Score and a magnitude-aware Pairwise Dialect Inconsistency metric, and propose the Dialect-Aware Safety Evaluation Protocol (DASEP) as a practical evaluation framework. We demonstrate the feasibility of dialect-aware evaluation through a controlled, human-annotated prompt-probe experiment across five Arabic variety groups, revealing a structured gradient of safety degradation that correlates with linguistic distance from MSA.

Single-Layer Activation Edits Easily Corrupt Factual Recall but Rarely Repair It
Zacharie Bugaud

Single-layer activation edits easily corrupt a language model’s correct factual answers but rarely repair its errors. On a curated factual-recall benchmark, corruption flips 70–100% of correct answers across three models, while twelve blind methods (no access to the correct answer) fix at most 6% within every evaluation pool. Per-instance gradient optimization ostensibly fixes 39%, but norm-constrained analysis reveals a magnitude artifact: at oracle-matched norms the fix rate drops to random, directions are nearly orthogonal to oracle directions (cos = -0.04), and collateral damage makes the net effect negative. An oracle ablation controlling for budget, target identity, and directional noise points to a direction-selection bottleneck: repair requires a precise, per-question direction that blind methods cannot locate. Target-informed methods partially succeed but none generalizes to unseen distributions.

Truth or Dare: Analyzing LLM Susceptibility to External Evidence of Varying Factuality
Han-Yu Su | Kuan-Yu Chu | Yung-Hui Li | Lun-Wei Ku

Modern Large Language Models (LLMs) often rely on Retrieval-Augmented Generation (RAG) to access up-to-date information; however, retrieved corpora may contain misleading, outdated, or incorrect content, raising concerns about how such evidence affects model reliability. In this work, we investigate the susceptibility of LLMs to false external evidence. Existing studies have shown that poisoned external corpora can mislead LLM responses, yet the effects of different evidence properties remain understudied. To bridge this gap, we design comprehensive experiments along three dimensions: the style of evidence, the quantity of evidence, and the semantic similarity between external messages and the model’s internal belief. We find that instructive-style evidence causes the most severe performance degradation. We also observe a steady decline in model response quality as the amount of false evidence accumulates. Finally, we show that LLMs are more susceptible to factually incorrect evidence when it is semantically close to the model’s parametric knowledge.

The Halo Effect and Language Takeover: Spatiotemporal Attention Decay Explains Vision-Language Model Failures in Simple Visual Counting
Haochen Zhao | Sujian Li

Despite their remarkable capabilities in complex multimodal reasoning, Vision Language Models (VLMs) exhibit a perplexing inability to perform elementary visual counting tasks reliably. Existing hypotheses, often centering on input resolution or patch tokenization, fail to fully explain the stochastic nature of these errors, particularly in multi-digit generation. In this work, we investigate the internal decision-making dynamics of VLMs (e.g., Qwen3-VL, Gemma3) through the lens of attention mechanisms. By leveraging a controlled synthetic dataset and introducing novel metrics for Visual Sparsity and Entropy, we discover a novel phenomenon: Spatiotemporal Attention Decay. Our analysis reveals two distinct failure modes. Spatially, models exhibit a Halo Effect, where attention focuses on the peripheral convex hull of object clusters rather than penetrating the geometric centers of individual instances. Temporally, we observe a phenomenon of Language Takeover: during auto-regressive decoding, visual grounding decays rapidly after the initial token. Quantitative analysis confirms that as attention sparsity drops and entropy rises, the generation of subsequent digits degenerates from visual perception into hallucination driven by language priors. These findings suggest that counting failures stem from the model’s inability to maintain spatiotemporal focus, highlighting the need for mechanisms that enforce persistent visual grounding.

Why is "Chicago" Predictive of Deceptive Reviews? Using LLMs to Discover Language Phenomena from Lexical Cues
Jiaming Qu | Mengtian Guo | Yue Wang

Deceptive reviews mislead consumers, harm businesses, and undermine trust in online marketplaces. Machine learning classifiers can learn from large amounts of data to distinguish deceptive reviews from genuine ones. However, the distinguishing features learned by these classifiers are often subtle, fragmented, and difficult for humans to interpret, which can hinder user understanding and trust. In this work, we study whether large language models (LLMs) can translate such unintuitive lexical cues into human-understandable language phenomena. We propose a conjecture-then-validate framework, and show that language phenomena obtained in this manner are empirically grounded in data, generalizable across similar domains, and more predictive than phenomena derived from LLMs’ prior knowledge or in-context learning. Such phenomena can aid people in critically assessing the credibility of online reviews in environments where deception detection classifiers are unavailable.

Domain-Dependent Safety Behavior in Open-Weight LLMs: An Empirical Study Across Seven Ethical Domains
Zacharie Bugaud

We present a systematic study of domain-dependent safety behavior in open-weight LLMs: 7 standardized experiments across 7 ethical domains, testing 5 models (12B–70B) in 4,200 interactions with dual-judge validation. Using a dual-condition methodology, with each scenario tested in both an analytical framing (identify the harm) and an operational framing (help commit the harm), we find that compliance rates vary from 14.7% (human trafficking) to 85.7% (surveillance design), a 71-percentage-point span with non-overlapping cluster-bootstrapped 95% CIs. Domain accounts for 36% of pair-level variance in harm scores, with scenario (26%) exceeding model identity (15%). A stable model safety hierarchy persists across domains (mean Spearman ρ = 0.68). These findings demonstrate that safety alignment is not a general capability: aggregate safety scores mask critical domain-level variation, motivating domain-specific safety auditing for trustworthy deployment.

A Systematic Comparison between Extractive Self-Explanations and Human Rationales in Text Classification
Stephanie Brandl | Oliver Eberle

Instruction-tuned LLMs are able to provide *an* explanation about their output to users by generating self-explanations, without requiring the application of complex interpretability techniques. In this paper, we analyse whether this ability results in a *good* explanation. We evaluate self-explanations in the form of input rationales with respect to their plausibility to humans. We study three text classification tasks: sentiment classification, forced labour detection and claim verification. We include Danish and Italian translations of the sentiment classification task and compare self-explanations to human annotations. For this, we collected human rationale annotations for Climate-Fever, a claim verification dataset. We furthermore evaluate the faithfulness of human and self-explanation rationales with respect to correct model predictions, and extend the study by incorporating post-hoc attribution-based explanations. We analyse four open-weight LLMs and find that alignment between self-explanations and human rationales highly depends on text length and task complexity. Nevertheless, self-explanations yield faithful subsets of token-level rationales, whereas post-hoc attribution methods tend to emphasize structural and formatting tokens, reflecting fundamentally different explanation strategies.

Guiding Giants: Lightweight Controllers for Weighted Activation Steering in LLMs
Amr Hegazy | Mostafa Elhoushi | Amr Alanwar

Controlling undesirable LLM behaviors typically requires costly fine-tuning, while existing inference-time steering methods lack fine-grained adaptivity. We introduce a lightweight, trainable controller network for adaptive inference-time control. The controller observes intermediate LLM activations to predict a global scaling factor and layer-specific weights, which dynamically modulate a pre-computed “refusal direction” vector. Trained on harmful and benign prompts, the controller learns to apply nuanced, layer-aware steering selectively. Experiments on Llama and Mistral models show our method significantly increases refusal rates on safety benchmarks like ToxicChat, outperforming existing approaches without altering the original model parameters.
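The modulation step described above can be sketched as follows, assuming an additive update of the form h_l' = h_l + s * w_l * d, where s is the global scaling factor, w_l the weight for layer l, and d the pre-computed direction. The additive form and the function name are assumptions for illustration. The controller network that predicts s and w_l is omitted, and its outputs are passed in as plain numbers.

```python
def steer_activations(layer_activations, direction, global_scale, layer_weights):
    """Add a scaled steering direction to each layer's activation vector.

    layer_activations: one activation vector (list of floats) per layer.
    direction: the pre-computed steering direction, same width as each vector.
    global_scale: scalar s (would come from the controller; given here).
    layer_weights: one weight w_l per layer (would come from the controller).
    """
    if len(layer_activations) != len(layer_weights):
        raise ValueError("need exactly one weight per layer")
    steered = []
    for activation, weight in zip(layer_activations, layer_weights):
        if len(activation) != len(direction):
            raise ValueError("direction width must match activation width")
        # h_l' = h_l + s * w_l * d, applied element-wise.
        steered.append(
            [x + global_scale * weight * d for x, d in zip(activation, direction)]
        )
    return steered
```

A layer weight of zero leaves that layer untouched, which is one way to read the "selective, layer-aware" behavior the abstract describes.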

SURGELLM: Rethinking Multi-Task Evaluation through Task-Aware Feature Gating with Class-Balanced Normalization
Noor Islam S. Mohammad | Ulug Bayazit

Fine-tuned encoders deployed across heterogeneous NLP tasks face three compounding problems: mismatched inductive biases, class-imbalance corruption of feature statistics, and no mechanism to condition attention on external lexical knowledge. We introduce SURGELLM, a unified transformer framework that addresses each with a dedicated lightweight module: a surgical feature gate (learned per-dimension sigmoid over curated lexical indicators and [CLS]; provably degenerates to identity when features are uninformative), task-conditioned prefix tokens (quantized feature values and task identity prepended to every input), and Instance-Weighted Normalization (IWN; removes class-prior bias from gate statistics). We prove an excess-risk bound linking gate benefit to surgical feature alignment. Across four tasks, SST-2, multi-hop retrieval, LLM-prompt attribution, and authorship detection, covering 17,830 examples and eleven model variants over three seeds, the IWN variant achieves macro-F1 0.940 (+0.036 over the strongest non-IWN baseline; +0.130 on authorship detection). A random-vocabulary control (-0.028 avg. F1) confirms gains are lexical, not parametric. Code, vocabularies, and a 99.5%-recovery auto-extraction recipe are released.

pdf bib abs

In this paper we present a systematic study of social bias in small- to mid-scale Large Language Models (LLMs), focusing on gender, religion, and race. Using our SALT (Social Appropriateness in LLM Text) dataset, we explore two bias categories: Theoretical and Practical. Theoretical bias covers General Debate and Positioned Debate, while Practical bias includes Career Advice, Personal Advice, and Resume Generation. For Theoretical bias, we quantify bias using win-rate gaps in General Debate and negative-role assignments in Positioned Debate. For Practical bias, we anonymize model outputs to remove explicit demographic cues and use DeepSeek-R1 as an automated evaluator, measuring outcome disparities across groups. We also examine systemic issues in LLM-based evaluation, including evaluation bias, positional bias, and length bias, and validate our findings through human annotation. Our results show consistent disadvantages for White, Christian, and male-associated outputs across multiple tasks. Larger models often amplify these disparities, highlighting that scale does not guarantee fairness.

GateKD: Confidence-Gated Closed-Loop Distillation for Robust Reasoning
Kasidit Sermsri | Teerapong Panboonyuen

Distilling multi-step reasoning abilities from large language models (LLMs) into compact student models remains challenging due to noisy rationales, hallucinated supervision, and static teacher–student interactions. Existing reasoning distillation methods, including mentor-based approaches, predominantly operate in an open-loop manner, implicitly assuming uniform teacher reliability and consequently propagating erroneous intermediate reasoning. We propose GateKD, a confidence-gated closed-loop distillation framework that enables robust reasoning transfer by treating the teacher as a dynamic gatekeeper rather than a static oracle. GateKD introduces three complementary mechanisms: (i) confidence-gated soft supervision that selectively distills reliable predictive signals, (ii) gated hidden-state evolution that aligns intermediate representations only when teacher confidence is high, and (iii) reliability-filtered attention distillation that preserves stable reasoning structures while suppressing noisy patterns. These components jointly form a closed feedback loop in which teacher confidence continuously modulates the distillation process, reducing hallucination transfer and stabilizing student reasoning. Extensive experiments across commonsense, logical, and symbolic reasoning benchmarks, using T5 and Flan-T5 backbones of varying sizes, demonstrate that GateKD consistently outperforms strong open-loop distillation baselines. Notably, GateKD yields substantial gains in logical and symbolic reasoning, remains robust under low-resource distillation settings, and shows clear performance degradation when any gating component is removed. Our results highlight that confidence-gated closed-loop supervision is critical for building reliable and scalable small reasoning models.

The Geometry of Refusal: Linear Instability in Safety-Aligned LLMs
Shivam Ratnakar | Kartikeya Vats

Modern Large Language Models (LLMs) rely on extensive safety alignment, yet the mechanistic basis of refusal remains opaque. In this work, we investigate whether safety compliance is a deep semantic decision or a manipulable linear feature. We introduce Contrastive Logit Steering (CLS), a zero-optimization framework that isolates the "refusal direction" by contrasting hidden states derived from safe and unrestricted system prompts. Unlike representation engineering methods that intervene on internal activations, CLS operates directly on the output distribution, serving as a diagnostic probe for alignment fragility. When coupled with prefix injection to bypass initial refusal reflexes, this method induces a phase transition where guardrails collapse. Our experiments on 7 model families reveal that safety implementation is architecturally deterministic. While models like Llama-3.1 exhibit a "Late Decision" topology that is easily bypassed by CLS (reaching 95% ASR in milliseconds), others like Qwen-2.5 demonstrate "Early Divergence" by integrating safety mid-computation. Direct comparison with established activation-level steering methods shows that CLS achieves substantially higher attack success rates on Llama 2 (73% vs. 22.6%) and Qwen 7B (91% vs. 79.2%), demonstrating that logit-level intervention exposes alignment vulnerabilities that hidden-state methods underestimate. Beyond attacks, we show that this linearity enables bidirectional control: inverting the steering vector "hardens" models against jailbreaks without retraining. Our findings suggest that current alignment techniques create a steerable "safety axis" that serves as both a critical vulnerability and a precise primitive for defense.

The Conservative AI: Diagnosing Hold Bias and Reliability Limits in Persona-Based Monetary Policy Simulation
Giyong Kim | Sojung Kim

We examine whether large language models (LLMs) can reliably simulate historical FOMC policy decisions and whether persona-based agentic deliberation improves performance. Using strictly time-consistent vintage economic information, we evaluate multiple state-of-the-art LLMs on a three-way Hike/Hold/Cut classification task in both single-agent and multi-agent settings. Single-LLM baselines achieve nontrivial accuracy and track broad policy regime shifts, establishing a simple but strong benchmark. However, we identify a systematic behavioral asymmetry that we term Hold bias: models disproportionately favor Hold decisions and remain reluctant to predict Cut outcomes even during easing cycles. This conservatism is especially costly around regime turning points, where reliable adaptation matters most. We further find that standard agentic workflows, including debate and consensus-style aggregation, do not mitigate this problem and often amplify caution rather than improve accuracy. Overall, our results show that plausible deliberation is not sufficient for trustworthy decision support. Progress will require agentic systems explicitly designed to diagnose and correct structural bias, rather than merely reproducing surface-level committee interaction.