Susmit Jha


2026

Chain-of-thought (CoT) prompting makes language models write step-by-step explanations, but these steps may not match what the model actually used to choose its answer. Existing faithfulness checks often only test whether changing the written chain changes the answer, without verifying whether the steps are truly supported by the given evidence, or they require special prompts that do not generalize well. We present NSF-CoT, a neuro-symbolic formal verification method that checks CoT faithfulness step by step for contextual question answering. NSF-CoT (1) converts the provided context facts and each reasoning step into simple logical statements, (2) uses counterfactual attribution to estimate which context facts the model relied on while generating each step, and (3) verifies each step using a hybrid checker that combines an SMT solver with an LLM-based entailment judge. For every step, we score groundedness (supported by the full context), validity (supported by the facts the model relied on), and utility (helps reach the final answer), and combine them into a faithfulness score. Across OpenBookQA, QASC, and HotpotQA, NSF-CoT consistently outperforms causal mediation, perturbation probes, and behavioral monitoring, and it identifies reasoning steps that are not only unfaithful but also harmful to the model’s final decision.

2025

We propose a novel inference-time out-of-domain (OOD) detection algorithm for specialized large language models (LLMs). Despite achieving state-of-the-art performance on in-domain tasks through fine-tuning, specialized LLMs remain vulnerable to incorrect or unreliable outputs when presented with OOD inputs, posing risks in critical applications. Our method leverages the Inductive Conformal Anomaly Detection (ICAD) framework, using a new non-conformity measure based on the model’s dropout tolerance. Motivated by recent findings on polysemanticity and redundancy in LLMs, we hypothesize that in-domain inputs exhibit higher dropout tolerance than OOD inputs. We aggregate dropout tolerance across multiple layers via a valid ensemble approach, improving detection while maintaining theoretical false alarm bounds from ICAD. Experiments with medical-specialized LLMs show that our approach detects OOD inputs better than baseline methods, with AUROC improvements of 2% to 37% when treating OOD datapoints as positives and in-domain test datapoints as negatives.

2024

Textual backdoor attacks pose significant security threats. Current detection approaches, typically relying on intermediate feature representation or reconstructing potential triggers, are task-specific and less effective beyond sentence classification, struggling with tasks like question answering and named entity recognition. We introduce TABDet (Task-Agnostic Backdoor Detector), a pioneering task-agnostic method for backdoor detection. TABDet leverages final layer logits combined with an efficient pooling technique, enabling unified logit representation across three prominent NLP tasks. TABDet can jointly learn from diverse task-specific models, demonstrating superior detection efficacy over traditional task-specific methods.