Nirmalendu Prakash
2025
Understanding Refusal in Language Models with Sparse Autoencoders
Wei Jie Yeo
|
Nirmalendu Prakash
|
Clement Neo
|
Ranjan Satapathy
|
Roy Ka-Wei Lee
|
Erik Cambria
Findings of the Association for Computational Linguistics: EMNLP 2025
Refusal is a key safety behavior in aligned language models, yet the internal mechanisms driving refusals remain opaque. In this work, we conduct a mechanistic study of refusal in instruction-tuned LLMs using sparse autoencoders to identify latent features that causally mediate refusal behaviors. We apply our method to two open-source chat models and intervene on refusal-related features to assess their influence on generation, validating their behavioral impact across multiple harmful datasets. This enables a fine-grained inspection of how refusal manifests at the activation level and addresses key research questions such as investigating upstream-downstream latent relationship and understanding the mechanisms of adversarial jailbreaking techniques. We also establish the usefulness of refusal features in enhancing generalization for linear probes to out-of-distribution adversarial samples in classification tasks.
2024
SGHateCheck: Functional Tests for Detecting Hate Speech in Low-Resource Languages of Singapore
Ri Chi Ng
|
Nirmalendu Prakash
|
Ming Shan Hee
|
Kenny Tsu Wei Choo
|
Roy Ka-wei Lee
Proceedings of the 8th Workshop on Online Abuse and Harms (WOAH 2024)
To address the limitations of current hate speech detection models, we introduce SGHateCheck, a novel framework designed for the linguistic and cultural context of Singapore and Southeast Asia. It extends the functional testing approach of HateCheck and MHC, employing large language models for translation and paraphrasing into Singapore’s main languages, and refining these with native annotators. SGHateCheck reveals critical flaws in state-of-the-art models, highlighting their inadequacy in sensitive content moderation. This work aims to foster the development of more effective hate speech detection tools for diverse linguistic environments, particularly for Singapore and Southeast Asia contexts.
2023
Layered Bias: Interpreting Bias in Pretrained Large Language Models
Nirmalendu Prakash
|
Roy Ka-Wei Lee
Proceedings of the 6th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP
Large language models (LLMs) like GPT and PALM have excelled in numerous natural language processing (NLP) tasks such as text generation, question answering, and translation. However, they are also found to have inherent social biases. To address this, recent studies have proposed debiasing techniques like iterative nullspace projection (INLP) and Counterfactual Data Augmentation (CDA). Additionally, there’s growing interest in understanding the intricacies of these models. Some researchers focus on individual neural units, while others examine specific layers. In our study, we benchmark newly released models, assess the impact of debiasing methods, and investigate how biases are linked to different transformer layers using a method called Logit Lens. Specifically, we evaluate three modern LLMs: OPT, LLaMA, and LLaMA2, and their debiased versions. Our experiments are based on two popular bias evaluation datasets, StereoSet and CrowS-Pairs, and we perform a layer-by-layer analysis using the Logit Lens.
Search
Fix author
Co-authors
- Roy Ka-Wei Lee 3
- Erik Cambria 1
- Kenny Tsu Wei Choo 1
- Ming Shan Hee 1
- Clement Neo 1
- show all...