2025
pdf
bib
abs
SELF-PERCEPT: Introspection Improves Large Language Models’ Detection of Multi-Person Mental Manipulation in Conversations
Danush Khanna
|
Pratinav Seth
|
Sidhaarth Sredharan Murali
|
Aditya Kumar Guru
|
Siddharth Shukla
|
Tanuj Tyagi
|
Sandeep Chaurasia
|
Kripabandhu Ghosh
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers)
Mental manipulation is a subtle yet pervasive form of abuse in interpersonal communication, making its detection critical for safeguarding potential victims. However, due to manipulation’s nuanced and context-specific nature, identifying manipulative language in complex, multi-turn, and multi-person conversations remains a significant challenge for large language models (LLMs). To address this gap, we introduce the MultiManip dataset, comprising 220 multi-turn, multi-person dialogues balanced between manipulative and non-manipulative interactions, all drawn from reality shows that mimic real-world scenarios. For manipulative interactions, it includes 11 distinct manipulations depicting real-life scenarios. We conduct extensive evaluations of state-of-the-art LLMs, such as GPT-4o and Llama-3.1-8B, employing various prompting strategies. Despite their capabilities, these models often struggle to detect manipulation effectively. To overcome this limitation, we propose SELF-PERCEPT, a novel, two-stage prompting framework inspired by Self-Perception Theory, demonstrating strong performance in detecting multi-person, multi-turn mental manipulation. Our code and data are publicly available at https://github.com/danushkhanna/self-percept .
pdf
bib
abs
Alignment Quality Index (AQI) : Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer wise Pooled Representations
Abhilekh Borah
|
Chhavi Sharma
|
Danush Khanna
|
Utkarsh Bhatt
|
Gurpreet Singh
|
Hasnat Md Abdullah
|
Raghav Kaushik Ravi
|
Vinija Jain
|
Jyoti Patel
|
Shubham Singh
|
Vasu Sharma
|
Arpita Vats
|
Rahul Raja
|
Aman Chadha
|
Amitava Das
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains like education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots. Aligned models are often vulnerable to jailbreaking, stochasticity of generation, and alignment faking. To address this issue, we introduce the **Alignment Quality Index (AQI)**. This novel geometric and prompt-invariant metric empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the *Davies-Bouldin score (DBS)*, *Dunn index (DI)*, *Xie-Beni index (XBI)*, and *Calinski-Harabasz index (CHI)* across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks, even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the **LITMUS** dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS across different models trained under DPO, GRPO, and RLHF conditions demonstrate AQI’s correlation with external judges and ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
pdf
bib
abs
DPO Kernels: A Semantically-Aware, Kernel-Enhanced, and Divergence-Rich Paradigm for Direct Preference Optimization
Amitava Das
|
Suranjana Trivedy
|
Danush Khanna
|
Yaswanth Narsupalli
|
Basab Ghosh
|
Rajarshi Roy
|
Gurpreet Singh
|
Vinija Jain
|
Vasu Sharma
|
Aishwarya Naresh Reganti
|
Aman Chadha
Findings of the Association for Computational Linguistics: ACL 2025
The rapid advancement of large language models (LLMs) has revolutionized numerous applications, but presents significant challenges in aligning these models with diverse human values, ethical standards, and specific user preferences. Direct Preference Optimization (DPO) has become a cornerstone for preference alignment but is constrained by reliance on fixed divergence measures and limited feature transformations. We introduce DPO-Kernels, an innovative enhancement of DPO that integrates kernel methods to overcome these challenges through four key contributions: (i) Kernelized Representations: These representations enhance divergence measures by using polynomial, RBF, Mahalanobis, and spectral kernels for richer feature transformations. Additionally, we introduce a hybrid loss that combines embedding-based loss with probability-based loss; (ii) Divergence Alternatives: Beyond Kullback–Leibler (KL), we incorporate Jensen-Shannon, Hellinger, Rényi, Bhattacharyya, Wasserstein, and other f-divergences to boost stability and robustness; (iii) Data-Driven Selection: Choosing the optimal kernel-divergence pair among 28 combinations (4 kernels × 7 divergences) is challenging. We introduce automatic metrics that analyze the data to select the best kernel-divergence pair, eliminating the need for manual tuning; (iv) Hierarchical Mixture of Kernels (HMK): Combining local and global kernels for precise and large-scale semantic modeling. This approach automatically selects the optimal kernel mixture during training, enhancing modeling flexibility. DPO-Kernels achieve state-of-the-art generalization in factuality, safety, reasoning, and instruction following across 12 datasets. While alignment risks overfitting, Heavy-Tailed Self-Regularization (HT-SR) theory confirms that DPO-Kernels ensure robust generalization in LLMs. Comprehensive resources are available to facilitate further research and application of DPO-Kernels.
2024
pdf
bib
abs
Legal Judgment Reimagined: PredEx and the Rise of Intelligent AI Interpretation in Indian Courts
Shubham Kumar Nigam
|
Anurag Sharma
|
Danush Khanna
|
Noel Shallum
|
Kripabandhu Ghosh
|
Arnab Bhattacharya
Findings of the Association for Computational Linguistics: ACL 2024
In the era of Large Language Models (LLMs), predicting judicial outcomes poses significant challenges due to the complexity of legal proceedings and the scarcity of expert-annotated datasets. Addressing this, we introduce Prediction with Explanation (PredEx), the largest expert-annotated dataset for legal judgment prediction and explanation in the Indian context, featuring over 15,000 annotations. This groundbreaking corpus significantly enhances the training and evaluation of AI models in legal analysis, with innovations including the application of instruction tuning to LLMs. This method has markedly improved the predictive accuracy and explanatory depth of these models for legal judgments. We employed various transformer-based models, tailored for both general and Indian legal contexts. Through rigorous lexical, semantic, and expert assessments, our models effectively leverage PredEx to provide precise predictions and meaningful explanations, establishing it as a valuable benchmark for both the legal profession and the NLP community.