Purva Chiniya

2026

Gradient-Controlled Decoding: A Safety Guardrail for LLMs with Dual-Anchor Steering
Purva Chiniya | Kevin Joseph Scaria | Sagar Chaturvedi
Proceedings of the Fifteenth Language Resources and Evaluation Conference

Large language models (LLMs) remain susceptible to jailbreak and direct prompt-injection attacks, yet the strongest defensive filters frequently over- refuse benign queries and degrade user experience. Previous work on prompt injection detection such as, GradSafe, detects unsafe prompts with a single "accept all" anchor token, but its threshold is brittle and it offers no deterministic guarantee that harmful content will not be emitted once decoding begins. We introduce Gradient-Controlled Decoding (GCD), a training-free guardrail that combines with both an acceptance anchor ("Sure") and refusal anchor ("Sorry") tightening the decision boundary and lowering false positives. In the mitigation stage, if a prompt is flagged, GCD preset-injects one or two refusal tokens ("Sorry, I can’t . . . ") before autoregressive decoding resumes, guaranteeing first- token safety regardless of sampling strategy. On ToxicChat, XSTest-v2, and AdvBench, GCD reduces false positives by 52% vs. GradSafe at comparable recall, lowers attack success rate by up to 20% vs. the strongest decoding-only baseline, adds under 15-20 ms latency on an average on V100 instances, transfers to LLaMA-2-7B, Mixtral-8×7B, and Qwen-2-7B, and requires only 20 template prompts. GCD is a lightweight, scalable safety layer for real-time LLM deployment.

2023

pdf bib abs

CoSyn: Detecting Implicit Hate Speech in Online Conversations Using a Context Synergized Hyperbolic Network
Sreyan Ghosh | Manan Suri | Purva Chiniya | Utkarsh Tyagi | Sonal Kumar | Dinesh Manocha
Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

The tremendous growth of social media users interacting in online conversations has led to significant growth in hate speech affecting people from various demographics. Most of the prior works focus on detecting explicit hate speech, which is overt and leverages hateful phrases, with very little work focusing on detecting hate speech that is implicit or denotes hatred through indirect or coded language. In this paper, we present CoSyn, a context synergized neural network that explicitly incorporates user- and conversational-context for detecting implicit hate speech in online conversations. CoSyn introduces novel ways to encode these external contexts and employs a novel context interaction mechanism that clearly captures the interplay between them, making independent assessments of the amounts of information to be retrieved from these noisy contexts. Additionally, it carries out all these operations in the hyperbolic space to account for the scale-free dynamics of social media. We demonstrate the effectiveness of CoSyn on 6 hate speech datasets and show that CoSyn outperforms all our baselines in detecting implicit hate speech with absolute improvements in the range of 1.24% - 57.8%. We make our code available.

Co-authors

Manan Suri 1

Utkarsh Tyagi 1

Venues

EMNLP1
LREC1

Fix author