David Huang
2025
Stronger Universal and Transferable Attacks by Suppressing Refusals
David Huang
|
Avidan Shah
|
Alexandre Araujo
|
David Wagner
|
Chawin Sitawarin
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Making large language models (LLMs) safe for mass deployment is a complex and ongoing challenge. Efforts have focused on aligning models to human preferences (RLHF), essentially embedding a “safety feature” into the model’s parameters. The Greedy Coordinate Gradient (GCG) algorithm (Zou et al., 2023b) has emerged as one of the most popular automated jailbreaks, an attack that circumvents this safety training. So far, it has been believed that such optimization-based attacks (unlike hand-crafted ones) are sample-specific: to make them universal and transferable, one has to incorporate multiple samples and models into the objective function. Contrary to this belief, we find that the adversarial prompts discovered by such optimizers are inherently prompt-universal and transferable, even when optimized on a single model and a single harmful request. To further exploit this phenomenon, we introduce IRIS, a new objective for these optimizers that explicitly deactivates the safety feature, creating an even stronger universal and transferable attack. Without requiring a large number of queries or access to output token probabilities, our universal and transferable attack achieves a 25% success rate against the state-of-the-art Circuit Breaker defense (Zou et al., 2024), compared to 2.5% by white-box GCG. Crucially, IRIS also attains state-of-the-art transfer rates on frontier models: GPT-3.5-Turbo (90%), GPT-4o-mini (86%), GPT-4o (76%), o1-mini (54%), o1-preview (48%), o3-mini (66%), and deepseek-reasoner (90%).
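The greedy coordinate search underlying GCG-style attacks can be illustrated with a toy sketch: at each step, try single-token substitutions at every position of a discrete suffix and keep the one swap that most lowers the loss. The vocabulary, loss function, and target below are illustrative stand-ins, not the paper's actual LLM objective.

```python
import random

VOCAB = list("abcdefghijklmnopqrstuvwxyz")
TARGET = "attack"  # hypothetical string the toy loss prefers

def loss(suffix):
    """Toy loss: number of positions that differ from TARGET."""
    return sum(1 for a, b in zip(suffix, TARGET) if a != b)

def greedy_coordinate_search(seed=0, steps=20):
    rng = random.Random(seed)
    suffix = [rng.choice(VOCAB) for _ in range(len(TARGET))]
    for _ in range(steps):
        best = (loss(suffix), None, None)  # (loss, position, token)
        # Evaluate every single-token swap; keep the best improvement.
        for pos in range(len(suffix)):
            for tok in VOCAB:
                cand = suffix.copy()
                cand[pos] = tok
                cand_loss = loss(cand)
                if cand_loss < best[0]:
                    best = (cand_loss, pos, tok)
        if best[1] is None:  # no improving swap found; converged
            break
        suffix[best[1]] = best[2]
    return "".join(suffix), loss(suffix)

suffix, final_loss = greedy_coordinate_search()
```

In the real attack, the loss would instead be a gradient-guided objective over an LLM's token embeddings (with IRIS adding a refusal-suppression term), but the outer greedy coordinate loop has the same shape.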
2024
ClaimLens: Automated, Explainable Fact-Checking on Voting Claims Using Frame-Semantics
Jacob Devasier
|
Rishabh Mediratta
|
Phuong Anh Le
|
David Huang
|
Chengkai Li
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations
We present ClaimLens, an automated fact-checking system focused on voting-related factual claims. Existing fact-checking solutions often lack transparency, making it difficult for users to trust and understand the reasoning behind the outcomes. In this work, we address the critical need for transparent and explainable automated fact-checking solutions. We propose a novel approach that leverages frame-semantic parsing to provide structured and interpretable fact verification. By focusing on voting-related claims, we can utilize publicly available voting records from official United States congressional sources and the established Vote semantic frame to extract relevant information from claims. Furthermore, we propose novel data augmentation techniques for frame-semantic parsing, a task known to lack robust annotated data, which leads to a +9.5% macro F1 score on frame element identification over our baseline.
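The frame-semantic extraction described above can be illustrated with a minimal sketch: identify a Vote-frame trigger in a claim and pull out structured frame elements. The pattern and element names (Agent, Position, Issue) are simplified stand-ins for the FrameNet Vote frame, not ClaimLens's actual parser.

```python
import re

# Hypothetical pattern for a simplified Vote frame with three elements.
VOTE_PATTERN = re.compile(
    r"(?P<Agent>[A-Z][\w .]+?) voted (?P<Position>for|against) (?P<Issue>.+)"
)

def extract_vote_frame(claim):
    """Return a dict of frame elements, or None if no Vote frame is found."""
    m = VOTE_PATTERN.search(claim)
    return m.groupdict() if m else None

frame = extract_vote_frame("Senator Smith voted against the farm bill.")
```

A real frame-semantic parser would use a trained model over FrameNet annotations rather than a regular expression; the sketch only shows the shape of the structured output that the extracted frame elements take.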