Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions

Rachneet Singh Sachdeva, Rima Hazra, Iryna Gurevych


Abstract
Large language models, despite extensive alignment with human values and ethical principles, remain vulnerable to sophisticated jailbreak attacks that exploit their reasoning abilities. Existing safety measures often detect overt malicious intent but fail to address subtle, reasoning-driven vulnerabilities. In this work, we introduce POATE (Polar Opposite query generation, Adversarial Template construction, and Elaboration), a novel jailbreak technique that harnesses contrastive reasoning to provoke unethical responses. POATE crafts semantically opposing intents and integrates them with adversarial templates, steering models toward harmful outputs with remarkable subtlety. We conduct an extensive evaluation across six diverse language model families of varying parameter sizes to demonstrate the robustness of the attack, achieving significantly higher attack success rates (44%) than existing methods. To counter this, we propose Intent-Aware CoT and Reverse Thinking CoT, which decompose queries to detect malicious intent and reason in reverse to evaluate and reject harmful responses. These methods enhance reasoning robustness and strengthen the model's defense against adversarial exploits.
Anthology ID:
2025.emnlp-main.1762
Volume:
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
EMNLP
Publisher:
Association for Computational Linguistics
Pages:
34734–34764
URL:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1762/
Cite (ACL):
Rachneet Singh Sachdeva, Rima Hazra, and Iryna Gurevych. 2025. Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 34734–34764, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Turning Logic Against Itself: Probing Model Defenses Through Contrastive Questions (Sachdeva et al., EMNLP 2025)
PDF:
https://preview.aclanthology.org/ingest-emnlp/2025.emnlp-main.1762.pdf
Checklist:
 2025.emnlp-main.1762.checklist.pdf