Exploiting Prompt-induced Confidence for Black-Box Attacks on LLMs

Meina Chen, Yihong Tang, Kehai Chen


Abstract
Large language models (LLMs) are vulnerable to adversarial attacks even in strict black-box settings with only hard-label feedback. Existing attacks suffer from inefficient search due to the lack of informative signals such as logits or probabilities. In this work, we propose Prompt-Guided Ensemble Attack (PGEA), a novel black-box framework that leverages prompt-induced confidence, which reflects variations in a model's self-assessed certainty across different prompt templates, as an auxiliary signal to guide attacks. We first demonstrate that confidence estimates vary significantly with prompt phrasing despite unchanged predictions. We then integrate these confidence signals in a two-stage attack: (1) estimating token-level vulnerability via confidence elicitation, and (2) applying ensemble word-level substitutions guided by these estimates. Experiments on LLaMA-3-8B-Instruct and Mistral-7B-Instruct-v0.3 on three classification tasks show that PGEA improves the attack success rate and query efficiency while maintaining semantic fidelity. Our results highlight that verbalized confidence, even without access to probabilities, is a valuable and underexplored signal for black-box adversarial attacks. The code is available at https://github.com/cmn-bits/PGEA-main.
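The two-stage attack described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration: `query_model` is a hypothetical stand-in for a hard-label LLM classifier that also returns a verbalized confidence when prompted, and the templates and helper names are invented, not the authors' actual prompts or code (see the linked repository for the real implementation).

```python
import random

# Hypothetical stand-in for a black-box LLM: returns only a hard label plus a
# verbalized confidence (0-100) elicited via the given prompt template.
def query_model(text, template):
    random.seed(hash((text, template)) % (2**32))  # deterministic mock
    label = "positive" if "good" in text else "negative"
    confidence = random.randint(50, 100)
    return label, confidence

# Illustrative confidence-elicitation templates (prompt phrasing varies,
# which is the signal PGEA exploits).
TEMPLATES = [
    "How certain are you about your answer? {x}",
    "Rate your confidence from 0 to 100. {x}",
]

def token_vulnerability(text):
    """Stage 1: estimate per-token vulnerability by removing each word and
    averaging the elicited confidence across prompt templates."""
    words = text.split()
    base = sum(query_model(text, t)[1] for t in TEMPLATES) / len(TEMPLATES)
    scores = []
    for i in range(len(words)):
        masked = " ".join(w for j, w in enumerate(words) if j != i)
        conf = sum(query_model(masked, t)[1] for t in TEMPLATES) / len(TEMPLATES)
        scores.append((i, base - conf))  # larger confidence drop => more vulnerable
    return sorted(scores, key=lambda s: -s[1])

def attack(text, substitutes):
    """Stage 2: try word-level substitutions on the most vulnerable tokens
    first, stopping when the hard label flips; `substitutes` maps a word to
    an ensemble of candidate replacements."""
    orig_label, _ = query_model(text, TEMPLATES[0])
    words = text.split()
    for i, _ in token_vulnerability(text):
        for cand in substitutes.get(words[i], []):
            trial = words[:]
            trial[i] = cand
            label, _ = query_model(" ".join(trial), TEMPLATES[0])
            if label != orig_label:
                return " ".join(trial)  # successful adversarial example
    return None  # attack failed within the candidate set
```

The key design point the sketch mirrors is that only hard labels and verbalized confidence are consumed: no logits or probabilities are ever read, so the vulnerability ranking in stage 1 comes entirely from prompt-induced confidence variation.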
Anthology ID:
2025.findings-emnlp.692
Volume:
Findings of the Association for Computational Linguistics: EMNLP 2025
Month:
November
Year:
2025
Address:
Suzhou, China
Editors:
Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, Violet Peng
Venue:
Findings
Publisher:
Association for Computational Linguistics
Pages:
12897–12903
URL:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.692/
DOI:
10.18653/v1/2025.findings-emnlp.692
Bibkey:
Cite (ACL):
Meina Chen, Yihong Tang, and Kehai Chen. 2025. Exploiting Prompt-induced Confidence for Black-Box Attacks on LLMs. In Findings of the Association for Computational Linguistics: EMNLP 2025, pages 12897–12903, Suzhou, China. Association for Computational Linguistics.
Cite (Informal):
Exploiting Prompt-induced Confidence for Black-Box Attacks on LLMs (Chen et al., Findings 2025)
PDF:
https://preview.aclanthology.org/author-page-yu-wang-polytechnic/2025.findings-emnlp.692.pdf
Checklist:
2025.findings-emnlp.692.checklist.pdf