Meina Chen




2025

Exploiting Prompt-induced Confidence for Black-Box Attacks on LLMs
Meina Chen | Yihong Tang | Kehai Chen
Findings of the Association for Computational Linguistics: EMNLP 2025

Large language models (LLMs) are vulnerable to adversarial attacks even in strict black-box settings with only hard-label feedback. Existing attacks suffer from inefficient search due to the lack of informative signals such as logits or probabilities. In this work, we propose Prompt-Guided Ensemble Attack (PGEA), a novel black-box framework that leverages prompt-induced confidence, which reflects variations in a model’s self-assessed certainty across different prompt templates, as an auxiliary signal to guide attacks. We first demonstrate that confidence estimates vary significantly with prompt phrasing despite unchanged predictions. We then integrate these confidence signals in a two-stage attack: (1) estimating token-level vulnerability via confidence elicitation, and (2) applying ensemble word-level substitutions guided by these estimates. Experiments on LLaMA-3-8B-Instruct and Mistral-7B-Instruct-v0.3 on three classification tasks show that PGEA improves the attack success rate and query efficiency while maintaining semantic fidelity. Our results highlight that verbalized confidence, even without access to probabilities, is a valuable and underexplored signal for black-box adversarial attacks. The code is available at https://github.com/cmn-bits/PGEA-main.
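
The two-stage idea in the abstract can be illustrated with a toy sketch. This is not the authors' PGEA implementation: the prompt templates, the keyword-based toy_model standing in for a black-box LLM, the leave-one-out vulnerability score, and the greedy substitution loop are all assumptions made for illustration. The sketch only shows how a verbalized confidence, averaged over several prompt templates, could rank tokens before word-level substitutions are tried.

```python
from statistics import mean
from typing import Callable, Dict, List, Optional, Tuple

# A black-box model maps a prompt to (hard label, verbalized confidence).
Model = Callable[[str], Tuple[str, float]]

# Hypothetical prompt templates; the paper's actual templates are not given here.
TEMPLATES = [
    "Classify the sentiment and state your confidence: {text}",
    "How sure are you about the sentiment of this review? {text}",
]

# Toy word weights used only by the stand-in model below.
WEIGHTS = {"great": 0.6, "good": 0.3, "okay": 0.05}


def toy_model(prompt: str) -> Tuple[str, float]:
    """Stand-in for an LLM: same label across templates, different confidence."""
    score = sum(WEIGHTS.get(w, 0.0) for w in prompt.lower().split())
    label = "positive" if score >= 0.3 else "negative"
    confidence = min(1.0, 0.5 + abs(score - 0.3))
    if prompt.startswith("How sure"):
        # Phrasing changes the stated confidence but not the prediction.
        confidence = max(0.5, confidence - 0.1)
    return label, confidence


def confidence_in(model: Model, tokens: List[str], label: str) -> float:
    """Mean confidence assigned to `label`, averaged over all templates."""
    text = " ".join(tokens)
    scores = []
    for template in TEMPLATES:
        pred, conf = model(template.format(text=text))
        scores.append(conf if pred == label else 1.0 - conf)
    return mean(scores)


def token_vulnerability(model: Model, tokens: List[str], label: str) -> List[float]:
    """Stage 1 (assumed form): confidence drop when each token is removed."""
    base = confidence_in(model, tokens, label)
    return [
        base - confidence_in(model, tokens[:i] + tokens[i + 1:], label)
        for i in range(len(tokens))
    ]


def attack(model: Model, tokens: List[str],
           substitutes: Dict[str, List[str]]) -> Optional[List[str]]:
    """Stage 2 (assumed form): greedy substitution, most vulnerable token first."""
    label, _ = model(TEMPLATES[0].format(text=" ".join(tokens)))
    vuln = token_vulnerability(model, tokens, label)
    order = sorted(range(len(tokens)), key=lambda i: vuln[i], reverse=True)
    current = list(tokens)
    for i in order:
        best, best_conf = current, confidence_in(model, current, label)
        for cand in substitutes.get(current[i], []):
            trial = current[:i] + [cand] + current[i + 1:]
            pred, _ = model(TEMPLATES[0].format(text=" ".join(trial)))
            if pred != label:
                return trial  # hard label flipped: attack succeeded
            conf = confidence_in(model, trial, label)
            if conf < best_conf:
                best, best_conf = trial, conf
        current = best  # keep the substitution that lowered confidence most
    return None


tokens = ["the", "movie", "was", "great"]
substitutes = {"great": ["okay"], "movie": ["film"]}  # hypothetical candidates
adversarial = attack(toy_model, tokens, substitutes)
print(adversarial)
```

In this toy setting the confidence-based ranking sends the search to the sentiment-bearing word first, so the label flips after a single substitution query. A real attack would replace toy_model with calls to the target LLM and would need a semantic-similarity check on the candidates.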