SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models

Aafiya Shamshad Hussain, Gaurav Srivastava, Alvi Md Ishmam, Zaber Ibn Abdul Hakim, Chris Thomas


Abstract
Multimodal foundation models that integrate audio, vision, and language achieve strong performance on reasoning and generation tasks, yet their robustness to adversarial manipulation remains poorly understood. We study a realistic and underexplored threat model: **untargeted, audio-only adversarial attacks** on trimodal audio–video–language models. We analyze six complementary attack objectives that target different stages of multimodal processing, including audio encoder representations, cross-modal attention, hidden states, and output likelihoods. Across four state-of-the-art models and multiple benchmarks, we show that audio-only perturbations can induce severe multimodal failures, achieving up to **96% attack success rate**. We further show that attacks succeed at low perceptual distortion (LPIPS ≤ 0.08, SI-SNR ≥ 0 dB) and benefit more from extended optimization than from increased data scale. We also evaluate the feasibility of these attacks under physically realistic conditions by incorporating room impulse response (RIR) modeling; audio-only perturbations remain effective under environmental transformations, which highlights the practical risk of single-modality attacks in real-world multimodal systems. Transferability across models and encoders remains limited, while speech recognition systems such as Whisper primarily respond to perturbation magnitude, with attacks reaching **>97% attack success** under severe distortion. These results expose a previously overlooked single-modality attack surface in multimodal systems and motivate defenses that enforce cross-modal consistency. Our project website is available at https://aafiya-h.github.io/soundbreak/.
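The abstract bounds audio distortion by SI-SNR (scale-invariant signal-to-noise ratio) without defining it. As a reading aid, the sketch below computes SI-SNR using the standard definition from the source-separation literature; it is not taken from the paper, and the function name `si_snr_db` and the `eps` stabilizer are illustrative assumptions.

```python
import math


def si_snr_db(reference, estimate, eps=1e-12):
    """Scale-invariant SNR (dB) of `estimate` relative to `reference`.

    Standard definition (assumed here, not taken from the paper):
      s_target = (<est, ref> / ||ref||^2) * ref
      e_noise  = est - s_target
      SI-SNR   = 10 * log10(||s_target||^2 / ||e_noise||^2)
    Both signals are mean-centered first, as is conventional.
    """
    if len(reference) != len(estimate):
        raise ValueError("signals must have the same length")
    n = len(reference)
    ref_mean = sum(reference) / n
    est_mean = sum(estimate) / n
    ref = [r - ref_mean for r in reference]
    est = [e - est_mean for e in estimate]

    # Project the estimate onto the reference to get the scaled target.
    dot = sum(r * e for r, e in zip(ref, est))
    ref_energy = sum(r * r for r in ref)
    scale = dot / (ref_energy + eps)
    target = [scale * r for r in ref]

    # Whatever is left after the projection is treated as noise.
    noise = [e - t for e, t in zip(est, target)]
    target_energy = sum(t * t for t in target)
    noise_energy = sum(x * x for x in noise)
    return 10.0 * math.log10((target_energy + eps) / (noise_energy + eps))


# Toy signals (hypothetical): a sine "clean" clip and an orthogonal cosine perturbation.
N, K = 64, 4
clean = [math.sin(2 * math.pi * K * i / N) for i in range(N)]
delta = [math.cos(2 * math.pi * K * i / N) for i in range(N)]
perturbed = [c + d for c, d in zip(clean, delta)]
print(si_snr_db(clean, perturbed))
```

Under this definition, a perturbation that is orthogonal to the clean signal and carries the same energy sits at 0 dB, so a bound of SI-SNR ≥ 0 dB can be read as allowing perturbation energy up to roughly that of the signal itself. Rescaling the perturbed clip leaves the value unchanged, which is the sense in which the metric is scale-invariant.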
Anthology ID:
2026.acl-long.1275
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
27635–27663
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1275/
Cite (ACL):
Aafiya Shamshad Hussain, Gaurav Srivastava, Alvi Md Ishmam, Zaber Ibn Abdul Hakim, and Chris Thomas. 2026. SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 27635–27663, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
SoundBreak: A Systematic Study of Audio-Only Adversarial Attacks on Trimodal Models (Hussain et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1275.pdf
Checklist:
 2026.acl-long.1275.checklist.pdf