Scale Is All You Need: Analyzing Modality Interaction and Speaker Intent Without Fine-Tuning

Animesh Gurjar, Nikhil Krishnaswamy


Abstract
Understanding sarcasm requires integrating cues from language, voice, and facial expression. Recent work has achieved impressive results using large multimodal Transformers, but such models are computationally expensive and often obscure how each modality contributes to the final prediction. This paper introduces a lightweight, interpretable framework for multimodal sarcasm detection that combines frozen text, audio, and visual embeddings from pretrained encoders through compact fusion heads. Using the MUStARD++Balanced dataset, we show that early fusion of textual and acoustic features improves over the best unimodal baseline. Character-specific evaluation further shows that sarcasm expressed through overt prosodic and visual cues is substantially easier to detect than monotone, context-dependent sarcasm. Additionally, we evaluate generalization to different characters through leave-one-speaker-out (LOSO) experiments and run ablation-style transfer experiments on two speakers with similar sarcasm distributions. These findings demonstrate that effective multimodal sarcasm understanding can emerge from frozen, resource-efficient representations without large-scale fine-tuning, emphasizing the importance of modality interaction and delivery style rather than model scale.
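The "early fusion of frozen embeddings through a compact fusion head" described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the embedding dimensions, the single-linear-layer head, and all numeric values are hypothetical, and in practice the frozen embeddings would come from pretrained text and audio encoders while only the small head is trained.

```python
import math

def early_fuse(text_emb, audio_emb):
    """Early fusion: concatenate frozen unimodal embeddings into one joint vector."""
    return text_emb + audio_emb  # list concatenation

def fusion_head(fused, weights, bias):
    """Compact fusion head: one linear layer + sigmoid giving a sarcasm probability."""
    z = sum(w * x for w, x in zip(weights, fused)) + bias
    return 1.0 / (1.0 + math.exp(-z))

# Toy stand-ins for frozen encoder outputs (real embeddings would be
# high-dimensional and fixed; gradients would flow only into the head).
text_emb = [0.2, -0.1, 0.4]
audio_emb = [0.7, 0.05]

fused = early_fuse(text_emb, audio_emb)
score = fusion_head(fused, weights=[0.5, -0.3, 0.8, 0.1, 0.2], bias=-0.1)
print(len(fused), round(score, 3))  # fused dim = 3 + 2 = 5; score in (0, 1)
```

Because the encoders stay frozen, only the head's parameters are optimized, which is what keeps the approach lightweight relative to fine-tuning a large multimodal Transformer.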
Anthology ID:
2026.eacl-srw.36
Volume:
Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop)
Month:
March
Year:
2026
Address:
Rabat, Morocco
Editors:
Selene Baez Santamaria, Sai Ashish Somayajula, Atsuki Yamaguchi
Venue:
EACL
Publisher:
Association for Computational Linguistics
Pages:
483–492
URL:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.36/
Cite (ACL):
Animesh Gurjar and Nikhil Krishnaswamy. 2026. Scale Is All You Need: Analyzing Modality Interaction and Speaker Intent Without Fine-Tuning. In Proceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 4: Student Research Workshop), pages 483–492, Rabat, Morocco. Association for Computational Linguistics.
Cite (Informal):
Scale Is All You Need: Analyzing Modality Interaction and Speaker Intent Without Fine-Tuning (Gurjar & Krishnaswamy, EACL 2026)
PDF:
https://preview.aclanthology.org/ingest-eacl/2026.eacl-srw.36.pdf