Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition

Girish, Mohd Mujtaba Akhtar, Muskaan Singh


Abstract
In this work, we introduce a paralinguistic supervision paradigm for low-resource multilingual speech emotion recognition (LRM-SER) that leverages non-verbal vocalizations to exploit prosody-centric emotion cues. Unlike conventional SER systems that rely heavily on labeled verbal speech and suffer from poor cross-lingual transfer, our approach reformulates LRM-SER as non-verbal-to-verbal transfer, where supervision from a labeled non-verbal source domain is adapted to unlabeled verbal speech across multiple target languages. To this end, we propose NOVA-ARC, a geometry-aware framework that models affective structure in the Poincaré ball, discretizes paralinguistic patterns via a hyperbolic vector-quantized prosody codebook, and captures emotion intensity through a hyperbolic emotion lens. For unsupervised adaptation, NOVA-ARC performs optimal-transport-based prototype alignment between source emotion prototypes and target utterances, inducing soft supervision for unlabeled speech while being stabilized through consistency regularization. Experiments show that NOVA-ARC delivers the strongest performance under both non-verbal-to-verbal adaptation and the complementary verbal-to-verbal transfer setting, consistently outperforming Euclidean counterparts and strong SSL baselines. To the best of our knowledge, this work is the first to move beyond verbal-speech–centric supervision by introducing a non-verbal–to–verbal transfer paradigm for SER.
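The abstract names two building blocks that can be illustrated generically: distances in the Poincaré ball and optimal-transport alignment between source prototypes and target points that yields soft labels. The following is a minimal sketch of those two ideas only, not the authors' NOVA-ARC implementation. It assumes unit-curvature Poincaré geometry, uniform mass on targets and prototypes, and a plain Sinkhorn iteration. The function names, the toy 2-D points, and the `epsilon` and `n_iters` values are all hypothetical, and the prosody codebook, emotion lens, and consistency regularization are omitted.

```python
import math

def poincare_distance(u, v):
    """Geodesic distance between two points strictly inside the unit
    Poincaré ball (curvature -1)."""
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    arg = 1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v))
    return math.acosh(arg)

def sinkhorn_soft_labels(targets, prototypes, epsilon=0.1, n_iters=200):
    """Entropy-regularized optimal transport from targets to prototypes.

    Assumption: both sides carry uniform mass. Returns one soft label
    distribution over prototypes per target (each row sums to 1).
    """
    n, k = len(targets), len(prototypes)
    # Cost is the hyperbolic distance between each target and prototype.
    kernel = [[math.exp(-poincare_distance(t, p) / epsilon) for p in prototypes]
              for t in targets]
    a = [1.0 / n] * n  # target marginal
    b = [1.0 / k] * k  # prototype marginal
    u = [1.0] * n
    v = [1.0] * k
    for _ in range(n_iters):
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(k)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(k)]
    plan = [[u[i] * kernel[i][j] * v[j] for j in range(k)] for i in range(n)]
    # Normalize each row of the transport plan into a soft label.
    return [[p / sum(row) for p in row] for row in plan]

# Toy data (hypothetical): two emotion prototypes and four unlabeled targets.
prototypes = [(0.5, 0.0), (-0.5, 0.0)]
targets = [(0.4, 0.1), (-0.45, -0.05), (0.3, -0.2), (-0.35, 0.15)]
soft_labels = sinkhorn_soft_labels(targets, prototypes)
hard_labels = [row.index(max(row)) for row in soft_labels]
```

In this toy setup each target receives most of its mass from the nearer prototype, so the soft labels can stand in for supervision on unlabeled points. The uniform prototype marginal encodes a balanced-class assumption, which may not hold for real emotion data.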
Anthology ID:
2026.acl-long.1940
Volume:
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Month:
July
Year:
2026
Address:
San Diego, California, United States
Editors:
Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
Venue:
ACL
Publisher:
Association for Computational Linguistics
Pages:
41881–41893
URL:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1940/
Cite (ACL):
Girish, Mohd Mujtaba Akhtar, and Muskaan Singh. 2026. Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition. In Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 41881–41893, San Diego, California, United States. Association for Computational Linguistics.
Cite (Informal):
Prosody as Supervision: Bridging the Non-Verbal–Verbal for Multilingual Speech Emotion Recognition (Girish et al., ACL 2026)
PDF:
https://preview.aclanthology.org/ingest-acl/2026.acl-long.1940.pdf
Checklist:
 2026.acl-long.1940.checklist.pdf