StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies
Pragya Khanna, Priyanka Kommagouni, Vamshi Raghu Simha Narasinga, Anil Vuppala
Abstract
Stuttering is a complex speech disorder that challenges both ASR systems and clinical assessment. We propose a multimodal stuttering detection and classification model that integrates acoustic and linguistic features through a two-stage fusion mechanism. Fine-tuned Wav2Vec 2.0 and HuBERT extract acoustic embeddings, which are early fused with MFCC features to capture fine-grained spectral and phonetic variations, while Llama-2 embeddings from Whisper ASR transcriptions provide linguistic context. To enhance robustness against out-of-distribution speech patterns, we incorporate Retrieval-Augmented Generation for adaptive classification. Our model achieves state-of-the-art performance on SEP-28k and FluencyBank, demonstrating significant improvements in detecting challenging stuttering events. Additionally, our analysis highlights the complementary nature of acoustic and linguistic modalities, reinforcing the need for multimodal approaches in speech disorder detection.
- Anthology ID:
- 2025.ijcnlp-long.39
- Volume:
- Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics
- Month:
- December
- Year:
- 2025
- Address:
- Mumbai, India
- Editors:
- Kentaro Inui, Sakriani Sakti, Haofen Wang, Derek F. Wong, Pushpak Bhattacharyya, Biplab Banerjee, Asif Ekbal, Tanmoy Chakraborty, Dhirendra Pratap Singh
- Venues:
- IJCNLP | AACL
- Publisher:
- The Asian Federation of Natural Language Processing and The Association for Computational Linguistics
- Pages:
- 698–707
- URL:
- https://preview.aclanthology.org/old-master/2025.ijcnlp-long.39/
- Cite (ACL):
- Pragya Khanna, Priyanka Kommagouni, Vamshi Raghu Simha Narasinga, and Anil Vuppala. 2025. StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies. In Proceedings of the 14th International Joint Conference on Natural Language Processing and the 4th Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics, pages 698–707, Mumbai, India. The Asian Federation of Natural Language Processing and The Association for Computational Linguistics.
- Cite (Informal):
- StuD: A Multimodal Approach for Stuttering Detection with RAG and Fusion Strategies (Khanna et al., IJCNLP-AACL 2025)
- PDF:
- https://preview.aclanthology.org/old-master/2025.ijcnlp-long.39.pdf