Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification
Mulati Kahaer, Sirajahmat Ruzmamat, XuDong Pang, Subinuer Maimaitituerxun, Zaokere Kadeer, Abudurexiti Reheman, Wenwen Lu, Panpan Zheng, Aishan Wumaier
Abstract
Speech fluency is a core indicator of second language proficiency and a critical component of Computer-Assisted Pronunciation Training (CAPT) systems. Accurate assessment requires models to perceive both macroscopic speech flow trends and microscopic local anomalies. However, existing methods struggle to bridge the semantic gap between static expert priors and dynamic temporal representations, while often overlooking the inherent ordinal nature of fluency scores. To address these challenges, we first construct a set of expert features targeting fluency disruptions and rhythmic regularity to provide explicit linguistic priors. Building on this, we propose the Multimodal Multi-Stream Fusion Classification (MMSFC) network. It employs a Mutual Cross-Attention (MCA) mechanism that leverages these expert features as “semantic anchors” to actively guide Whisper’s temporal representations and integrate decoder contexts, achieving deep interaction between global priors and local dynamics. Furthermore, we propose the Ordinal Smoothed Cross-Entropy (OSCE) loss. By constructing distance-aware soft target distributions coupled with confidence-adaptive smoothing and boundary enhancement, OSCE explicitly models ordinal relationships to resolve boundary ambiguity. Experiments on SpeechOcean762 show MMSFC achieves 83.40% accuracy, significantly outperforming strong baselines. Notably, OSCE also demonstrates superior generalization potential in cross-domain CV and NLP tasks. Our code is available at https://github.com/speech26ai/MMSFCCode.
- Anthology ID: 2026.findings-acl.1551
- Volume: Findings of the Association for Computational Linguistics: ACL 2026
- Month: July
- Year: 2026
- Address: San Diego, California, United States
- Editors: Maria Liakata, Viviane P. Moreira, Jiajun Zhang, David Jurgens
- Venue: Findings
- Publisher: Association for Computational Linguistics
- Pages: 31018–31029
- URL: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1551/
- Cite (ACL): Mulati Kahaer, Sirajahmat Ruzmamat, XuDong Pang, Subinuer Maimaitituerxun, Zaokere Kadeer, Abudurexiti Reheman, Wenwen Lu, Panpan Zheng, and Aishan Wumaier. 2026. Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification. In Findings of the Association for Computational Linguistics: ACL 2026, pages 31018–31029, San Diego, California, United States. Association for Computational Linguistics.
- Cite (Informal): Synergizing Semantic Anchors and Ordinal Smoothed Cross-Entropy for Speech Fluency Classification (Kahaer et al., Findings 2026)
- PDF: https://preview.aclanthology.org/ingest-acl/2026.findings-acl.1551.pdf
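
The abstract describes OSCE as building distance-aware soft target distributions over ordered fluency classes. The sketch below illustrates only that general idea and is not the authors' formulation: the exponential decay kernel, the temperature `tau`, and the function names are all assumptions made for illustration, and the confidence-adaptive smoothing and boundary enhancement components are omitted because the abstract does not specify them.

```python
import math


def ordinal_soft_targets(true_class, num_classes, tau=1.0):
    """Soft targets whose mass decays with ordinal distance from true_class.

    Assumption: an exponential kernel exp(-|k - y| / tau); the paper may
    use a different distance weighting.
    """
    weights = [math.exp(-abs(k - true_class) / tau) for k in range(num_classes)]
    total = sum(weights)
    return [w / total for w in weights]


def soft_cross_entropy(logits, targets):
    """Cross-entropy between a soft target distribution and softmax(logits)."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return -sum(t * (x - log_z) for t, x in zip(targets, logits))


# Four ordered fluency classes, true class 2 (hypothetical values).
targets = ordinal_soft_targets(true_class=2, num_classes=4, tau=0.5)

# A confident prediction of the adjacent class 3 versus the distant class 0.
near_miss = soft_cross_entropy([0.0, 0.0, 0.0, 3.0], targets)
far_miss = soft_cross_entropy([3.0, 0.0, 0.0, 0.0], targets)

print(near_miss < far_miss)
```

With one-hot targets both mistakes would receive the same loss. Under the soft targets the adjacent-class error is penalized less than the distant one, which is the ordinal behavior the abstract attributes to OSCE.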