Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment
Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, David R Mortensen
Abstract
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
- Anthology ID:
- 2025.naacl-long.132
- Volume:
- Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
- Month:
- April
- Year:
- 2025
- Address:
- Albuquerque, New Mexico
- Editors:
- Luis Chiruzzo, Alan Ritter, Lu Wang
- Venue:
- NAACL
- Publisher:
- Association for Computational Linguistics
- Pages:
- 2613–2628
- URL:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.132/
- Cite (ACL):
- Kwanghee Choi, Eunjung Yeo, Kalvin Chang, Shinji Watanabe, and David R Mortensen. 2025. Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 2613–2628, Albuquerque, New Mexico. Association for Computational Linguistics.
- Cite (Informal):
- Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment (Choi et al., NAACL 2025)
- PDF:
- https://preview.aclanthology.org/fix-sig-urls/2025.naacl-long.132.pdf