Kwanghee Choi
2026
PRiSM: Benchmarking Phone Realization in Speech Models
Shikhar Bharadwaj | Chin-Jou Li | Yoonjae Kim | Kwanghee Choi | Eunjung Yeo | Ryan Soh-Eun Shim | Hanyu Zhou | Brendon Boldt | Karen Rosero | Kalvin Chang | Darsh Agrawal | Keer Xu | Chao-Han Huck Yang | Jian Zhu | Shinji Watanabe | David R. Mortensen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR systems still outperform LALMs. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability.
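As a rough illustration of what transcription-based evaluation of a PR system can look like, the sketch below computes a phone error rate: the Levenshtein distance between reference and hypothesis phone sequences, divided by the reference length. This is an assumption for illustration only. The abstract does not name PRiSM's metrics, and the function names `edit_distance` and `phone_error_rate` are hypothetical, not part of the released code.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of phone symbols."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference phones to reach an empty hypothesis
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def phone_error_rate(ref, hyp):
    """Edit distance normalized by reference length (hypothetical helper)."""
    if not ref:
        raise ValueError("reference must contain at least one phone")
    return edit_distance(ref, hyp) / len(ref)


# Illustrative phone sequences, not taken from any PRiSM dataset.
reference = ["k", "æ", "t"]
hypothesis = ["k", "a", "t"]
per = phone_error_rate(reference, hypothesis)
```

A score of this kind only captures surface-level accuracy, which is the limitation the abstract says PRiSM addresses with its additional downstream probes.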
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Chin-Jou Li | Kalvin Chang | Shikhar Bharadwaj | Eunjung Yeo | Kwanghee Choi | Jian Zhu | David R. Mortensen | Shinji Watanabe
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, models, and code are released to foster open science.
2025
Leveraging Allophony in Self-Supervised Speech Models for Atypical Pronunciation Assessment
Kwanghee Choi | Eunjung Yeo | Kalvin Chang | Shinji Watanabe | David R Mortensen
Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Allophony refers to the variation in the phonetic realization of a phoneme based on its phonetic environment. Modeling allophones is crucial for atypical pronunciation assessment, which involves distinguishing atypical from typical pronunciations. However, recent phoneme classifier-based approaches often simplify this by treating various realizations as a single phoneme, bypassing the complexity of modeling allophonic variation. Motivated by the acoustic modeling capabilities of frozen self-supervised speech model (S3M) features, we propose MixGoP, a novel approach that leverages Gaussian mixture models to model phoneme distributions with multiple subclusters. Our experiments show that MixGoP achieves state-of-the-art performance across four out of five datasets, including dysarthric and non-native speech. Our analysis further suggests that S3M features capture allophonic variation more effectively than MFCCs and Mel spectrograms, highlighting the benefits of integrating MixGoP with S3M features.
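The idea of modeling one phoneme with several Gaussian subclusters can be sketched as follows. This is a minimal, simplified illustration and not the authors' MixGoP implementation: it uses one-dimensional features and hand-set mixture parameters, whereas the abstract describes Gaussian mixture models over frozen S3M features. The names `log_gaussian` and `mixture_log_likelihood`, and all numeric values, are assumptions.

```python
import math


def log_gaussian(x, mean, var):
    """Log density of a 1-D Gaussian with the given mean and variance."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def mixture_log_likelihood(x, components):
    """Log-likelihood of x under a mixture given as (weight, mean, var) tuples."""
    terms = [math.log(w) + log_gaussian(x, m, v) for w, m, v in components]
    top = max(terms)
    # Log-sum-exp keeps the sum stable when individual terms are very small.
    return top + math.log(sum(math.exp(t - top) for t in terms))


# Hypothetical phoneme with two allophone-like subclusters (illustrative values).
phoneme_model = [(0.5, -2.0, 0.25), (0.5, 2.0, 0.25)]

near_subcluster = mixture_log_likelihood(2.0, phoneme_model)   # close to one subcluster
between_clusters = mixture_log_likelihood(0.0, phoneme_model)  # far from both
```

Under this toy model, a feature close to either subcluster receives a higher likelihood than one lying between them, whereas a single Gaussian fit to the same phoneme would place its peak between the two realizations.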
2024
Wav2Gloss: Generating Interlinear Glossed Text from Speech
Taiqi He | Kwanghee Choi | Lindia Tjuatja | Nathaniel Robinson | Jiatong Shi | Shinji Watanabe | Graham Neubig | David Mortensen | Lori Levin
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Thousands of the world’s languages are in danger of extinction—a tremendous threat to cultural identities and human language diversity. Interlinear Glossed Text (IGT) is a form of linguistic annotation that can support documentation and resource creation for these languages’ communities. IGT typically consists of (1) transcriptions, (2) morphological segmentation, (3) glosses, and (4) free translations to a majority language. We propose Wav2Gloss: a task in which these four annotation components are extracted automatically from speech, and introduce the first dataset to this end, Fieldwork: a corpus of speech with all these annotations, derived from the work of field linguists, covering 37 languages, with standard formatting, and train/dev/test splits. We provide various baselines to lay the groundwork for future research on IGT generation from speech, such as end-to-end versus cascaded, monolingual versus multilingual, and single-task versus multi-task approaches.