Chin-Jou Li
2026
PRiSM: Benchmarking Phone Realization in Speech Models
Shikhar Bharadwaj | Chin-Jou Li | Yoonjae Kim | Kwanghee Choi | Eunjung Yeo | Ryan Soh-Eun Shim | Hanyu Zhou | Brendon Boldt | Karen Rosero | Kalvin Chang | Darsh Agrawal | Keer Xu | Chao-Han Huck Yang | Jian Zhu | Shinji Watanabe | David R. Mortensen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR systems still outperform LALMs. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability.
POWSM: A Phonetic Open Whisper-Style Speech Foundation Model
Chin-Jou Li | Kalvin Chang | Shikhar Bharadwaj | Eunjung Yeo | Kwanghee Choi | Jian Zhu | David R. Mortensen | Shinji Watanabe
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent advances in spoken language processing have led to substantial progress in phonetic tasks such as automatic speech recognition (ASR), phone recognition (PR), grapheme-to-phoneme conversion (G2P), and phoneme-to-grapheme conversion (P2G). Despite their conceptual similarity, these tasks have largely been studied in isolation, each relying on task-specific architectures and datasets. In this paper, we introduce POWSM (Phonetic Open Whisper-style Speech Model), the first unified framework capable of jointly performing multiple phone-related tasks. POWSM enables seamless conversion between audio, text (graphemes), and phones, opening up new possibilities for universal and low-resource speech processing. Our model outperforms or matches specialized PR models of similar size (Wav2Vec2Phoneme and ZIPA) while jointly supporting G2P, P2G, and ASR. Our training data, models, and code are released to foster open science.
2025
Efficient Many-Shot In-Context Learning with Dynamic Block-Sparse Attention
Emily Xiao | Chin-Jou Li | Yilin Zhang | Graham Neubig | Amanda Bertsch
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Many-shot in-context learning has recently shown promise as an alternative to finetuning, with the major advantage that the same model can be served for multiple tasks. However, this shifts the computational burden from training time to inference time, making deployment of many-shot ICL challenging to justify in practice. This cost is further increased if a custom demonstration set is retrieved for each inference example. We present Dynamic Block-Sparse Attention, an optimized method for retrieval-based many-shot in-context learning. By combining carefully designed block-sparse attention and retrieval of cached groups of demonstrations, we achieve comparable per-example latency to finetuning while maintaining on average >95% of the best method’s accuracy across strong ICL and finetuning baselines. We hope that this will further enable the deployment of many-shot ICL at scale.