Hanyu Zhou
2026
PRiSM: Benchmarking Phone Realization in Speech Models
Shikhar Bharadwaj | Chin-Jou Li | Yoonjae Kim | Kwanghee Choi | Eunjung Yeo | Ryan Soh-Eun Shim | Hanyu Zhou | Brendon Boldt | Karen Rosero | Kalvin Chang | Darsh Agrawal | Keer Xu | Chao-Han Huck Yang | Jian Zhu | Shinji Watanabe | David R. Mortensen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Phone recognition (PR) serves as the atomic interface for language-agnostic modeling for cross-lingual speech processing and phonetic analysis. Despite prolonged efforts in developing PR systems, current evaluations only measure surface-level transcription accuracy. We introduce PRiSM, the first open-source benchmark designed to expose blind spots in phonetic perception through intrinsic and extrinsic evaluation of PR systems. PRiSM standardizes transcription-based evaluation and assesses downstream utility in clinical, educational, and multilingual settings with transcription and representation probes. We find that diverse language exposure during training is key to PR performance, encoder-CTC models are the most stable, and specialized PR systems still outperform LALMs. PRiSM releases code, recipes, and datasets to move the field toward multilingual speech models with robust phonetic ability.
2024
Measuring Psychological Depth in Language Models
Fabrice Y Harel-Canada | Hanyu Zhou | Sreya Muppalla | Zeynep Senahan Yildiz | Miryung Kim | Amit Sahai | Nanyun Peng
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Evaluations of creative stories generated by large language models (LLMs) often focus on objective properties of the text, such as its style, coherence, and diversity. While these metrics are indispensable, they do not speak to a story’s subjective, psychological impact from a reader’s perspective. We introduce the Psychological Depth Scale (PDS), a novel framework rooted in literary theory that measures an LLM’s ability to produce authentic and narratively complex stories that provoke emotion, empathy, and engagement. We empirically validate our framework by showing that humans can consistently evaluate stories based on PDS (0.72 Krippendorff’s alpha). We also explore techniques for automating the PDS to easily scale future analyses. GPT-4o, combined with a novel Mixture-of-Personas (MoP) prompting strategy, achieves an average Spearman correlation of 0.51 with human judgment, while Llama-3-70B with constrained decoding scores as high as 0.68 for empathy. Finally, we compare the depth of stories authored by both humans and LLMs. Surprisingly, GPT-4 stories either surpassed or were statistically indistinguishable from highly rated human-written stories sourced from Reddit. By shifting the focus from text to reader, the Psychological Depth Scale is a validated, automated, and systematic means of measuring the capacity of LLMs to connect with humans through the stories they tell.