Tony Woo
2026
WoW-Bench: Evaluating Fine-Grained Acoustic Perception in Audio-Language Models via Marine Mammal Vocalizations
Jaeyeon Kim | Heeseung Yun | Tony Woo | Chao-Han Huck Yang | Gunhee Kim
Findings of the Association for Computational Linguistics: ACL 2026
Large audio-language models (LALMs) extend language understanding into the auditory domain, yet their ability to perform low-level listening, such as pitch and duration detection, remains underexplored. Low-level listening is nonetheless critical for real-world, out-of-distribution tasks where models must reason about unfamiliar sounds based on fine-grained acoustic cues. To address this gap, we introduce the World-of-Whale benchmark (WoW-Bench) to evaluate low-level auditory perception and cognition using marine mammal vocalizations. We use marine mammal vocalizations as out-of-distribution sound events so that low-level listening can be assessed without models relying on prior knowledge of the sound events. WoW-Bench is composed of a Perception benchmark for categorizing novel sounds and a Cognition benchmark, inspired by Bloom’s taxonomy, to assess the abilities to remember, understand, apply, and analyze sound events. For the Cognition benchmark, we additionally introduce distractor questions to evaluate whether models are truly solving problems through listening rather than relying on other heuristics. Experiments with state-of-the-art LALMs show performance far below human levels, indicating a need for stronger auditory grounding in LALMs.
2025
Think, Verbalize, then Speak: Bridging Complex Thoughts and Comprehensible Speech
Tony Woo | Sehun Lee | Kang-wook Kim | Gunhee Kim
Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing
Spoken dialogue systems increasingly employ large language models (LLMs) to leverage their advanced reasoning capabilities. However, direct application of LLMs in spoken communication often yields suboptimal results due to mismatches between optimal textual and verbal delivery. While existing approaches adapt LLMs to produce speech-friendly outputs, their impact on reasoning performance remains underexplored. In this work, we propose **Think-Verbalize-Speak**, a framework that decouples reasoning from spoken delivery to preserve the full reasoning capacity of LLMs. Central to our method is *verbalizing*, an intermediate step that translates thoughts into natural, speech-ready text. We also introduce **ReVerT**, a latency-efficient verbalizer based on incremental and asynchronous summarization. Experiments across multiple benchmarks show that our method enhances speech naturalness and conciseness with minimal impact on reasoning. The project page with the dataset and the source code is available at https://yhytoto12.github.io/TVS-ReVerT.