Ming-Hao Hsu
2026
Linear Script Representations in Speech Foundation Models Enable Zero-Shot Transliteration
Ryan Soh-Eun Shim | Kwanghee Choi | Kalvin Chang | Ming-Hao Hsu | Florian Eichin | Zhizheng Wu | Alane Suhr | Michael A. Hedderich | David Harwath | David R. Mortensen | Barbara Plank
Findings of the Association for Computational Linguistics: ACL 2026
Multilingual speech foundation models such as Whisper are trained on web-scale data, where the data for each language spans a myriad of regional varieties. However, different regional varieties often use different scripts to write the same language, which makes the script of speech recognition output non-deterministic as well. To mitigate this problem, we show that script is linearly encoded in the activation space of multilingual speech models, and that modifying activations at inference time enables direct control over the output script. We find that adding such script vectors to activations at test time can induce script change even in unconventional language-script pairings (e.g., Italian in Cyrillic and Japanese in Latin script). We apply this approach to induce post-hoc control over the script of speech recognition output, where we observe competitive performance across all model sizes of Whisper.
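The idea of adding a script vector to activations can be illustrated with a small NumPy sketch. This is not the authors' implementation: the abstract does not say how script vectors are derived, so the difference-of-means construction below is an assumption, and the names `script_vector`, `steer`, and `alpha`, as well as the synthetic "activations", are hypothetical stand-ins for real Whisper hidden states.

```python
import numpy as np

def script_vector(acts_source, acts_target):
    """Assumed construction: difference of mean activations between
    examples transcribed in the target script and in the source script."""
    return acts_target.mean(axis=0) - acts_source.mean(axis=0)

def steer(hidden, vector, alpha=1.0):
    """Add the scaled script vector to the hidden state at every position."""
    return hidden + alpha * vector

# Synthetic stand-in data: two clouds of activations that differ
# along a single direction (dimension 0).
rng = np.random.default_rng(0)
dim = 8
direction = np.zeros(dim)
direction[0] = 3.0
latin_acts = rng.normal(size=(100, dim))
cyrillic_acts = rng.normal(size=(100, dim)) + direction

v = script_vector(latin_acts, cyrillic_acts)

# Hypothetical hidden states for 5 positions at inference time.
hidden = rng.normal(size=(5, dim))
steered = steer(hidden, v, alpha=1.0)
```

In an actual model, the addition would presumably be applied to a chosen layer's activations during the forward pass; which layer and what scale work best is not stated in the abstract.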
2022
Controllable User Dialogue Act Augmentation for Dialogue State Tracking
Chun-Mao Lai | Ming-Hao Hsu | Chao-Wei Huang | Yun-Nung Chen
Proceedings of the 23rd Annual Meeting of the Special Interest Group on Discourse and Dialogue
Prior work has demonstrated that data augmentation is useful for improving dialogue state tracking. However, user utterances come in many types, and the prior method considered only the simplest one for augmentation, raising concerns about poor generalization. To better cover diverse dialogue acts and control generation quality, this paper proposes controllable user dialogue act augmentation (CUDA-DST), which augments user utterances with diverse behaviors. With the augmented data, different state trackers improve and show better robustness, achieving state-of-the-art performance on MultiWOZ 2.1.