Szu-Wei Fu
2026
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
Zhen Wan | Chao-Han Huck Yang | Jinchuan Tian | Hanrong Ye | Ankita Pasad | Szu-Wei Fu | Arushi Goel | Ryo Hachiuma | Shizhe Diao | Kunal Dhawan | Sreyan Ghosh | Yusuke Hirota | Zhehuai Chen | Rafael Valle | Chenhui Chu | Shinji Watanabe | Boris Ginsburg | Yu-Chiang Frank Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. On the OpenASR leaderboard, which includes seven domain-diverse speech datasets, Speech-Hands consistently outperforms strong baselines by 12.1% WER. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
2025
NeKo: Cross-Modality Post-Recognition Error Correction with Tasks-Guided Mixture-of-Experts Language Model
Yen-Ting Lin | Zhehuai Chen | Piotr Żelasko | Zhen Wan | Xuesong Yang | Zih-Ching Chen | Krishna C Puvvada | Ke Hu | Szu-Wei Fu | Jun Wei Chiu | Jagadeesh Balam | Boris Ginsburg | Yu-Chiang Frank Wang | Chao-Han Huck Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Construction of a general-purpose post-recognition error corrector poses a crucial question: how can we most effectively train a model on a large mixture of domain datasets? The answer lies in learning dataset-specific features and digesting their knowledge in a single model. Previous methods achieve this by having separate correction language models, resulting in a significant increase in parameters. In this work, we present Mixture-of-Experts as a solution, highlighting that MoEs are much more than a scalability tool. We propose a Multi-Task Correction MoE, in which we train the experts to specialize in speech-to-text, language-to-text and vision-to-text datasets by learning to route each dataset’s tokens to its mapped expert. Experiments on the Open ASR Leaderboard show that we achieve new state-of-the-art performance, with an average relative 5.0% WER reduction and substantial improvements in BLEU scores for speech and translation tasks. On zero-shot evaluation, NeKo outperforms GPT-3.5 and Claude-3.5-Sonnet with 15.5% to 27.6% relative WER reduction on the Hyporadise benchmark. NeKo performs competitively on grammar and post-OCR correction as a multi-task model.
2024
Relevance-aware Diverse Query Generation for Out-of-domain Text Ranking
Jia-Huei Ju | Huck Chao-Han Yang | Szu-Wei Fu | Ming-Feng Tsai | Chuan-Ju Wang
Proceedings of the 9th Workshop on Representation Learning for NLP (RepL4NLP-2024)
Domain adaptation presents significant challenges for out-of-domain text ranking, especially when supervised data is limited. In this paper, we present ReadQG (Relevance-Aware Diverse Query Generation), a method to generate informative synthetic queries to facilitate the adaptation process of text ranking models. Unlike previous approaches that focus solely on relevant query generation, ReadQG generates diverse queries with continuous relevance scores. Specifically, we propose leveraging soft-prompt tuning and diverse generation objectives to control query generation according to the given relevance. Our experiments show that integrating negative queries into the learning process enhances the effectiveness of text ranking models on out-of-domain information retrieval (IR) benchmarks. Furthermore, we measure the quality of query generation, highlighting the underlying beneficial characteristics of negative queries. Our empirical results and analysis also shed light on potential directions for more advanced data augmentation in IR. The data and code have been released.
Co-authors
- Chao-Han Huck Yang 3
- Zhehuai Chen 2
- Boris Ginsburg 2
- Zhen Wan 2
- Yu-Chiang Frank Wang 2
- Jagadeesh Balam 1
- Zih-Ching Chen 1
- Jun Wei Chiu 1
- Chenhui Chu 1
- Kunal Dhawan 1
- Shizhe Diao 1
- Sreyan Ghosh 1
- Arushi Goel 1
- Ryo Hachiuma 1
- Yusuke Hirota 1
- Ke Hu 1
- Jia-Huei Ju 1
- Yen-Ting Lin 1
- Ankita Pasad 1
- Krishna C Puvvada 1
- Jinchuan Tian 1
- Ming-Feng Tsai 1
- Rafael Valle 1
- Chuan-Ju Wang 1
- Shinji Watanabe 1
- Xuesong Yang 1
- Hanrong Ye 1
- Piotr Żelasko 1