Yusuke Hirota
2026
Speech-Hands: A Self-Reflection Voice Agentic Approach to Speech Recognition and Audio Reasoning with Omni Perception
Zhen Wan | Chao-Han Huck Yang | Jinchuan Tian | Hanrong Ye | Ankita Pasad | Szu-Wei Fu | Arushi Goel | Ryo Hachiuma | Shizhe Diao | Kunal Dhawan | Sreyan Ghosh | Yusuke Hirota | Zhehuai Chen | Rafael Valle | Chenhui Chu | Shinji Watanabe | Boris Ginsburg | Yu-Chiang Frank Wang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
We introduce a voice-agentic framework that learns one critical omni-understanding skill: knowing when to trust itself versus when to consult external audio perception. Our work is motivated by a crucial yet counterintuitive finding: naively fine-tuning an omni-model on both speech recognition and external sound understanding tasks often degrades performance, as the model can be easily misled by noisy hypotheses. To address this, our framework, Speech-Hands, recasts the problem as an explicit self-reflection decision. This learnable reflection primitive proves effective in preventing the model from being derailed by flawed external candidates. We show that this agentic action mechanism generalizes naturally from speech recognition to complex, multiple-choice audio reasoning. Across the OpenASR leaderboard, which includes seven domain-diverse speech datasets, Speech-Hands consistently outperforms strong baselines by 12.1% WER. The model also achieves 77.37% accuracy and high F1 on audio QA decisions, showing robust generalization and reliability across diverse audio question answering datasets. By unifying perception and decision-making, our work offers a practical path toward more reliable and resilient audio intelligence.
2025
LOTUS: A Leaderboard for Detailed Image Captioning from Quality to Societal Bias and User Preferences
Yusuke Hirota | Boyi Li | Ryo Hachiuma | Yueh-Hua Wu | Boris Ivanovic | Marco Pavone | Yejin Choi | Yu-Chiang Frank Wang | Yuta Nakashima | Chao-Han Huck Yang
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 6: Industry Track)
Large Vision-Language Models (LVLMs) have transformed image captioning, shifting from concise captions to detailed descriptions. We introduce LOTUS, a leaderboard for evaluating detailed captions, addressing three main gaps in existing evaluations: lack of standardized criteria, bias-aware assessments, and user preference considerations. LOTUS comprehensively evaluates various aspects, including caption quality (e.g., alignment, descriptiveness), risks (e.g., hallucination), and societal biases (e.g., gender bias) while enabling preference-oriented evaluations by tailoring criteria to diverse user preferences. Our analysis of recent LVLMs reveals no single model excels across all criteria, while correlations emerge between caption detail and bias risks. Preference-oriented evaluations demonstrate that optimal model selection depends on user priorities.
2024
From Descriptive Richness to Bias: Unveiling the Dark Side of Generative Image Caption Enrichment
Yusuke Hirota | Ryo Hachiuma | Chao-Han Huck Yang | Yuta Nakashima
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
Large language models (LLMs) have enhanced the capacity of vision-language models to caption visual text. This generative approach to image caption enrichment further makes textual captions more descriptive, improving alignment with the visual context. However, while many studies focus on the benefits of generative caption enrichment (GCE), are there any negative side effects? We compare standard-format captions and recent GCE processes from the perspectives of gender bias and hallucination, showing that enriched captions suffer from increased gender bias and hallucination. Furthermore, models trained on these enriched captions amplify gender bias by an average of 30.9% and increase hallucination by 59.5%. This study serves as a caution against the trend of making captions more descriptive.
Resampled Datasets Are Not Enough: Mitigating Societal Bias Beyond Single Attributes
Yusuke Hirota | Jerone Andrews | Dora Zhao | Orestis Papakyriakopoulos | Apostolos Modas | Yuta Nakashima | Alice Xiang
Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing
We tackle societal bias in image-text datasets by removing spurious correlations between protected groups and image attributes. Traditional methods only target labeled attributes, ignoring biases from unlabeled ones. Using text-guided inpainting models, our approach ensures protected group independence from all attributes and mitigates inpainting biases through data filtering. Evaluations on multi-label image classification and image captioning tasks show our method effectively reduces bias without compromising performance across various models. Specifically, we achieve an average societal bias reduction of 46.1% in leakage-based bias metrics for multi-label classification and 74.8% for image captioning.
Co-authors
- Ryo Hachiuma 3
- Yuta Nakashima 3
- Chao-Han Huck Yang 3
- Yu-Chiang Frank Wang 2
- Jerone Andrews 1
- Zhehuai Chen 1
- Yejin Choi 1
- Chenhui Chu 1
- Kunal Dhawan 1
- Shizhe Diao 1
- Szu-Wei Fu 1
- Sreyan Ghosh 1
- Boris Ginsburg 1
- Arushi Goel 1
- Boris Ivanovic 1
- Boyi Li 1
- Apostolos Modas 1
- Orestis Papakyriakopoulos 1
- Ankita Pasad 1
- Marco Pavone 1
- Jinchuan Tian 1
- Rafael Valle 1
- Zhen Wan 1
- Shinji Watanabe 1
- Yueh-Hua Wu 1
- Alice Xiang 1
- Hanrong Ye 1
- Dora Zhao 1