Yifan Yang
2026
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
SpeechLLM-as-Judges: Towards General and Interpretable Speech Quality Evaluation
Hui Wang | Jinghua Zhao | Yifan Yang | Shujie Liu | Junyang Chen | Yanzhe Zhang | Shiwan Zhao | Jinyu Li | Jiaming Zhou | Haoqin Sun | Yan Lu | Yong Qin
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative speech technologies are progressing rapidly, but evaluating the perceptual quality of synthetic speech remains a core challenge. Existing methods typically rely on scalar scores or binary decisions, which lack interpretability and generalization across tasks and languages. We present SpeechLLM-as-Judges, a new paradigm for enabling large language models (LLMs) to conduct structured and explanation-based speech quality evaluation. To support this direction, we introduce SpeechEval, a large-scale dataset containing 32,207 multilingual speech clips and 128,754 annotations spanning four tasks: quality assessment, pairwise comparison, improvement suggestion, and deepfake detection. Based on this resource, we develop SQ-LLM, a speech-quality-aware LLM trained with chain-of-thought reasoning and reward optimization to improve capability. Experimental results show that SQ-LLM delivers strong performance across tasks and languages, revealing the potential of this paradigm for advancing speech quality evaluation. The relevant code, models, and data are publicly available at https://github.com/NKU-HLT/SpeechLLM-as-Judges.
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
Yifan Yang | Bing Han | Hui Wang | Wei Wang | Ziyang Ma | Long Zhou | Zengrui Jin | Guanrou Yang | Tianrui Wang | Xu Tan | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.
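As a rough illustration of what contrastive language-speech pre-training optimizes, the following is a minimal sketch of a generic symmetric contrastive (CLIP-style InfoNCE) objective over paired speech and text embeddings. It is not the CLSP loss itself, which the abstract says integrates global and fine-grained supervision; the function names, the toy embeddings, and the temperature of 0.07 are assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one row of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(speech_embs, text_embs, temperature=0.07):
    """Generic CLIP-style loss: item i in each list is assumed to be a matched pair.

    The temperature value is an assumption for illustration only.
    """
    speech = [l2_normalize(v) for v in speech_embs]
    text = [l2_normalize(v) for v in text_embs]
    n = len(speech)
    # Similarity matrix: rows are speech clips, columns are captions.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in text]
        for s in speech
    ]
    # Speech-to-text direction: each clip should pick its own caption.
    s2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-speech direction: each caption should pick its own clip.
    logits_t = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2s = -sum(log_softmax(logits_t[j])[j] for j in range(n)) / n
    return 0.5 * (s2t + t2s)


# Toy check: aligned pairs should score a lower loss than shuffled pairs.
aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the loss is close to zero when every clip is most similar to its own caption and grows when the captions are swapped, which is the behavior retrieval-style evaluation relies on.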
2025
GigaSpeech 2: An Evolving, Large-Scale and Multi-domain ASR Corpus for Low-Resource Languages with Automated Crawling, Transcription and Refinement
Yifan Yang | Zheshu Song | Jianheng Zhuo | Mingyu Cui | Jinpeng Li | Bo Yang | Yexing Du | Ziyang Ma | Xunying Liu | Ziyuan Wang | Ke Li | Shuai Fan | Kai Yu | Wei-Qiang Zhang | Guoguo Chen | Xie Chen
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
The evolution of speech technology has been spurred by the rapid increase in dataset sizes. Traditional speech models generally depend on a large amount of labeled training data, which is scarce for low-resource languages. This paper presents GigaSpeech 2, a large-scale, multi-domain, multilingual speech recognition corpus. It is designed for low-resource languages and does not rely on paired speech and text data. GigaSpeech 2 comprises about 30,000 hours of automatically transcribed speech, including Thai, Indonesian, and Vietnamese, gathered from unlabeled YouTube videos. We also introduce an automated pipeline for data crawling, transcription, and label refinement. Specifically, this pipeline involves Whisper for initial transcription, MMS for forced alignment, and multi-dimensional filtering for data quality assurance. A modified Noisy Student Training is developed to further refine flawed pseudo labels iteratively, thereby enhancing model performance. Experimental results on our manually transcribed evaluation set and two public test sets from Common Voice and FLEURS confirm our corpus’s high quality and broad applicability. Notably, ASR models trained on GigaSpeech 2 can reduce the word error rate for Thai, Indonesian, and Vietnamese on our challenging and realistic YouTube test set by 25% to 40% compared to Whisper large-v3, with merely 10% model parameters. Furthermore, our ASR models trained on GigaSpeech 2 yield superior performance compared to commercial services. We hope that our newly introduced corpus and pipeline will open a new avenue for low-resource speech recognition and significantly facilitate research in this area.
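The reported 25% to 40% figures are relative reductions in word error rate (WER). A minimal sketch of how WER and a relative reduction are conventionally computed is given below; the whitespace tokenization is a simplifying assumption (Thai, for example, is not written with spaces between words), and the function names and sample numbers are hypothetical rather than taken from the paper.

```python
def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance divided by the number of reference words.

    Whitespace tokenization is an assumption made for this sketch.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Levenshtein distance with a rolling row: prev[j] is the distance
    # between the reference prefix processed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline_wer, new_wer):
    """Fraction by which new_wer improves on baseline_wer."""
    return (baseline_wer - new_wer) / baseline_wer


# One substitution plus one deletion against four reference words.
wer = word_error_rate("a b c d", "a x c")
# Hypothetical numbers: going from 20% WER to 15% WER is a 25% relative reduction.
gain = relative_reduction(0.20, 0.15)
```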
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Wenxi Chen | Ziyang Ma | Ruiqi Yan | Yuzhe Liang | Xiquan Li | Ruiyang Xu | Zhikang Niu | Yanqiao Zhu | Yifan Yang | Zhanxun Liu | Kai Yu | Yuxuan Hu | Jinyu Li | Yan Lu | Shujie Liu | Xie Chen
Findings of the Association for Computational Linguistics: ACL 2025
Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
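To illustrate why predicting grouped tokens shortens the sequence, here is a minimal sketch that packs a flat stream of semantic token ids into fixed-size groups, so a model would take one step per group instead of one per token. The group size, the padding id, and the function name are assumptions for this example; the abstract does not specify how SLAM-Omni forms or pads its groups.

```python
def group_tokens(tokens, group_size, pad_id=-1):
    """Pack a flat token sequence into consecutive groups of group_size.

    The final group is padded with pad_id (a hypothetical padding id)
    when the sequence length is not a multiple of group_size.
    """
    if group_size <= 0:
        raise ValueError("group_size must be positive")
    tokens = list(tokens)
    padded = tokens + [pad_id] * (-len(tokens) % group_size)
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


# Ten tokens with a group size of 3 need 4 prediction steps instead of 10.
groups = group_tokens(range(10), 3)
```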
Making LLMs Better Many-to-Many Speech-to-Text Translators with Curriculum Learning
Yexing Du | Youcheng Pan | Ziyang Ma | Bo Yang | Yifan Yang | Keqi Deng | Xie Chen | Yang Xiang | Ming Liu | Bing Qin
Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Multimodal Large Language Models (MLLMs) have achieved significant success in Speech-to-Text Translation (S2TT) tasks. While most existing research has focused on English-centric translation directions, the exploration of many-to-many translation is still limited by the scarcity of parallel data. To address this, we propose a three-stage curriculum learning strategy that leverages the machine translation capabilities of large language models and adapts them to S2TT tasks, enabling effective learning in low-resource settings. We trained MLLMs with varying parameter sizes (3B, 7B, and 32B) and evaluated the proposed strategy using the FLEURS and CoVoST-2 datasets. Experimental results show that the proposed strategy achieves state-of-the-art average performance in 15×14 language pairs, requiring fewer than 10 hours of speech data per language to achieve competitive results. The source code and models are released at https://github.com/yxduir/LLM-SRT.
Co-authors
- Xie Chen 5
- Ziyang Ma 3
- Yexing Du 2
- Jinyu Li 2
- Shujie Liu 2
- Yan Lu 2
- Ziyang Ma 2
- Zhikang Niu 2
- Tianrui Wang 2
- Hui Wang 2
- Bo Yang 2
- Guanrou Yang 2
- Kai Yu 2
- Yi-Wen Chao 1
- Guoguo Chen 1
- Wenxi Chen 1
- Junyang Chen 1
- Eng Siong Chng 1
- Mingyu Cui 1
- Jianwu Dang 1
- Keqi Deng 1
- Shuai Fan 1
- Meng Ge 1
- Cheng Gong 1
- Hu Haifeng 1
- Bing Han 1
- Nana Hou 1
- Yuxuan Hu 1
- Zikang Huang 1
- Yu Jiang 1
- Zengrui Jin 1
- Jinpeng Li 1
- Ke Li 1
- Xiquan Li 1
- Xuanchen Li 1
- Yuzhe Liang 1
- Xunying Liu 1
- Zhanxun Liu 1
- Hexin Liu 1
- Tianchi Liu 1
- Ming Liu 1
- Yuheng Lu 1
- Youcheng Pan 1
- Yizhou Peng 1
- Chunyu Qiang 1
- Yong Qin 1
- Bing Qin (秦兵) 1
- Zheshu Song 1
- Zhongqian Sun 1
- Haoqin Sun 1
- Xu Tan 1
- Ziyuan Wang 1
- Haoyu Wang 1
- Junyu Wang 1
- Xiaobao Wang 1
- Longbiao Wang 1
- Wei Wang 1
- Yang Wei 1
- Yihao Wu 1
- Yang Xiang 1
- Ruiyang Xu 1
- Ruiqi Yan 1
- Fuming You 1
- Wei-Qiang Zhang 1
- Yanzhe Zhang 1
- Jinghua Zhao 1
- Shiwan Zhao 1
- Jiaming Zhou 1
- Long Zhou 1
- Yanqiao Zhu 1
- Jianheng Zhuo 1