Fuming You
2026
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
2024
Prompt-Singer: Controllable Singing-Voice-Synthesis with Natural Language Prompt
Yongqi Wang | Ruofan Hu | Rongjie Huang | Zhiqing Hong | Ruiqi Li | Wenrui Liu | Fuming You | Tao Jin | Zhou Zhao
Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)
Recent singing-voice-synthesis (SVS) methods have achieved remarkable audio quality and naturalness, yet they lack the capability to control the style attributes of the synthesized singing explicitly. We propose Prompt-Singer, the first SVS method that enables attribute controlling on singer gender, vocal range and volume with natural language. We adopt a model architecture based on a decoder-only transformer with a multi-scale hierarchy, and design a range-melody decoupled pitch representation that enables text-conditioned vocal range control while keeping melodic accuracy. Furthermore, we explore various experiment settings, including different types of text representations, text encoder fine-tuning, and introducing speech data to alleviate data scarcity, aiming to facilitate further research. Experiments show that our model achieves favorable controlling ability and audio quality. Audio samples are available at http://prompt-singer.github.io.
Text-to-Song: Towards Controllable Music Generation Incorporating Vocal and Accompaniment
Zhiqing Hong | Rongjie Huang | Xize Cheng | Yongqi Wang | Ruiqi Li | Fuming You | Zhou Zhao | Zhimeng Zhang
Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently. Little attention was paid to exploring song synthesis. In this work, we propose a novel task called Text-to-Song synthesis which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representation for controllable V2A synthesis. A Chinese song dataset mined from a music website is built to alleviate data scarcity for our research. The evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
Co-authors
- Zhiqing Hong 2
- Rongjie Huang 2
- Ruiqi Li 2
- Yongqi Wang 2
- Zhou Zhao 2
- Yi-Wen Chao 1
- Xie Chen 1
- Xize Cheng 1
- Eng Siong Chng 1
- Jianwu Dang 1
- Meng Ge 1
- Cheng Gong 1
- Hu Haifeng 1
- Nana Hou 1
- Ruofan Hu 1
- Zikang Huang 1
- Yu Jiang 1
- Tao Jin 1
- Xuanchen Li 1
- Hexin Liu 1
- Tianchi Liu 1
- Wenrui Liu 1
- Yuheng Lu 1
- Ziyang Ma 1
- Zhikang Niu 1
- Yizhou Peng 1
- Chunyu Qiang 1
- Zhongqian Sun 1
- Haoyu Wang 1
- Junyu Wang 1
- Longbiao Wang 1
- Tianrui Wang 1
- Xiaobao Wang 1
- Yang Wei 1
- Yihao Wu 1
- Guanrou Yang 1
- Yifan Yang 1
- Zhimeng Zhang 1