Tianrui Wang
2026
Towards Fine-Grained and Multi-Granular Contrastive Language-Speech Pre-training
Yifan Yang | Bing Han | Hui Wang | Wei Wang | Ziyang Ma | Long Zhou | Zengrui Jin | Guanrou Yang | Tianrui Wang | Xu Tan | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Yifan Yang | Bing Han | Hui Wang | Wei Wang | Ziyang Ma | Long Zhou | Zengrui Jin | Guanrou Yang | Tianrui Wang | Xu Tan | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Modeling fine-grained speaking styles remains challenging for language-speech representation pre-training, as existing speech-text models are typically trained with coarse captions or task-specific supervision, and scalable fine-grained style annotations are unavailable. We present FCaps, a large-scale dataset with fine-grained free-text style descriptions, encompassing 47k hours of speech and 19M fine-grained captions annotated via a novel end-to-end pipeline that directly grounds detailed captions in audio, thereby avoiding the error propagation caused by LLM-based rewriting in existing cascaded pipelines. Evaluations using LLM-as-a-judge demonstrate that our annotations surpass existing cascaded annotations in terms of correctness, coverage, and naturalness. Building on FCaps, we propose CLSP, a contrastive language-speech pre-trained model that integrates global and fine-grained supervision, enabling unified representations across multiple granularities. Extensive experiments demonstrate that CLSP learns fine-grained and multi-granular speech-text representations that perform reliably across global and fine-grained speech-text retrieval, zero-shot paralinguistic classification, and speech style similarity scoring, with strong alignment to human judgments. Code and dataset are publicly available at https://github.com/yfyeung/CLSP.
Evaluating the Expressive Appropriateness of Speech in Rich Contexts
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Tianrui Wang | Ziyang Ma | Yizhou Peng | Haoyu Wang | Zhikang Niu | Zikang Huang | Yihao Wu | Yi-Wen Chao | Yu Jiang | Yuheng Lu | Guanrou Yang | Xuanchen Li | Hexin Liu | Chunyu Qiang | Cheng Gong | Yifan Yang | Tianchi Liu | Junyu Wang | Nana Hou | Meng Ge | Fuming You | Yang Wei | Zhongqian Sun | Hu Haifeng | Xiaobao Wang | Eng Siong Chng | Xie Chen | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Evaluating expressive speech remains challenging, as existing methods mainly assess emotional intensity and overlook whether a speech sample is expressively appropriate for its contextual setting. This limitation hinders reliable evaluation of speech systems used in narrative-driven and interactive applications, such as audiobooks and conversational agents. We introduce CEAEval, a Context-rich framework for Evaluating Expressive Appropriateness in speech, which assesses whether a speech sample expressively aligns with the underlying communicative intent implied by its discourse-level narrative context. To support this task, we construct CEAEval-D, the first context-rich speech dataset with real human performances in Mandarin conversational speech, providing narrative descriptions together with fifteen dimensions of human annotations covering expressive attributes and expressive appropriateness. We further develop CEAEval-M, a model that integrates knowledge distillation, planner-based multi-model collaboration, adaptive audio attention bias, and reinforcement learning to perform context-rich expressive appropriateness evaluation. Experiments on a human-annotated test set demonstrate that CEAEval-M substantially outperforms existing speech evaluation and analysis systems.
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
Chunyu Qiang | Xiaopeng Wang | Kang Yin | Yuzhe Liang | Yuxin Guo | Teng Ma | Ziyu Zhang | Tianrui Wang | Cheng Gong | Yushen Chen | Ruibo Fu | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Chunyu Qiang | Xiaopeng Wang | Kang Yin | Yuzhe Liang | Yuxin Guo | Teng Ma | Ziyu Zhang | Tianrui Wang | Cheng Gong | Yushen Chen | Ruibo Fu | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines.
Search
Fix author
Co-authors
- Xie Chen 2
- Jianwu Dang 2
- Cheng Gong 2
- Ziyang Ma 2
- Chunyu Qiang 2
- Longbiao Wang 2
- Yifan Yang 2
- Guanrou Yang 2
- Yi-Wen Chao 1
- Yushen Chen 1
- Eng Siong Chng 1
- Ruibo Fu 1
- Meng Ge 1
- Yuxin Guo 1
- Hu Haifeng 1
- Bing Han 1
- Nana Hou 1
- Zikang Huang 1
- Yu Jiang 1
- Zengrui Jin 1
- Xuanchen Li 1
- Yuzhe Liang 1
- Hexin Liu 1
- Tianchi Liu 1
- Yuheng Lu 1
- Teng Ma 1
- Zhikang Niu 1
- Yizhou Peng 1
- Zhongqian Sun 1
- Xu Tan 1
- Hui Wang 1
- Wei Wang 1
- Haoyu Wang 1
- Junyu Wang 1
- Xiaobao Wang 1
- Xiaopeng Wang 1
- Yang Wei 1
- Yihao Wu 1
- Kang Yin 1
- Fuming You 1
- Ziyu Zhang 1
- Long Zhou 1
Venues
- ACL3