Yuzhe Liang
2026
MeanAudio: Fast and Faithful Text-to-Audio Generation with Mean Flows
Xiquan Li | Junxi Liu | Yuzhe Liang | Zhikang Niu | Wenxi Chen | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Recent years have witnessed remarkable progress in Text-to-Audio Generation (TTA), providing sound creators with powerful tools to transform inspirations into vivid audio. Yet despite these advances, current TTA systems often suffer from slow inference speed, which greatly hinders the efficiency and smoothness of audio creation. In this paper, we present MeanAudio, a fast and faithful text-to-audio generator capable of rendering realistic sound with only one function evaluation (1-NFE). MeanAudio leverages: (i) the MeanFlow objective with guided velocity target that significantly accelerates inference speed, (ii) an enhanced Flux-style transformer with dual text encoders for better semantic alignment and synthesis quality, and (iii) an efficient instantaneous-to-mean curriculum that speeds up convergence and enables training on consumer-grade GPUs. Through a comprehensive evaluation study, we demonstrate that MeanAudio achieves state-of-the-art performance in single-step audio generation. Specifically, it achieves a real-time factor (RTF) of 0.013 on a single NVIDIA RTX 3090, yielding a 100x speedup over SOTA diffusion-based TTA systems. Moreover, MeanAudio also shows strong performance in multi-step generation, enabling smooth transitions across successive synthesis steps.
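To make the reported real-time factor concrete, the following is a minimal sketch that assumes the usual definition of RTF (synthesis time divided by the duration of the generated audio). The function names and the 10-second clip length are illustrative assumptions and do not come from the paper.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the audio produced."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Invert the definition: how long generation takes at a given RTF."""
    return rtf * audio_seconds


REPORTED_RTF = 0.013   # value quoted in the abstract (single RTX 3090)
clip_seconds = 10.0    # hypothetical clip length, chosen for illustration

elapsed = synthesis_time(REPORTED_RTF, clip_seconds)
print(f"{elapsed:.2f} s to generate {clip_seconds:.0f} s of audio")
```

An RTF below 1 means generation is faster than playback; at 0.013, generation takes roughly 1.3% of the clip's duration.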
UniSonate: A Unified Model for Speech, Music, and Sound Effect Generation with Text Instructions
Chunyu Qiang | Xiaopeng Wang | Kang Yin | Yuzhe Liang | Yuxin Guo | Teng Ma | Ziyu Zhang | Tianrui Wang | Cheng Gong | Yushen Chen | Ruibo Fu | Longbiao Wang | Jianwu Dang
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Generative audio modeling has largely been fragmented into specialized tasks, text-to-speech (TTS), text-to-music (TTM), and text-to-audio (TTA), each operating under heterogeneous control paradigms. Unifying these modalities remains a fundamental challenge due to the intrinsic dissonance between structured semantic representations (speech/music) and unstructured acoustic textures (sound effects). In this paper, we introduce UniSonate, a unified flow-matching framework capable of synthesizing speech, music, and sound effects through a standardized, reference-free natural language instruction interface. To reconcile structural disparities, we propose a novel dynamic token injection mechanism that projects unstructured environmental sounds into a structured temporal latent space, enabling precise duration control within a phoneme-driven Multimodal Diffusion Transformer (MM-DiT). Coupled with a multi-stage curriculum learning strategy, this approach effectively mitigates cross-modal optimization conflicts. Extensive experiments demonstrate that UniSonate achieves state-of-the-art performance in instruction-based TTS (WER 1.47%) and TTM (SongEval Coherence 3.18), while maintaining competitive fidelity in TTA. Crucially, we observe positive transfer, where joint training on diverse audio data significantly enhances structural coherence and prosodic expressiveness compared to single-task baselines.
SAC: Neural Speech Codec with Semantic-Acoustic Dual-Stream Quantization
Wenxi Chen | Ruiqi Yan | Yushen Chen | Zhikang Niu | Ziyang Ma | Xiquan Li | Yuzhe Liang | Wenhanlin | Shunshun Yin | Ming Tao | Xinsheng Wang | Xie Chen
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
Speech codecs that convert continuous speech signals into discrete tokens have become essential for speech language models. However, existing codecs struggle to balance high-quality reconstruction with semantically rich representations, limiting their effectiveness in both generative and understanding tasks. In this work, we propose SAC, a neural speech codec with semantic-acoustic dual-stream quantization. By disentangling semantic and acoustic modeling into two dedicated streams, SAC enables each to be optimized for its respective role. Comprehensive evaluations show that SAC achieves strong reconstruction performance across diverse bitrates under both clean and noisy conditions, with particularly high scores on UTMOS and WER, indicating superior naturalness and intelligibility. Moreover, SAC substantially surpasses prior codecs in semantic representation, approaching the level of continuous self-supervised embeddings. When used as a tokenizer for LLM-based text-to-speech, SAC enables a single-stage autoregressive (AR) TTS model that clearly outperforms state-of-the-art AR systems. Our disentanglement analysis further validates the effectiveness of the dual-stream design, offering new potential for controllable speech generation.
2025
SLAM-Omni: Timbre-Controllable Voice Interaction System with Single-Stage Training
Wenxi Chen | Ziyang Ma | Ruiqi Yan | Yuzhe Liang | Xiquan Li | Ruiyang Xu | Zhikang Niu | Yanqiao Zhu | Yifan Yang | Zhanxun Liu | Kai Yu | Yuxuan Hu | Jinyu Li | Yan Lu | Shujie Liu | Xie Chen
Findings of the Association for Computational Linguistics: ACL 2025
Recent advancements highlight the potential of end-to-end real-time spoken dialogue systems, showcasing their low latency and high quality. In this paper, we introduce SLAM-Omni, a timbre-controllable, end-to-end voice interaction system with single-stage training. SLAM-Omni achieves zero-shot timbre control by modeling spoken language with semantic tokens and decoupling speaker information to a vocoder. By predicting grouped speech semantic tokens at each step, our method significantly reduces the sequence length of audio tokens, accelerating both training and inference. Additionally, we propose historical text prompting to compress dialogue history, facilitating efficient multi-round interactions. Comprehensive evaluations reveal that SLAM-Omni outperforms prior models of similar scale, requiring only 15 hours of training on 4 GPUs with limited data. Notably, it is the first spoken dialogue system to achieve competitive performance with a single-stage training approach, eliminating the need for pre-training on TTS or ASR tasks. Further experiments validate its multilingual and multi-turn dialogue capabilities on larger datasets.
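The effect of predicting grouped tokens on sequence length can be illustrated with a small sketch. This is not SLAM-Omni's implementation: the group size, the padding token, and the function name `group_tokens` are assumptions made for illustration. It only shows that packing G consecutive tokens into one step shortens the sequence the model must step through by about a factor of G.

```python
from typing import List


def group_tokens(tokens: List[int], group_size: int, pad_id: int = 0) -> List[List[int]]:
    """Pack consecutive tokens into fixed-size groups, padding the last group.

    Each group would be predicted in a single decoding step, so the number
    of steps drops from len(tokens) to ceil(len(tokens) / group_size).
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    # Number of pad tokens needed to reach a multiple of group_size.
    pad_count = (group_size - len(tokens) % group_size) % group_size
    padded = tokens + [pad_id] * pad_count
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


semantic_tokens = list(range(1, 11))                 # 10 hypothetical token ids
groups = group_tokens(semantic_tokens, group_size=3)  # group size is an assumption
print(len(semantic_tokens), "tokens ->", len(groups), "decoding steps")
```

Larger groups give shorter sequences, but each step must then predict more tokens at once; the abstract does not state which group size the authors use.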
Co-authors
- Wenxi Chen 3
- Xie Chen 3
- Xiquan Li 3
- Zhikang Niu 3
- Yushen Chen 2
- Ruiqi Yan 2
- Jianwu Dang 1
- Ruibo Fu 1
- Cheng Gong 1
- Yuxin Guo 1
- Yuxuan Hu 1
- Jinyu Li 1
- Zhanxun Liu 1
- Shujie Liu 1
- Junxi Liu 1
- Yan Lu 1
- Ziyang Ma 2
- Teng Ma 1
- Chunyu Qiang 1
- Ming Tao 1
- Xiaopeng Wang 1
- Tianrui Wang 1
- Longbiao Wang 1
- Xinsheng Wang 1
- Wenhanlin 1
- Ruiyang Xu 1
- Yifan Yang 1
- Kang Yin 1
- Shunshun Yin 1
- Kai Yu 1
- Ziyu Zhang 1
- Yanqiao Zhu 1